Activation Functions in Neural Networks
When our brain is fed a lot of information simultaneously, it tries to sort that information into “useful” and “not-so-useful” pieces. Neural networks need a similar mechanism for classifying incoming information as “useful” or “less useful”.
This matters for how a network learns, because not all information is equally useful; some of it is just noise. This is where activation functions come into the picture: they help the network use the important information and suppress the irrelevant data points.
Activation functions are mathematical functions that determine the output of a neuron. A function is attached to each neuron in the network and determines whether that neuron should be activated (“fired”) or not, based on whether the neuron’s input is relevant for the model’s prediction.
Neural network
Neural networks are a class of machine learning algorithms used to model complex patterns in datasets using multiple hidden layers and non-linear activation functions. A neural network takes an input, passes it through multiple layers of hidden neurons (mini-functions with unique coefficients that must be learned), and outputs a prediction representing the combined output of all the neurons.
A neuron takes a group of weighted inputs, applies an activation function, and returns an output.
Forward propagation
Forward propagation is how neural networks make predictions. Input data is “forward propagated” through the network, layer by layer, until the final layer outputs a prediction.
Steps
- Calculate the weighted input to the hidden layer by multiplying the input X by the hidden weight W and adding the bias b
- Apply the activation function and pass the result to the final layer
- Repeat step 2, except this time X is replaced by the hidden layer’s output, Z (a minimal sketch of these steps follows below)
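Here is a minimal sketch of these steps in Python, assuming NumPy arrays for the input X, the weights Wh and Wo, and the biases Bh and Bo (the names are illustrative, not from any particular library), and using the ReLU activation introduced later in this article:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def feed_forward(X, Wh, Bh, Wo, Bo):
    # Step 1: weighted input to the hidden layer
    Zh = np.dot(X, Wh) + Bh
    # Step 2: apply the activation function
    H = relu(Zh)
    # Step 3: repeat with the hidden layer's output in place of X
    Zo = np.dot(H, Wo) + Bo
    return Zo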
Weights
Weights are values that control the strength of the connection between two neurons. That is, inputs are typically multiplied by weights, and that defines how much influence the input will have on the output.
Bias
Bias terms are additional constants attached to neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin.
Weighted input
A neuron’s input equals the sum of weighted outputs from all neurons in the previous layer. Each input is multiplied by the weight associated with the synapse connecting the input to the current neuron.
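For example, for a neuron with two inputs, two weights, and a bias, the weighted input can be computed as follows (the numbers are made up purely for illustration):
x1, x2 = 0.5, -1.0     # inputs from the previous layer
w1, w2 = 0.8, 0.2      # connection weights
b = 0.1                # bias term
z = w1*x1 + w2*x2 + b  # weighted input, 0.3 (up to floating-point rounding)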
Activation Functions
Activation functions live inside neural network layers and modify the data they receive before passing it to the next layer. Activation functions give neural networks their power — allowing them to model complex non-linear relationships.
The activation function is one of the most important factors in a neural network: it decides whether or not a neuron will be activated and its output passed on to the next layer. In other words, it decides whether the neuron’s input is relevant for the network’s prediction. For this reason, it is also referred to as the threshold or transformation for the neurons, and it is what helps the network converge.
Let us go through these activation functions and learn how they work:
1-Binary Step Function
A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it is not activated.
Function:
f(z) = 1 if z >= 0, else 0
This is the simplest activation function and can be implemented with a single if-else condition in Python:
def binary_step(z):
    if z < 0:
        return 0
    else:
        return 1
The output is either 0 or 1.
Here, we set the threshold value to 0. It is very simple and useful for binary classification problems.
The binary step function can be used as an activation function while creating a binary classifier. As you can imagine, this function will not be useful when there are multiple classes in the target variable. That is one of the limitations of the binary step function.
Derivative:
Moreover, the gradient of the step function is zero, which hinders the backpropagation process. That is, if you calculate the derivative of f(z) with respect to z, it comes out to be 0 everywhere (and is undefined at z = 0).
f'(z) = 0, for all z
The problem with a step function is that it does not allow multi-value outputs; for example, it cannot support classifying the inputs into one of several categories.
2-Linear
A linear activation function takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple output values, not just yes and no.
Function:
f(z)=a*z
‘a’ in this case can be any constant value. Let’s quickly define the function in Python (here with a = 4):
def linear_function(z):
    return 4*z
The interval is (-∞, ∞)
Derivative:
f'(z) = a
Advantages
- It gives a range of activations, so the activation is not binary.
- We can connect a few neurons together, and if more than one fires, we could take the max (or softmax) and decide based on that.
Disadvantages
- For this function, the derivative is a constant. That means the gradient has no relationship with the input z.
- The gradient is constant, so the descent proceeds along a constant gradient.
- If there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in input, delta(z)!
3-Sigmoid
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s easy to work with and has all the nice properties of activation functions: it’s non-linear, continuously differentiable, monotonic, and has a fixed output range.
Function:
f(z) = 1/(1+e^-z)
Sigmoid is a non-linear function. This essentially means that when multiple neurons have the sigmoid function as their activation function, the combined output is non-linear as well. Here is the Python code for defining the function:
import numpy as np
def sigmoid_function(z):
    a = 1/(1 + np.exp(-z))
    return a
The interval is (0, 1)
Derivative:
f'(z) = sigmoid(z)*(1-sigmoid(z))
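As a quick sketch, the derivative can be computed in Python by reusing the sigmoid_function defined above:
def sigmoid_derivative(z):
    s = sigmoid_function(z)
    return s * (1 - s)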
Advantages
- It is nonlinear in nature. Combinations of this function are also nonlinear!
- It gives an analog activation, unlike the step function.
- It has a smooth gradient too.
- It’s good for a classifier.
- The output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function. So we have our activations bound in a range. Nice, it won’t blow up the activations then.
Disadvantages
- Towards either end of the sigmoid function, the output values respond very little to changes in the input.
- It gives rise to a problem of “vanishing gradients”.
- Its output isn’t zero-centered. Since 0 < output < 1, the gradient updates tend to go too far in different directions, which makes optimization harder.
- Sigmoids saturate and kill gradients.
- The network may refuse to learn further, or learn drastically slowly (depending on the use case, and until the gradient computation hits floating-point limits).
4-Tanh
The tanh function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values in this case is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign. The tanh function is defined as
Function:
tanh(z) = 2*sigmoid(2z) - 1
or
tanh(z) = (e^z - e^-z) / (e^z + e^-z)
or
tanh(z) = 2/(1+e^(-2z)) - 1
And here is the Python code for the same:
def tanh_function(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
or
def tanh_function(z):
    a = (2/(1 + np.exp(-2*z))) - 1
    return a
The interval is (-1, 1)
Derivative:
tanh′(z)=1−tanh(z)^2
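A quick sketch of the derivative in Python, using NumPy's built-in np.tanh:
def tanh_derivative(z):
    return 1 - np.tanh(z)**2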
Advantages
The gradient is stronger for tanh than for sigmoid (the derivatives are steeper).
Disadvantages
Tanh also has the vanishing gradient problem.
5- ReLU
ReLU, a comparatively recent invention, stands for Rectified Linear Unit. The formula is deceptively simple: max(0, z). Despite its name and appearance, it’s not linear and provides the same benefits as sigmoid but with better performance.
Function:
f(z)=max(0,z)
For negative input values, the result is zero, which means the neuron does not get activated. Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions. Here is the Python function for ReLU:
def relu(z):
    return max(0, z)
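Note that Python's built-in max only handles one number at a time; if the inputs are NumPy arrays (an assumption about how the data is stored), an element-wise version could look like this:
def relu(z):
    return np.maximum(0, z)  # element-wise maximum against 0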
The interval is [0, ∞)
Derivative:
f'(z) = 1, z>=0
= 0, z<0
Advantages
- It avoids and rectifies the vanishing gradient problem.
- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
Disadvantages
- Some gradients can be fragile during training and can die. A weight update can cause a neuron to never activate on any data point again. Put simply, ReLU can result in dead neurons.
- In other words, for activations in the region z < 0 of ReLU, the gradient is 0, so the weights will not get adjusted during descent. Neurons that go into that state stop responding to variations in error or input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
- The range of ReLU is [0, ∞). This means it can blow up the activations.
6-Softmax
The softmax function is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1.
Function:
f(z_i) = e^(z_i - max(z)) / sum_j e^(z_j - max(z))
Subtracting max(z) from every element does not change the result, but it keeps the exponentials numerically stable. Here is the Python code:
def softmax(z):
    """Compute the softmax of vector z."""
    exps = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return exps / np.sum(exps)
Example of softmax:
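A quick illustrative run, with input values made up for the example:
z = np.array([1.0, 2.0, 0.1])
print(softmax(z))  # approximately [0.242, 0.659, 0.099]; the outputs sum to 1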
7-Leaky ReLU
The Leaky ReLU function is an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 for z < 0, which deactivates the neurons in that region.
Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0 for negative values of z, we define it as an extremely small linear component of z.
Function:
f(z)= 0.01z, z<0
= z, z>=0
def leaky_relu_function(z):
    if z < 0:
        return 0.01*z
    else:
        return z
The interval is (-∞, ∞)
Derivative:
f'(z) = 1, z>=0
=0.01, z<0
8-Parameterised ReLU
This is another variant of ReLU that aims to solve the problem of gradients becoming zero for the left half of the axis. The parameterised ReLU, as the name suggests, introduces a new parameter, a, as the slope of the negative part of the function.
Function:
f(z) = z, z>=0
= az, z<0
Derivative:
The derivative of the function is the same as for the Leaky ReLU function, except the value 0.01 is replaced with the value of a.
f'(z) = 1, z>=0
= a, z<0
The parameterized ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
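A minimal sketch in Python; in practice a is a learned parameter, but here it is passed in as an ordinary argument with a made-up default value:
def parameterised_relu(z, a=0.05):
    # a is the slope of the negative part of the function
    if z < 0:
        return a*z
    else:
        return z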
9-Exponential Linear (ELU, SELU)
Similar to leaky ReLU, ELU has a small slope for negative values. Instead of a straight line, it uses an exponential curve for the negative part.
It is designed to combine the good parts of ReLU and leaky ReLU: it doesn’t have the dying ReLU problem, and it saturates for large negative values, allowing them to be essentially inactive.
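A minimal sketch of ELU in Python, assuming the common formulation f(z) = z for z >= 0 and a*(e^z - 1) for z < 0, with a = 1.0 chosen here purely for illustration:
def elu_function(z, a=1.0):
    # For negative inputs the output follows a*(exp(z) - 1),
    # which saturates at -a for large negative z
    if z < 0:
        return a * (np.exp(z) - 1)
    else:
        return z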