Activation functions
Description
Activation functions work like an on-off switch that determines whether, and in what form, a neuron's output is passed to the next layer. The biological analogy is a synapse in the brain, which decides what information is passed from one neuron to the next. There are several activation functions widely used in neural networks.
Binary function (step function)
In short, the output of the binary (step) function is 1 or 0, depending on whether the input is above a threshold. In math, it looks like this: f(x) = {1, if x > T; 0, otherwise}.
Cons: it does not allow multiple outputs, and it cannot be used to classify inputs into one of several categories.
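A minimal NumPy sketch of the step function above (assuming threshold $T = 0$):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # output 1 where the input exceeds the threshold, 0 otherwise
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 0.5, 3.0])))  # [0 0 1 1]
```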
Linear function
f(x) = $cx$. Cons: 1. the derivative of a linear function is a constant, which does not help backpropagation because the derivative is not correlated with the inputs; in other words, it cannot tell which weights or parameters contribute to learning the task; 2. a linear activation collapses the entire multi-layer network into a single linear layer (the composition of linear functions is still a linear function), turning it into a simple regression model that cannot handle complex tasks by varying the input parameters.
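A small NumPy sketch of the second point: stacking two purely linear layers is equivalent to a single linear layer whose weight matrix is the product of the two (the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # first linear layer
W2 = rng.normal(size=(5, 2))   # second linear layer

two_layers = (x @ W1) @ W2     # forward pass through two stacked linear layers
one_layer = x @ (W1 @ W2)      # a single linear layer with the combined weights

print(np.allclose(two_layers, one_layer))  # True: the stack collapses into one linear map
```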
Non-linear functions
Non-linear functions address these problems in two ways:
- The derivative of a non-linear function is itself a function of the inputs, which lets backpropagation learn how to update the weights for higher accuracy.
- Non-linear functions allow layers of hidden neurons to be stacked into a deep neural network capable of learning from complex datasets and making predictions for complicated tasks.
There are several popular activation functions used in modern deep neural networks.
Sigmoid / logistic function

Equation: $$Sigmoid(x) = \frac{1}{1+e^{-x}}$$ Derivative (with respect to $x$): $$Sigmoid^{'}(x) = Sigmoid(x)(1-Sigmoid(x))$$
Pros:
- smooth gradient, with no jumps in output values, unlike the binary step function.
- output values lie between 0 and 1, normalizing the output of each neuron.
- a natural choice for probability prediction, since probabilities also exist only between 0 and 1.
Cons:
- vanishing gradient: the gradient is nearly zero when $x>2$ or $x<-2$, so the weights barely update in those regions.
- computationally expensive.
- non-zero-centered outputs. Sigmoid outputs are always positive, so during backpropagation the gradients on a layer's weights are either all positive or all negative; the gradient updates then zig-zag, which makes optimization harder.
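A short NumPy sketch of sigmoid and its derivative, illustrating the saturation (vanishing gradient) and the always-positive outputs listed above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])
print(sigmoid(x))       # every output is in (0, 1), so never zero-centered
print(sigmoid_grad(x))  # ~0.007 at |x| = 5 versus 0.25 at x = 0: the gradient nearly vanishes
```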
Softmax
$$Softmax(x_i)= \frac{e^{x_i}}{\Sigma_{j=1}^{n}{e^{x_j}}}$$
Pros: handles multi-class classification, and the predicted probabilities sum to 1. Cons: typically used only in the output layer.
Softmax suits multi-class problems where the prediction must be exactly one of the categories. When an input may belong to several classes at once (multi-label), k independent sigmoid outputs can be used instead.
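A sketch of softmax in NumPy; subtracting the maximum logit before exponentiating is the usual numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max logit for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0: the class probabilities sum to one
```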
Tanh
Equation: $$tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$ Derivative (with respect to $x$): $$tanh^{'}(x) = 1 -tanh(x)^2$$
Pros:
- zero-centered: makes it easier to model inputs that are strongly positive, strongly negative, or neutral.
- otherwise similar to sigmoid
Cons:
- vanishing gradient
- computationally expensive, as it involves division and exponential operations.
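A small sketch of tanh and its derivative, showing the zero-centered outputs and the saturation for large $|x|$:

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.tanh(x))    # outputs lie in (-1, 1) and are centered around 0
print(tanh_grad(x))  # ~0.01 at |x| = 3: tanh also saturates, so the gradient can vanish
```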
Vanishing gradient
Vanishing gradient means that the gradients become so small as they are propagated back through the layers that the weights and biases barely change during training.
Exploding gradient
Exploding gradient means that the gradients grow so large during backpropagation that the weights and biases change drastically and training becomes unstable.
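A rough sketch of both effects: a gradient repeatedly multiplied by per-layer factors below 1 vanishes, while factors above 1 make it explode. The factors 0.25 and 2.5 are only illustrative (0.25 is the maximum slope of sigmoid):

```python
grad = 1.0
for _ in range(20):   # backpropagating through 20 layers
    grad *= 0.25      # per-layer factor < 1 (e.g. sigmoid's maximum slope)
print(grad)           # ~9e-13: the gradient has vanished

grad = 1.0
for _ in range(20):
    grad *= 2.5       # per-layer factor > 1
print(grad)           # ~9e+07: the gradient has exploded
```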
ReLU
Equation: $$ReLU(x) = \max(0, x)$$
Pros:
- computationally efficient
- non-linear
Why does ReLU perform better in modern neural networks? There is no definitive answer yet, but its properties, such as a non-saturating gradient and computational efficiency, do lead to fast convergence. Additionally, the sparsity it induces in the network also improves modeling performance. The non-zero-centered issue can be tackled by other techniques such as Batch Normalization, which produces a stable input distribution for ReLU.
Cons:
- Dying ReLU problem: backpropagation does not work when inputs are negative, because the output and the gradient are both zero there. However, to some extent the dying ReLU problem makes activations sparse, which helps the network focus on the more important values and perform better.
- Not differentiable at zero.
- Not zero-centered.
- Does not prevent exploding gradients.
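A NumPy sketch of ReLU and its gradient, including the dying-ReLU case in which every input is negative and no gradient flows back:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for positive inputs, 0 otherwise (undefined exactly at 0)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))          # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))     # [0. 0. 0. 1. 1.]

dead = np.array([-4.0, -2.0, -1.0])
print(relu_grad(dead))  # all zeros: a "dying" ReLU unit stops learning
```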
ELU
Equation: \begin{equation} ELU(x) = \begin{cases} \alpha (e^x - 1), & x \leqslant 0 \newline x, & x > 0 \end{cases} \end{equation}
Derivative: \begin{equation} ELU^{'}(x) = \begin{cases} ELU(x) + \alpha, & x \leqslant 0 \newline 1, & x > 0 \end{cases} \end{equation}
Pros:
- prevents the dying ReLU problem.
- the gradient is non-zero when input values are negative.
- non-linear.
Cons:
- does not prevent exploding gradients.
- not computationally efficient (involves an exponential).
- $\alpha$ is a fixed hyperparameter, not learned by the network.
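A sketch of ELU and its derivative, using the identity from the formulas above that $ELU^{'}(x) = ELU(x) + \alpha$ for $x \leqslant 0$ (with $\alpha = 1.0$ as an example):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # for x <= 0 the derivative is alpha * e^x, which equals ELU(x) + alpha
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))       # negative inputs saturate smoothly toward -alpha
print(elu_grad(x))  # the gradient stays non-zero for negative inputs
```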
Leaky ReLU
Equation: \begin{equation} LReLU(x) = \begin{cases} \alpha x, &x \leqslant 0 \newline x, &x > 0 \end{cases} \end{equation} Derivative: \begin{equation} LReLU^{'}(x) = \begin{cases} \alpha, &x \leqslant 0 \newline 1, &x > 0 \end{cases} \end{equation}
Pros:
- prevents the dying ReLU problem
- computationally efficient
- non-linear
Cons:
- does not prevent exploding gradients
- inconsistent results for negative input values
- not zero-centered
- not differentiable at zero
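A sketch of Leaky ReLU and its derivative with a typical small slope $\alpha = 0.01$ (the exact value is a hyperparameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))       # negative inputs are scaled by alpha instead of being zeroed
print(leaky_relu_grad(x))  # the gradient is alpha rather than 0 for non-positive inputs
```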
SELU
Equation: \begin{equation} SELU(x) = \lambda \begin{cases} \alpha (e^x - 1), & x \leqslant 0 \newline x, & x > 0 \end{cases} \end{equation} Derivative: \begin{equation} SELU^{'}(x) = \lambda \begin{cases} \alpha e^x, & x \leqslant 0 \newline 1, & x > 0 \end{cases} \end{equation} where $\alpha \approx 1.6732632423543772848170429916717$ and $\lambda \approx 1.0507009873554804934193349852946$.
Pros:
- internal normalization, which means faster convergence.
- prevents both vanishing and exploding gradients.
Cons: needs more empirical evidence of its performance on CNNs and RNNs.
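A sketch of SELU using the constants above; on standard-normal inputs the outputs stay roughly zero-mean and unit-variance, which is the internal-normalization property:

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # zero-mean, unit-variance inputs
y = selu(x)
print(y.mean(), y.var())           # both stay close to 0 and 1 respectively
```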
GELU
Pros:
- strong performance in NLP, notably in BERT and GPT-2
- avoids the vanishing gradient problem
Cons: needs more applications to prove its performance.
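A sketch of GELU, assuming the standard definition $GELU(x) = x\,\Phi(x)$ with $\Phi$ the standard normal CDF; the tanh form is the common fast approximation used in BERT and GPT-2 implementations:

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi written via the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for v in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(v, gelu_exact(v), gelu_tanh(v))  # the two forms agree closely
```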