Overview

  • Activation functions play a crucial role in neural networks by determining whether a neuron should be ‘activated’ or not. They introduce non-linearity, allowing neural networks to model complex, non-linear relationships. They are applied to the output of each neuron in a neural network and determine how strongly that neuron fires, i.e., what signal it passes on to the next layer.
  • Let’s explore some commonly used activation functions and their characteristics.

Image source: TheAiEdge.io

Sigmoid Function

  • The sigmoid function is often used for binary classification problems. It maps any real-valued number to the range (0, 1), providing an output that can be interpreted as a probability.
  • However, the sigmoid function has some drawbacks: it saturates for large-magnitude inputs, which causes vanishing gradients and slows convergence.
  • Sigmoid is defined as \(sigmoid(x) = 1 / (1 + exp(-x))\)
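  • To make the saturation behavior concrete, here is a minimal NumPy sketch of the sigmoid and its gradient; the function name and example inputs are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    """Map any real-valued input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# For large-magnitude inputs the output saturates near 0 or 1, so the
# gradient sigmoid(x) * (1 - sigmoid(x)) approaches 0 (vanishing gradients).
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                     # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
print(sigmoid(x) * (1 - sigmoid(x)))  # largest gradient is 0.25, reached at x = 0
```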

Hyperbolic Tangent (tanh):

  • The hyperbolic tangent function maps inputs to values between -1 and 1, which makes it well suited to modeling continuous outputs in the range [-1, 1]. For example, it is commonly used in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to model sequential data.
  • “Historically, the tanh function became preferred over the sigmoid function as it gave better performance for multi-layer neural networks. But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of ReLU activations.” Source
  • Hyperbolic Tangent is defined as: \(tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))\)
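  • A minimal NumPy sketch of tanh (the example inputs are illustrative); note that, unlike sigmoid, its outputs are zero-centered.

```python
import numpy as np

def tanh(x):
    """Map inputs to the range (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))     # ~[-0.964, 0.0, 0.964] -- zero-centered, unlike sigmoid
print(np.tanh(x))  # matches NumPy's built-in implementation
```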

Rectified Linear Unit (ReLU):

  • ReLU is a popular activation function used in the hidden layers of feedforward neural networks. It outputs 0 for negative input values and leaves positive values unchanged.
  • ReLU activations help mitigate the vanishing gradient problem that sigmoid activations suffer from, since the gradient is exactly 1 for all positive inputs.
  • ReLU is defined as:
\[ReLU(x) = max(0, x)\]
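  • A minimal sketch in NumPy (illustrative inputs), showing why the gradient does not vanish for positive inputs:

```python
import numpy as np

def relu(x):
    """Zero out negative inputs; pass positive inputs through unchanged."""
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                # [0.  0.  0.  0.5 3. ]
print((x > 0).astype(float))  # gradient: exactly 1 for positive inputs, 0 for negative ones
```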

Leaky ReLU:

  • Leaky ReLU is a variant of ReLU that introduces a small, non-zero slope for negative inputs. This prevents complete saturation of negative values and is useful in scenarios where sparse gradients may occur, such as training generative adversarial networks (GANs).
  • It is defined as max(αx, x), where x is the input and α is a small positive constant.
  • “Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.” Source
\[LeakyReLU(x) = max(αx, x)\]
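  • A minimal sketch in NumPy; the default slope α = 0.01 is a common choice for illustration, not a value specified in the text above.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """ReLU variant with a small fixed slope `alpha` for negative inputs."""
    return np.maximum(alpha * x, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.  0.5  3.] -- negative inputs keep a small, non-zero gradient
```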

Softmax:

  • Here is where some confusion can arise: there is a “Softmax Loss” as well as a softmax activation function. We explain the distinction in more detail below.
  • The softmax function is an activation function often used in the output layer of a neural network for multi-class classification problems. It transforms the raw score outputs from the previous layer into probabilities that sum up to 1, giving a distribution of class probabilities.
  • Cross-entropy loss is a popular loss function for classification tasks, including multi-class classification. It measures the dissimilarity between the predicted probability distribution (often obtained by applying the softmax function to the raw output scores) and the actual label distribution.
  • Sometimes, the combination of softmax activation and cross-entropy loss is collectively referred to as “Softmax Loss” or “Softmax Cross-Entropy Loss”.
  • This naming can indeed cause some confusion, as it’s not the softmax function itself acting as the loss function, but rather the cross-entropy loss applied to the outputs of the softmax function.
  • The softmax function is indeed differentiable, which is vital for backpropagation and gradient-based optimization algorithms in training neural networks.
  • Softmax is defined as: \(softmax(x_i) = exp(x_i) / \sum_j exp(x_j)\), where the sum runs over all elements \(x_j\) of the input vector.
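  • The sketch below ties these pieces together: a numerically stable softmax plus cross-entropy applied to its output, i.e. what is often packaged as “Softmax Loss”. Function names and the example logits are illustrative, not from the original.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def softmax_cross_entropy(logits, target_index):
    """'Softmax loss': cross-entropy applied to the softmax of the raw scores."""
    probs = softmax(logits)
    return -np.log(probs[target_index])

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                   # ~[0.659, 0.242, 0.099], sums to 1
print(softmax_cross_entropy(logits, 0))  # ~0.417 -- small loss when the true class scores highest
```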

Further Reading