Overview

  • Activation functions play a crucial role in neural networks by determining whether a neuron should be ‘activated’ or not. They introduce non-linearity, allowing neural networks to model complex, non-linear relationships. They are applied to the output of each neuron in a neural network and determine how strongly that neuron fires, i.e., what signal it passes on to the next layer.
  • Let’s explore some commonly used activation functions and their characteristics.

Image source: TheAiEdge.io

Sigmoid Function

  • The sigmoid function is often used for binary classification problems. It maps any real-valued number to the range (0, 1), providing an output that can be interpreted as a probability.
  • However, the sigmoid function has some drawbacks: it saturates for large-magnitude inputs, which causes vanishing gradients and slows convergence.
  • Sigmoid is defined as \(sigmoid(x) = 1 / (1 + exp(-x))\)
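  • To make the saturation behavior concrete, here is a minimal NumPy sketch of the sigmoid and its gradient; the function name and example inputs are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    """Map any real-valued input to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# For large-magnitude inputs the output saturates near 0 or 1, so the
# gradient sigmoid(x) * (1 - sigmoid(x)) approaches 0 (vanishing gradients).
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                     # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
print(sigmoid(x) * (1 - sigmoid(x)))  # largest gradient is 0.25, reached at x = 0
```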

Hyperbolic Tangent (tanh):

  • The hyperbolic tangent function maps inputs to values between -1 and 1, which makes it well suited to modeling continuous outputs in the range [-1, 1]. For example, it is commonly used in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to model sequential data.
  • “Historically, the tanh function became preferred over the sigmoid function as it gave better performance for multi-layer neural networks. But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of ReLU activations.” Source
  • Hyperbolic Tangent is defined as: \(tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))\)
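  • A minimal NumPy sketch of tanh (the example inputs are illustrative); note that, unlike sigmoid, its outputs are zero-centered.

```python
import numpy as np

def tanh(x):
    """Map inputs to the range (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))     # ~[-0.964, 0.0, 0.964] -- zero-centered, unlike sigmoid
print(np.tanh(x))  # matches NumPy's built-in implementation
```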

Rectified Linear Unit (ReLU):

  • ReLU is a popular activation function used in the hidden layers of feedforward neural networks. It outputs 0 for negative input values and leaves positive values unchanged.
  • ReLU activations help mitigate the vanishing gradient problem that sigmoid activations suffer from, since the gradient is exactly 1 for all positive inputs.
  • ReLU is defined as:
\[ReLU(x) = max(0, x)\]
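  • A minimal sketch in NumPy (illustrative inputs), showing why the gradient does not vanish for positive inputs:

```python
import numpy as np

def relu(x):
    """Zero out negative inputs; pass positive inputs through unchanged."""
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                # [0.  0.  0.  0.5 3. ]
print((x > 0).astype(float))  # gradient: exactly 1 for positive inputs, 0 for negative ones
```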

Leaky ReLU:

  • Leaky ReLU is a variant of ReLU that introduces a small, non-zero slope for negative inputs. This prevents complete saturation of negative values and is useful in scenarios where sparse gradients may occur, such as training generative adversarial networks (GANs).
  • It is defined as max(αx, x), where x is the input and α is a small positive constant.
  • “Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.” Source
\[LeakyReLU(x) = max(αx, x)\]
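  • A minimal sketch in NumPy; the default slope α = 0.01 is a common choice for illustration, not a value specified in the text above.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """ReLU variant with a small fixed slope `alpha` for negative inputs."""
    return np.maximum(alpha * x, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.  0.5  3.] -- negative inputs keep a small, non-zero gradient
```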

Softmax:

  • Here is where some confusion can arise: there is a “Softmax Loss” as well as a softmax activation function. We explain the distinction in more detail below.
  • The softmax function is an activation function often used in the output layer of a neural network for multi-class classification problems. It transforms the raw score outputs from the previous layer into probabilities that sum up to 1, giving a distribution of class probabilities.
  • Cross-entropy loss is a popular loss function for classification tasks, including multi-class classification. It measures the dissimilarity between the predicted probability distribution (often obtained by applying the softmax function to the raw output scores) and the actual label distribution.
  • Sometimes, the combination of softmax activation and cross-entropy loss is collectively referred to as “Softmax Loss” or “Softmax Cross-Entropy Loss”.
  • This naming can indeed cause some confusion, as it’s not the softmax function itself acting as the loss function, but rather the cross-entropy loss applied to the outputs of the softmax function.
  • The softmax function is indeed differentiable, which is vital for backpropagation and gradient-based optimization algorithms in training neural networks.
  • Softmax is defined as: \(softmax(x_i) = exp(x_i) / \sum_j exp(x_j)\), where the sum runs over all elements \(x_j\) of the input vector.
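  • The sketch below ties these pieces together: a numerically stable softmax plus cross-entropy applied to its output, i.e. what is often packaged as “Softmax Loss”. Function names and the example logits are illustrative, not from the original.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def softmax_cross_entropy(logits, target_index):
    """'Softmax loss': cross-entropy applied to the softmax of the raw scores."""
    probs = softmax(logits)
    return -np.log(probs[target_index])

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))                   # ~[0.659, 0.242, 0.099], sums to 1
print(softmax_cross_entropy(logits, 0))  # ~0.417 -- small loss when the true class scores highest
```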

Further Reading