• Activation functions essentially decide whether a neuron should be ‘activated’ or not.
  • They are applied to the weighted sum of inputs that each neuron computes and determine how strongly that neuron's signal is passed on to the next layer.
  • The activation function determines the output of a neuron given an input.
  • It is used to introduce non-linearity into the network, which is needed to model complex, non-linear relationships between the input and output. Without it, any stack of layers would collapse into a single linear transformation.
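
As a minimal sketch of why the non-linearity matters, the snippet below stacks two one-dimensional linear layers with and without a ReLU in between. The weights and the helper names (linear, stacked_linear, stacked_with_relu) are made up for illustration and do not come from any library.

```python
def linear(w, b):
    """Return a one-dimensional affine map x -> w * x + b."""
    return lambda x: w * x + b


def relu(x):
    """Pass positive inputs through unchanged and clamp the rest to zero."""
    return max(0.0, x)


# Two hypothetical layers with arbitrary illustrative weights.
layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)


def stacked_linear(x):
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))


# The same function written as ONE linear layer:
# w = -3 * 2 = -6 and b = -3 * (-1) + 0.5 = 3.5.
collapsed = linear(-6.0, 3.5)


def stacked_with_relu(x):
    """Two linear layers with a ReLU in between; no longer a single line."""
    return layer2(relu(layer1(x)))


for x in (-2.0, 0.0, 2.0):
    print(x, stacked_linear(x), collapsed(x), stacked_with_relu(x))
```

The first two columns of output always agree, so the extra linear layer adds no modelling power. The ReLU version is flat for small inputs and sloped for large ones, which no single linear layer can reproduce.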

Image source: TheAiEdge.io

Sigmoid Function

  • The sigmoid activation function maps any input value to an output value between 0 and 1.
  • The sigmoid activation function is often used in the output layer for binary classification problems, where the output can be read as the probability of the positive class and thresholded to 0 or 1.
  • “Some drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.” Source
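
A small sketch of the sigmoid and its derivative may make the saturation point concrete. The function names are illustrative, and the branch on the sign of x is just one common way to avoid overflow in the exponential.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The gradient peaks at 0.25 when x is 0 and shrinks towards zero as the input grows in either direction. Multiplying many such small factors during backpropagation is what makes gradients fade in deep sigmoid networks.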

Hyperbolic Tangent (tanh):

  • The tanh activation function maps any input value to a value strictly between -1 and 1, with outputs centred on zero, and is therefore often used for modeling continuous outputs in that range.
  • For example, it is commonly used in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to model sequential data.
  • “Historically, the tanh function became preferred over the sigmoid function as it gave better performance for multi-layer neural networks.
  • But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of ReLU activations.” Source
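
The sketch below illustrates two points from this section, under the assumption that a scalar, standard-library version is enough to show the idea: tanh is a rescaled and shifted sigmoid, and its gradient also saturates for large inputs. The helper names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


x = 0.7
# Identity: tanh(x) = 2 * sigmoid(2x) - 1, so tanh is a zero-centred sigmoid.
print(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# The gradient is 1 at the origin but still vanishes for large |x|.
for v in (0.0, 2.0, 5.0):
    print(f"x={v:4.1f}  tanh={math.tanh(v):.6f}  grad={tanh_grad(v):.6f}")
```

The peak gradient of 1 (against 0.25 for the sigmoid) is one reason tanh tended to train better, while the shrinking gradient at the tails shows why it does not remove the vanishing gradient problem.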

Rectified Linear Unit (ReLU):

  • The ReLU activation function maps any negative input value to 0 and leaves positive values unchanged, so its output lies in the range [0, +inf).
  • ReLU is a common default for hidden layers in feedforward neural networks, where stacking many such piecewise-linear units lets the network approximate non-linear relationships.
  • ReLU functions are linear for positive inputs and zero for negative inputs.
  • ReLU activations mitigate the vanishing gradient problem that sigmoids suffer from, because the gradient is exactly 1 for every positive input and does not saturate.
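
A minimal scalar sketch of ReLU and its gradient follows. The function names are illustrative, and returning 0 for the gradient at exactly x = 0 is an assumed convention, since the derivative is undefined there.

```python
def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, -0.5, 0.0, 0.5, 100.0):
    print(f"x={x:6.1f}  relu={relu(x):6.1f}  grad={relu_grad(x):.1f}")
```

The gradient stays at 1 however large the positive input becomes, in contrast to the sigmoid. The flip side is that the gradient is 0 for every negative input, so a unit that only ever receives negative inputs stops learning.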

Leaky ReLU:

  • The leaky ReLU is a variant of the ReLU activation function.
  • The leaky ReLU allows for small, non-zero values for negative inputs.
  • It is defined as max(αx, x), where x is the input and α is a small positive constant less than 1 (for example, 0.01).
  • “Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training.
  • This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.” Source
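
The definition max(αx, x) translates directly into code. In this sketch the default alpha=0.01 is an assumed, commonly seen value rather than one given in the text, and the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU as max(alpha * x, x); assumes 0 < alpha < 1."""
    return max(alpha * x, x)


def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for positive inputs, the fixed slope alpha otherwise."""
    return 1.0 if x > 0 else alpha


for x in (-100.0, -2.0, 0.0, 3.0):
    print(f"x={x:7.1f}  leaky_relu={leaky_relu(x):8.3f}  grad={leaky_relu_grad(x):.2f}")
```

Because the gradient for negative inputs is alpha rather than 0, some signal always flows back through the unit.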


Softmax:

  • This is where the confusion intensifies, because there is a Softmax loss as well as a Softmax activation function.
  • Softmax loss, or cross-entropy loss as discussed in the page here, is a commonly used loss function for multi-class classification problems.
  • The Softmax loss combines the Softmax activation function with the cross-entropy (multinomial logistic) loss.
  • It measures the difference between the predicted probability distribution and the true label distribution.
  • The Softmax activation function is differentiable, which makes it possible to train neural networks using gradient-based optimization algorithms.
  • The Softmax activation function maps the raw outputs (logits) of a network to a probability distribution over multiple classes.
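
To separate the two ideas, the sketch below implements the Softmax activation and the Softmax (cross-entropy) loss as two functions. The names, the example logits, and the max-subtraction trick for numerical stability are illustrative choices, not taken from the text.

```python
import math


def softmax(logits):
    """Softmax activation: turn raw scores into probabilities summing to 1."""
    m = max(logits)  # subtracting the max leaves the result unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    """Softmax loss: -log of the probability assigned to the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


logits = [2.0, 1.0, 0.1]  # hypothetical raw network outputs for 3 classes
probs = softmax(logits)
print(probs, sum(probs))

# The loss is small when the true class has high probability, large otherwise.
print(softmax_cross_entropy(logits, 0))
print(softmax_cross_entropy(logits, 2))
```

The activation produces the predicted distribution, and the loss scores that distribution against the true label. Computing the loss directly from the logits, as done here, avoids taking the log of a probability that has underflowed to zero.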

Further Reading