## Overview

• Activation functions decide whether a neuron should be ‘activated’ or not.
• They are applied to the output of each neuron in a neural network and determine how strongly that neuron’s signal is passed on to the next layer.
• The activation function determines the output of a neuron given an input.
• It introduces non-linearity into the network, which is essential for modeling complex, non-linear relationships between the input and output.

Image source: TheAiEdge.io

## Sigmoid Function

• The sigmoid activation function maps any input value to an output value between 0 and 1.
• The sigmoid activation function is often used for binary classification problems, where the output represents the probability that an input belongs to the positive class (i.e., the label is 0 or 1).
• “Some drawbacks of this activation that have been noted in the literature are: sharp damp gradients during backpropagation from deeper hidden layers to inputs, gradient saturation, and slow convergence.” Source
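The saturation behavior quoted above is easy to see numerically. A minimal NumPy sketch (function names are my own, not from any particular framework):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma'(x) = sigma(x) * (1 - sigma(x)).
    # Far from zero the curve flattens, so this gradient shrinks
    # toward 0 -- the saturation that slows backpropagation.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))       # near 0, exactly 0.5, near 1
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # gradient is largest at 0
```

The gradient peaks at 0.25 (at x = 0) and decays exponentially in |x|, which is why deep stacks of sigmoids pass back vanishingly small updates.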

## Hyperbolic Tangent (tanh):

• The tanh activation function maps any input value to a value between -1 and 1 and thus, is often used for modeling continuous outputs in the range [-1, 1].
• For example, it is commonly used in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to model sequential data.
• “Historically, the tanh function became preferred over the sigmoid function as it gave better performance for multi-layer neural networks.
• But it did not solve the vanishing gradient problem that sigmoids suffered, which was tackled more effectively with the introduction of ReLU activations.” Source
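One way to see why tanh behaves like the sigmoid (including its vanishing-gradient problem) is that it is just a rescaled, shifted sigmoid: tanh(x) = 2·σ(2x) − 1. A quick sketch to check this identity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Output in (-1, 1) and zero-centered, unlike the sigmoid.
    return np.tanh(x)

# tanh is the sigmoid stretched to (-1, 1): tanh(x) = 2*sigmoid(2x) - 1
x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```

Being zero-centered is what gave tanh its historical edge over the sigmoid in multi-layer networks, but as the quote notes, both still saturate for large |x|.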

## Rectified Linear Unit (ReLU):

• The ReLU activation function produces output in the range [0, +inf): it maps any negative input value to 0 and leaves positive values unchanged.
• These activation functions are often used for hidden layers in feedforward neural networks and are well-suited for problems with non-linear relationships in the positive part of the input space.
• ReLU functions are linear in the positive domain but zero in the negative domain.
• ReLU activations tackle vanishing gradient problems that sigmoids suffered from.

### Leaky ReLU:

• The leaky ReLU is a variant of the ReLU activation function.
• The leaky ReLU allows for small, non-zero values for negative inputs.
• It is defined as max(αx, x), where x is the input and α is a small positive constant.
• “Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training.
• This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.” Source
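The max(αx, x) definition above translates directly into code. A minimal sketch (the default α = 0.01 is a common choice, not mandated by the source):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x): for x >= 0 this equals x; for x < 0 it equals
    # alpha*x, so negative inputs keep a small non-zero gradient.
    # alpha is a fixed hyperparameter chosen before training, not learned.
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-100.0, -1.0, 2.0])))  # [-1.   -0.01  2.  ]
```

Note that max(αx, x) only reduces to the intended piecewise form when 0 < α < 1, which is why α is always a small positive constant.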

## Softmax:

• Now here is where the confusion intensifies because we have a Softmax-Loss as well as a Softmax activation function.
• Softmax Loss, or cross-entropy loss as discussed in the page here, is a commonly used loss function for multi-class classification problems.
• The Softmax loss combines the Softmax activation function with the cross-entropy (multinomial logistic) loss.
• It measures the difference between the predicted probability distribution and the true label distribution.
• Softmax activation function is differentiable, which makes it possible to train neural networks using gradient-based optimization algorithms.
• The Softmax activation function maps the inputs of a network to a probability distribution over multiple classes.
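To make the activation/loss distinction concrete, here is a minimal NumPy sketch (function names are my own; the max-subtraction is a standard numerical-stability trick, since shifting the logits by a constant does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtract the max logit before exponentiating: exp() of large
    # values overflows, and the shift leaves the output unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(probs, true_class):
    # "Softmax loss": negative log-probability assigned to the true class.
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())          # a probability distribution summing to 1
print(cross_entropy(probs, 0))     # small when the true class gets high probability
```

Here `softmax` is the activation (logits → probability distribution), and `cross_entropy` applied to its output is the Softmax loss; frameworks often fuse the two for numerical stability.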