Background: Backprop

  • From the Wikipedia article on backpropagation:

Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.

  • Note that the terms backprop and backward pass are often used interchangeably. Strictly speaking, backprop is the algorithm you carry out during the backward pass while training your network.
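As a rough illustration of the description above, the following is a minimal sketch of a forward pass, a backward pass, and a weight update. It assumes a hypothetical two-weight network (`h = tanh(w1 * x)`, `y_hat = w2 * h`), a squared-error loss, and a single made-up training example; none of these choices come from the original text.

```python
import math

def forward(w1, w2, x):
    """Forward pass of a tiny illustrative network: x -> tanh(w1*x) -> w2*h."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    return h, y_hat

def loss_fn(y_hat, y):
    """Squared-error loss (with a 1/2 factor so the derivative is y_hat - y)."""
    return 0.5 * (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Backward pass: propagate dL/dy_hat back to each weight."""
    h, y_hat = forward(w1, w2, x)
    d_y_hat = y_hat - y                # dL/dy_hat
    d_w2 = d_y_hat * h                 # dL/dw2, since y_hat = w2 * h
    d_h = d_y_hat * w2                 # dL/dh
    d_w1 = d_h * (1.0 - h ** 2) * x    # dL/dw1, since d tanh(u)/du = 1 - tanh(u)^2
    return d_w1, d_w2

# Hypothetical starting weights, training example, and learning rate.
w1, w2 = 0.5, -0.3
x, y = 1.0, 0.8
learning_rate = 0.1

initial_loss = loss_fn(forward(w1, w2, x)[1], y)
for _ in range(200):
    d_w1, d_w2 = backward(w1, w2, x, y)   # backprop computes the gradient
    w1 -= learning_rate * d_w1            # the optimizer uses it to update
    w2 -= learning_rate * d_w2
final_loss = loss_fn(forward(w1, w2, x)[1], y)
```

Here `backward` plays the role of backprop, while the two update lines inside the loop are the optimization method (plain gradient descent in this sketch).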

Why understand Backprop?

It is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.

Backprop and Gradient Descent

  • Gradient descent is the optimization method most commonly paired with backprop. Backprop computes the gradient of the loss with respect to each weight, and gradient descent, a first-order iterative optimization algorithm for finding a local minimum of a (differentiable) loss function, uses that gradient to update the weights.
  • To minimize our loss function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point.
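The update rule described above can be sketched in a few lines. This is only an illustration on an assumed one-dimensional loss, `(w - 3)^2`, with a hypothetical learning rate and step count; real networks have many weights and far less convenient loss surfaces.

```python
def loss(w):
    """Illustrative differentiable loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, num_steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(num_steps):
        # Step proportional to the negative of the gradient at the current point.
        w = w - learning_rate * grad(w)
    return w

w_final = gradient_descent(w0=0.0)
```

With these assumed settings, each step shrinks the distance to the minimum by a constant factor, so `w_final` ends up very close to 3. A learning rate that is too large would instead overshoot and could diverge.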

Primer: Differential Calculus

  • Calculus is the study of continuous change. It has two major sub-fields: differential calculus, which studies the rate of change of functions, and integral calculus, which studies the area under the curve. Differential calculus is at the core of deep learning, so it is important to understand what derivatives and gradients are, how they are used in deep learning, and what their limitations are.
  • For a primer on differential calculus, please refer to Aurélien Géron’s notebook on differential calculus.

Primer: Chain Rule for Backprop

  • In calculus, the chain rule helps compute the derivative of composite functions.
  • Formally, it states that:
\[\frac{d}{d x}[f(g(x))]=f^{\prime}(g(x)) g^{\prime}(x)\]
  • Read more about it here.
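One way to build intuition for the chain rule is to compare it against a numerical estimate of the derivative. The sketch below assumes the illustrative choices `f(u) = sin(u)` and `g(x) = x^2` (not taken from the original text) and checks the analytic result against a central finite difference.

```python
import math

def g(x):
    """Inner function g(x) = x^2."""
    return x ** 2

def g_prime(x):
    """Derivative of the inner function."""
    return 2.0 * x

def f(u):
    """Outer function f(u) = sin(u)."""
    return math.sin(u)

def f_prime(u):
    """Derivative of the outer function."""
    return math.cos(u)

def chain_rule_derivative(x):
    """d/dx f(g(x)) = f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(x, eps=1e-6):
    """Central finite-difference estimate of d/dx f(g(x))."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x = 1.5
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(x)
```

The two values should agree to several decimal places. The same kind of comparison (often called a gradient check) is a common way to sanity-check hand-written backward passes.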

(Partial) Derivatives of Standard Layers/Loss Functions

(Partial) Gradients of Standard Layers/Loss Functions