Motivation

  • To understand how the gradient is propagated backwards through the layers of your network, a basic understanding of the chain rule is vital.

Chain Rule

  • The figure below summarizes the use of the chain rule for the backward pass in computational graphs.

  • In the figure above, the left-hand side illustrates the forward pass, which computes \(z = f(x,y)\) from the input variables \(x\) and \(y\).
  • The right-hand side shows the backward pass. Given the gradient of the loss function with respect to \(z\), denoted by \(\frac{dL}{dz}\), the gradients of the loss with respect to \(x\) and \(y\) can be computed by applying the chain rule, as shown in the figure.
  • The chain rule states that to get the gradient flowing downstream, we need to multiply the local gradient of the function at the current “node” by the upstream gradient. Formally,
\[\text{downstream gradient = local gradient }\times\text{ upstream gradient}\]
  • In summary, assume that we’re given a function \(f(x, y)\) where \(x = x(m, n)\) and \(y = y(m, n)\) are themselves functions of \(m\) and \(n\). To determine the values of \(\frac{\partial f}{\partial m}\) and \(\frac{\partial f}{\partial n}\), we apply the chain rule (see the short numerical sketch after the equations below):
\[\begin{aligned} \frac{\partial f}{\partial m} &= \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial m} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial m} \\ \frac{\partial f}{\partial n} &= \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial n} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial n} \end{aligned}\]
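
  • Below is a minimal sketch of this idea on a tiny computational graph. The concrete functions are assumptions chosen purely for illustration: \(x(m, n) = m + n\), \(y(m, n) = m \cdot n\), and \(f(x, y) = x \cdot y\). The backward pass multiplies each local gradient by its upstream gradient and sums the contributions flowing into \(m\) and \(n\), and the result is checked against a finite-difference estimate.

```python
def forward(m, n):
    x = m + n          # x(m, n)
    y = m * n          # y(m, n)
    return x * y       # f(x, y)

def backward(m, n):
    # Forward pass: cache the intermediate values needed later.
    x = m + n
    y = m * n

    # Upstream gradient at the output node: df/df = 1.
    df = 1.0

    # Local gradients of f(x, y) = x * y, each multiplied by the upstream gradient.
    dx = df * y        # df/dx = y
    dy = df * x        # df/dy = x

    # Chain rule: sum the contributions along both paths into m and n.
    dm = dx * 1.0 + dy * n   # dx/dm = 1, dy/dm = n
    dn = dx * 1.0 + dy * m   # dx/dn = 1, dy/dn = m
    return dm, dn

# Compare the analytic gradients with a numerical (central-difference) estimate.
m, n, eps = 2.0, 3.0, 1e-6
dm, dn = backward(m, n)
dm_num = (forward(m + eps, n) - forward(m - eps, n)) / (2 * eps)
dn_num = (forward(m, n + eps) - forward(m, n - eps)) / (2 * eps)
print(dm, dm_num)   # both ≈ 21.0
print(dn, dn_num)   # both ≈ 16.0
```

  • Note how the backward pass never differentiates the whole expression at once: each node only needs its own local gradient and the gradient arriving from upstream, which is exactly the pattern backpropagation applies layer by layer in a network.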