Motivation

• To understand how the gradient flows backward through the layers of a network, a basic understanding of the chain rule is vital.

Chain Rule

• The figure below summarizes the use of the chain rule for the backward pass in computational graphs.

• In the figure above, the left-hand side illustrates the forward pass, which computes $$z = f(x,y)$$ from the input variables $$x$$ and $$y$$.
• The right-hand side of the figure shows the backward pass. Given the gradient of the loss with respect to $$z$$, denoted by $$\frac{dL}{dz}$$, the gradients of the loss with respect to $$x$$ and $$y$$ can be computed by applying the chain rule, as shown in the figure.
• The chain rule states that to obtain the gradient flowing downstream, we multiply the local gradient of the function at the current “node” by the upstream gradient (see the multiply-node sketch after this list). Formally,
$\text{downstream gradient} = \text{local gradient} \times \text{upstream gradient}$
• In summary, assume we’re given a function $$f(x, y)$$ where $$x = x(m, n)$$ and $$y = y(m, n)$$ are themselves functions of $$m$$ and $$n$$. To determine $$\frac{\partial f}{\partial m}$$ and $$\frac{\partial f}{\partial n}$$, we apply the chain rule (a numerical check appears after this list):
$\frac{\partial f}{\partial m} = \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial m} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial m}$
$\frac{\partial f}{\partial n} = \frac{\partial f}{\partial x} \cdot \frac{\partial x}{\partial n} + \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial n}$
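
As a concrete illustration of the local-times-upstream rule, here is a minimal sketch of a multiply node in a computational graph. The `MultiplyGate` class name and its `forward`/`backward` methods are hypothetical, not from a particular framework: the forward pass caches its inputs, and the backward pass multiplies the upstream gradient $$\frac{dL}{dz}$$ by the local gradients $$\frac{\partial z}{\partial x} = y$$ and $$\frac{\partial z}{\partial y} = x$$.

```python
class MultiplyGate:
    """A single node z = x * y in a computational graph (hypothetical example)."""

    def forward(self, x, y):
        # Cache the inputs; they are exactly the local gradients needed in backward.
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # downstream gradient = local gradient * upstream gradient
        dL_dx = self.y * dL_dz   # local gradient dz/dx = y
        dL_dy = self.x * dL_dz   # local gradient dz/dy = x
        return dL_dx, dL_dy


gate = MultiplyGate()
z = gate.forward(3.0, -4.0)        # forward pass: z = -12.0
dL_dx, dL_dy = gate.backward(2.0)  # upstream gradient dL/dz = 2.0
print(dL_dx, dL_dy)                # -8.0, 6.0
```

Every gate in a network follows this same pattern; backpropagation is just the repeated application of the local-times-upstream multiplication from the loss back to the inputs.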
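The summary formula can also be checked numerically. The sketch below uses arbitrary example functions chosen for illustration ($$f = x \cdot y$$, $$x = m + n$$, $$y = m \cdot n$$ are assumptions, not from the notes) and compares the chain-rule expression for $$\frac{\partial f}{\partial m}$$ against a finite-difference estimate.

```python
def f(x, y):       # example outer function (assumption: f = x * y)
    return x * y

def x_of(m, n):    # example inner function x(m, n) = m + n
    return m + n

def y_of(m, n):    # example inner function y(m, n) = m * n
    return m * n

m, n = 2.0, 3.0
x, y = x_of(m, n), y_of(m, n)

# Chain rule: df/dm = (df/dx)(dx/dm) + (df/dy)(dy/dm)
df_dx, df_dy = y, x          # partials of f = x * y
dx_dm, dy_dm = 1.0, n        # partials of x = m + n and y = m * n
df_dm_chain = df_dx * dx_dm + df_dy * dy_dm

# Finite-difference check of the same derivative
eps = 1e-6
df_dm_numeric = (f(x_of(m + eps, n), y_of(m + eps, n)) -
                 f(x_of(m - eps, n), y_of(m - eps, n))) / (2 * eps)

print(df_dm_chain, df_dm_numeric)  # both ≈ 21.0
```

The two values agree, which is exactly the consistency check (analytic chain rule versus numerical gradient) commonly used to verify backward-pass implementations.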