## Linear Algebra

• Linear Algebra is the branch of mathematics that studies vector spaces and linear transformations between vector spaces, such as rotating a shape, scaling it up or down, translating it (i.e., moving it), etc.
• Machine Learning relies heavily on Linear Algebra, so it is essential to understand what vectors and matrices are, what operations you can perform with them, and how they can be useful.

## Vectors

### Definition

A vector is a quantity defined by a magnitude and a direction. For example, a rocket’s velocity is a 3-dimensional vector: its magnitude is the speed of the rocket, and its direction is (hopefully) up. A vector can be represented by an array of numbers called scalars. Each scalar corresponds to the magnitude of the vector with regards to each dimension.

For example, say the rocket is going up at a slight angle: it has a vertical speed of 5,000 m/s, and also a slight speed towards the East at 10 m/s, and a slight speed towards the North at 50 m/s. The rocket’s velocity may be represented by the following vector:

$\text{velocity} = \begin{bmatrix} 10 \\ 50 \\ 5000 \end{bmatrix}$

Note: by convention, vectors are generally presented in the form of columns. Also, vector names are generally lowercase to distinguish them from matrices (which we will discuss below) and in bold (when possible) to distinguish them from simple scalar values such as $$\textit{meters\_per\_second} = 5026$$.

A list of N numbers may also represent the coordinates of a point in an N-dimensional space, so it is quite frequent to represent vectors as simple points instead of arrows. A vector with 1 element may be represented as an arrow or a point on an axis, a vector with 2 elements is an arrow or a point on a plane, a vector with 3 elements is an arrow or point in space, and a vector with N elements is an arrow or a point in an N-dimensional space… which most people find hard to imagine.

### Purpose

Vectors have many purposes in Machine Learning, most notably to represent observations and predictions. For example, say we built a Machine Learning system to classify videos into 3 categories (good, spam, clickbait) based on what we know about them. For each video, we would have a vector representing what we know about it, such as:

$\text{video} = \begin{bmatrix} 10.5 \\ 5.2 \\ 3.25 \\ 7.0 \end{bmatrix}$

This vector could represent a video that lasts 10.5 minutes, is watched for more than a minute by only 5.2% of viewers, gets 3.25 views per day on average, and was flagged 7 times as spam. As you can see, each axis may have a different meaning.

Based on this vector, our Machine Learning system may predict that there is an 80% probability that it is a spam video, 18% that it is clickbait, and 2% that it is a good video. This could be represented as the following vector:

$\text{class\_probabilities} = \begin{bmatrix} 0.80 \\ 0.18 \\ 0.02 \end{bmatrix}$

## Vectors in Python

• In Python, a vector can be represented in many ways, the simplest being a regular Python list of numbers:
[10.5, 5.2, 3.25, 7.0]

• Since we plan to do quite a lot of scientific calculations, it is much better to use NumPy’s ndarray, which provides a lot of convenient and optimized implementations of essential mathematical operations on vectors (for more details about NumPy, check out the NumPy tutorial). For example:
import numpy as np
video = np.array([10.5, 5.2, 3.25, 7.0])
video

• The size of a vector can be obtained using the size attribute:
video.size

• The $$i^{th}$$ element (also called entry or item) of a vector $$\textbf{v}$$ is noted $$\textbf{v}_i$$.

• Note that indices in mathematics generally start at 1, but in programming they usually start at 0. So to access $$\textbf{video}_3$$ programmatically, we would write:

video[2]  # 3rd element


## Norm

• The norm of a vector $$\textbf{u}$$, noted $$\left\Vert \textbf{u} \right\Vert$$, is a measure of the length (a.k.a. the magnitude) of $$\textbf{u}$$. There are multiple possible norms, but the most common one (and the only one we will discuss here) is the Euclidean norm, which is defined as:
$\left\Vert \textbf{u} \right\Vert = \sqrt{\sum_{i}{\textbf{u}_i}^2}$
• We could implement this easily in pure python, recalling that $$\sqrt x = x^{\frac{1}{2}}$$.
def vector_norm(vector):
    squares = [element**2 for element in vector]
    return sum(squares)**0.5

u = np.array([2, 5])  # example 2D vector with norm ≈ 5.4, used in the rest of this section
print("||", u, "|| =", vector_norm(u))  # ≈ 5.385

• However, it is much more efficient to use NumPy’s norm function, available in the linalg (Linear Algebra) module:
import numpy.linalg as LA
LA.norm(u)

• Let’s plot a little diagram to confirm that the length of vector $$\textbf{u}$$ is indeed $$\approx5.4$$:
import matplotlib.pyplot as plt

radius = LA.norm(u)
plt.gca().add_artist(plt.Circle((0, 0), radius, color="#DDDDDD"))
plot_vector2d(u, color="red")  # helper (defined elsewhere in this notebook) that draws u as an arrow from the origin
plt.axis([0, 8.7, 0, 6])
plt.grid()
plt.show()


## Differential Calculus

• Calculus is the study of continuous change. It has two major subfields: differential calculus, which studies the rate of change of functions, and integral calculus, which studies the area under the curve. In this notebook, we will discuss the former.
• Differential calculus is at the core of Deep Learning, so it is important to understand what derivatives and gradients are, how they are used in Deep Learning, and understand what their limitations are.

## Slope of a straight line

• The slope of a (non-vertical) straight line can be calculated by taking any two points $$\mathrm{A}$$ and $$\mathrm{B}$$ on the line, and computing the “rise over run”:
$slope = \dfrac{\Delta y}{\Delta x} = \dfrac{height}{width} = \dfrac{rise}{run} = \dfrac{y_\mathrm{B} - y_\mathrm{A}}{x_\mathrm{B} - x_\mathrm{A}}$
• In this example, the height (rise) is $$3$$, and the width (run) is $$6$$, so the slope is $$3/6 = 0.5$$.

## Defining the slope of a curve

• Now, let’s try to figure out how we can compute the slope of something other than a straight line. For example, let’s consider the curve defined by $$y = f(x) = x^2$$:

• Obviously, the slope varies: on the left (i.e., when $$x<0$$), the slope is negative (i.e., when we move from left to right, the curve goes down), while on the right (i.e., when $$x>0$$) the slope is positive (i.e., when we move from left to right, the curve goes up). At the point $$x=0$$, the slope is equal to 0 (i.e., the curve is locally flat). The fact that the slope is 0 when we reach a minimum (or indeed a maximum) is crucially important, and we will come back to it later.

• How can we put numbers on these intuitions? Well, say we want to estimate the slope of the curve at a point $$\mathrm{A}$$: we can do this by taking another point $$\mathrm{B}$$ on the curve, not too far away, and then computing the slope between these two points.

• As you can see, when point $$\mathrm{B}$$ is very close to point $$\mathrm{A}$$, the $$(\mathrm{AB})$$ line becomes almost indistinguishable from the curve itself (at least locally around point $$\mathrm{A}$$). The $$(\mathrm{AB})$$ line gets closer and closer to the tangent line to the curve at point $$\mathrm{A}$$: this is the best linear approximation of the curve at point $$\mathrm{A}$$.

So it makes sense to define the slope of the curve at point $$\mathrm{A}$$ as the slope that the $$\mathrm{(AB)}$$ line approaches when $$\mathrm{B}$$ gets infinitely close to $$\mathrm{A}$$. This slope is called the derivative of the function $$f$$ at $$x=x_\mathrm{A}$$. For example, the derivative of the function $$f(x)=x^2$$ at $$x=x_\mathrm{A}$$ is equal to $$2x_\mathrm{A}$$ (we will see how to get this result shortly), so on the graph above, since the point $$\mathrm{A}$$ is located at $$x_\mathrm{A}=-1$$, the tangent line to the curve at that point has a slope of $$-2$$.
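
• As a quick numerical check (a small Python sketch, not part of the derivation), we can approximate this slope by moving $$\mathrm{B}$$ closer and closer to $$\mathrm{A}$$ at $$x_\mathrm{A}=-1$$ and watching the rise-over-run approach $$-2$$:
def f(x):
    return x**2

x_a = -1.0
for delta in [1.0, 0.1, 0.001, 1e-6]:
    x_b = x_a + delta                          # point B gets closer and closer to A
    slope = (f(x_b) - f(x_a)) / (x_b - x_a)    # rise over run
    print(delta, slope)                        # slope approaches -2 as delta shrinks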

## Differentiability

• Note that some functions are not quite as well-behaved as $$x^2$$: for example, consider the function $$f(x)=|x|$$, the absolute value of $$x$$:

• No matter how much you zoom in on the origin (the point at $$x=0, y=0$$), the curve will always look like a V. The slope is -1 for any $$x < 0$$, and it is +1 for any $$x > 0$$, but at $$x = 0$$, the slope is undefined, since it is not possible to approximate the curve $$y=|x|$$ locally around the origin using a straight line, no matter how much you zoom in on that point.
• The function $$f(x)=|x|$$ is said to be non-differentiable at $$x=0$$: its derivative is undefined at $$x=0$$. This means that the curve $$y=|x|$$ has an undefined slope at that point. However, the function $$f(x)=|x|$$ is differentiable at all other points.
• In order for a function $$f(x)$$ to be differentiable at some point $$x_\mathrm{A}$$, the slope of the $$(\mathrm{AB})$$ line must approach a single finite value as $$\mathrm{B}$$ gets infinitely close to $$\mathrm{A}$$.

• This implies several constraints:
• First, the function must of course be defined at $$x_\mathrm{A}$$. As a counterexample, the function $$f(x)=\dfrac{1}{x}$$ is undefined at $$x_\mathrm{A}=0$$, so it is not differentiable at that point.
• The function must also be continuous at $$x_\mathrm{A}$$, meaning that as $$x_\mathrm{B}$$ gets infinitely close to $$x_\mathrm{A}$$, $$f(x_\mathrm{B})$$ must also get infinitely close to $$f(x_\mathrm{A})$$.
• As a counterexample,

$f(x)=\begin{cases}-1 \text{ if }x < 0\\+1 \text{ if }x \geq 0\end{cases}$
• is not continuous at $$x_\mathrm{A}=0$$, even though it is defined at that point: indeed, as you approach 0 from the negative side, $$f(x)$$ approaches $$-1$$, not $$f(0)=+1$$. Therefore, the function is not continuous at that point, and thus not differentiable either.
• The function must not have a breaking point at $$x_\mathrm{A}$$, meaning that the slope that the $$(\mathrm{AB})$$ line approaches as $$\mathrm{B}$$ approaches $$\mathrm{A}$$ must be the same whether $$\mathrm{B}$$ approaches from the left side or from the right side. We already saw a counterexample with $$f(x)=|x|$$, which is both defined and continuous at $$x_\mathrm{A}=0$$, but which has a breaking point at $$x_\mathrm{A}=0$$: the slope of the curve $$y=|x|$$ is -1 on the left, and +1 on the right.
• The curve $$y=f(x)$$ must not be vertical at point $$\mathrm{A}$$. One counterexample is $$f(x)=\sqrt[3]{x}$$, the cubic root of $$x$$: the curve is vertical at the origin, so the function is not differentiable at $$x_\mathrm{A}=0$$.

## Differentiating a function

• Now let’s see how to actually differentiate a function (i.e., find its derivative).
• The derivative of a function $$f(x)$$ at $$x = x_\mathrm{A}$$ is noted $$f’(x_\mathrm{A})$$, and it is defined as:
$f'(x_\mathrm{A}) = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim\dfrac{f(x_\mathrm{B}) - f(x_\mathrm{A})}{x_\mathrm{B} - x_\mathrm{A}}$
• Don’t be scared, this is simpler than it looks! You may recognize the rise over run equation $$\dfrac{y_\mathrm{B} - y_\mathrm{A}}{x_\mathrm{B} - x_\mathrm{A}}$$ that we discussed earlier. That’s just the slope of the $$\mathrm{(AB)}$$ line. And the notation $$\underset{x_\mathrm{B} \to x_\mathrm{A}}\lim$$ means that we are making $$x_\mathrm{B}$$ approach infinitely close to $$x_\mathrm{A}$$. So in plain English, $$f’(x_\mathrm{A})$$ is the value that the slope of the $$\mathrm{(AB)}$$ line approaches when $$\mathrm{B}$$ gets infinitely close to $$\mathrm{A}$$. This is just a formal way of saying exactly the same thing as earlier.

### Example: finding the derivative of $$x^2$$

• Let’s look at a concrete example. Let’s see if we can determine what the slope of the $$y=x^2$$ curve is, at any point $$\mathrm{A}$$:
$\begin{split} f'(x_\mathrm{A}) \, && = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim\dfrac{f(x_\mathrm{B}) - f(x_\mathrm{A})}{x_\mathrm{B} - x_\mathrm{A}} \\ && = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim\dfrac{{x_\mathrm{B}}^2 - {x_\mathrm{A}}^2}{x_\mathrm{B} - x_\mathrm{A}} \quad && \text{since } f(x) = x^2\\ && = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim\dfrac{(x_\mathrm{B} - x_\mathrm{A})(x_\mathrm{B} + x_\mathrm{A})}{x_\mathrm{B} - x_\mathrm{A}}\quad && \text{since } {x_\mathrm{B}}^2 - {x_\mathrm{A}}^2 = (x_\mathrm{B}-x_\mathrm{A})(x_\mathrm{B}+x_\mathrm{A})\\ && = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim(x_\mathrm{B} + x_\mathrm{A})\quad && \text{since the two } (x_\mathrm{B} - x_\mathrm{A}) \text{ cancel out}\\ && = \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim x_\mathrm{B} \, + \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim x_\mathrm{A}\quad && \text{since the limit of a sum is the sum of the limits}\\ && = x_\mathrm{A} \, + \underset{x_\mathrm{B} \to x_\mathrm{A}}\lim x_\mathrm{A} \quad && \text{since } x_\mathrm{B}\text{ approaches } x_\mathrm{A} \\ && = x_\mathrm{A} + x_\mathrm{A} \quad && \text{since } x_\mathrm{A} \text{ remains constant when } x_\mathrm{B}\text{ approaches } x_\mathrm{A} \\ && = 2 x_\mathrm{A} \end{split}$
• That’s it! We just proved that the slope of $$y = x^2$$ at any point $$\mathrm{A}$$ is $$f’(x_\mathrm{A}) = 2x_\mathrm{A}$$. What we have done is called differentiation: finding the derivative of a function.

• Note that we used a couple of important properties of limits. Here are the main properties you need to know to work with derivatives:

• $$\underset{x \to k}\lim c = c \quad$$ if $$c$$ is some constant value that does not depend on $$x$$, then the limit is just $$c$$.
• $$\underset{x \to k}\lim x = k \quad$$ if $$x$$ approaches some value $$k$$, then the limit is $$k$$.
• $$\underset{x \to k}\lim\,\left[f(x) + g(x)\right] = \underset{x \to k}\lim f(x) + \underset{x \to k}\lim g(x) \quad$$ the limit of a sum is the sum of the limits
• $$\underset{x \to k}\lim\,\left[f(x) \times g(x)\right] = \underset{x \to k}\lim f(x) \times \underset{x \to k}\lim g(x) \quad$$ the limit of a product is the product of the limits
• Important note: in Deep Learning, differentiation is almost always performed automatically by the framework you are using (such as TensorFlow or PyTorch). This is called auto-differentiation. However, you should still make sure you have a good understanding of derivatives, or else they will come and bite you one day, for example when you use a square root in your cost function without realizing that its derivative approaches infinity when $$x$$ approaches 0 (tip: you should use $$\sqrt{x+\epsilon}$$ instead, where $$\epsilon$$ is some small constant, such as $$10^{-4}$$).
• You will often find a slightly different (but equivalent) definition of the derivative. Let’s derive it from the previous definition. First, let’s define $$\epsilon = x_\mathrm{B} - x_\mathrm{A}$$. Next, note that $$\epsilon$$ will approach 0 as $$x_\mathrm{B}$$ approaches $$x_\mathrm{A}$$. Lastly, note that $$x_\mathrm{B} = x_\mathrm{A} + \epsilon$$. With that, we can reformulate the definition above like so:
$f'(x_\mathrm{A}) = \underset{\epsilon \to 0}\lim\dfrac{f(x_\mathrm{A} + \epsilon) - f(x_\mathrm{A})}{\epsilon}$
• While we’re at it, let’s just rename $$x_\mathrm{A}$$ to $$x$$, to get rid of the annoying subscript A and make the equation simpler to read:

$f'(x) = \underset{\epsilon \to 0}\lim\dfrac{f(x + \epsilon) - f(x)}{\epsilon}$
• Okay! Now let’s use this new definition to find the derivative of $$f(x) = x^2$$ at any point $$x$$, and (hopefully) we should find the same result as above (except using $$x$$ instead of $$x_\mathrm{A}$$):
$\begin{split} f'(x) \, && = \underset{\epsilon \to 0}\lim\dfrac{f(x + \epsilon) - f(x)}{\epsilon} \\ && = \underset{\epsilon \to 0}\lim\dfrac{(x + \epsilon)^2 - {x}^2}{\epsilon} \quad && \text{since } f(x) = x^2\\ && = \underset{\epsilon \to 0}\lim\dfrac{x^2 + 2x\epsilon + \epsilon^2 - {x}^2}{\epsilon}\quad && \text{since } (x + \epsilon)^2 = {x}^2 + 2x\epsilon + \epsilon^2\\ && = \underset{\epsilon \to 0}\lim\dfrac{2x\epsilon + \epsilon^2}{\epsilon}\quad && \text{since the two } {x}^2 \text{ cancel out}\\ && = \underset{\epsilon \to 0}\lim \, (2x + \epsilon)\quad && \text{since } 2x\epsilon \text{ and } \epsilon^2 \text{ can both be divided by } \epsilon\\ && = 2 x \end{split}$
• As we see, it works out.

### Notations

• A word about notations: there are several other notations for the derivative that you will find in the literature:
$f'(x) = \dfrac{\mathrm{d}f(x)}{\mathrm{d}x} = \dfrac{\mathrm{d}}{\mathrm{d}x}f(x)$
• This notation is also handy when a function is not named. For example $$\dfrac{\mathrm{d}}{\mathrm{d}x}[x^2]$$ refers to the derivative of the function $$x \mapsto x^2$$.
• Moreover, when people talk about the function $$f(x)$$, they sometimes leave out “$$(x)$$”, and they just talk about the function $$f$$. When this is the case, the notation of the derivative is also simpler:
$f' = \dfrac{\mathrm{d}f}{\mathrm{d}x} = \dfrac{\mathrm{d}}{\mathrm{d}x}f$
• The $$f’$$ notation is Lagrange’s notation, while $$\dfrac{\mathrm{d}f}{\mathrm{d}x}$$ is Leibniz’s notation.
• There are also other less common notations, such as Newton’s notation $$\dot y$$ (assuming $$y = f(x)$$) or Euler’s notation $$\mathrm{D}f$$.

## Differentiation rules

• One very important rule is that the derivative of a sum is the sum of the derivatives. More precisely, if we define $$f(x) = g(x) + h(x)$$, then $$f’(x) = g’(x) + h’(x)$$. This is quite easy to prove:

$\begin{split} f'(x) && = \underset{\epsilon \to 0}\lim\dfrac{f(x+\epsilon) - f(x)}{\epsilon} && \quad\text{by definition}\\ && = \underset{\epsilon \to 0}\lim\dfrac{g(x+\epsilon) + h(x+\epsilon) - g(x) - h(x)}{\epsilon} && \quad \text{using }f(x) = g(x) + h(x) \\ && = \underset{\epsilon \to 0}\lim\dfrac{g(x+\epsilon) - g(x) + h(x+\epsilon) - h(x)}{\epsilon} && \quad \text{just moving terms around}\\ && = \underset{\epsilon \to 0}\lim\dfrac{g(x+\epsilon) - g(x)}{\epsilon} + \underset{\epsilon \to 0}\lim\dfrac{h(x+\epsilon) - h(x)}{\epsilon} && \quad \text{since the limit of a sum is the sum of the limits}\\ && = g'(x) + h'(x) && \quad \text{using the definitions of }g'(x) \text{ and } h'(x) \end{split}$
• Similarly, it is possible to show the following important rules (I’ve included the proofs at the end of this notebook, in case you’re curious):

| Function | $$f$$ | Derivative $$f'$$ |
| --- | --- | --- |
| Constant | $$f(x) = c$$ | $$f'(x) = 0$$ |
| Sum | $$f(x) = g(x) + h(x)$$ | $$f'(x) = g'(x) + h'(x)$$ |
| Product | $$f(x) = g(x) h(x)$$ | $$f'(x) = g(x)h'(x) + g'(x)h(x)$$ |
| Quotient | $$f(x) = \dfrac{g(x)}{h(x)}$$ | $$f'(x) = \dfrac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$$ |
| Power | $$f(x) = x^r$$ with $$r \neq 0$$ | $$f'(x) = rx^{r-1}$$ |
| Exponential | $$f(x) = \exp(x)$$ | $$f'(x)=\exp(x)$$ |
| Logarithm | $$f(x) = \ln(x)$$ | $$f'(x) = \dfrac{1}{x}$$ |
| Sin | $$f(x) = \sin(x)$$ | $$f'(x) = \cos(x)$$ |
| Cos | $$f(x) = \cos(x)$$ | $$f'(x) = -\sin(x)$$ |
| Tan | $$f(x) = \tan(x)$$ | $$f'(x) = \dfrac{1}{\cos^2(x)}$$ |
| Chain Rule | $$f(x) = g(h(x))$$ | $$f'(x) = g'(h(x))\,h'(x)$$ |

• Let’s try differentiating a simple function using the above rules: we will find the derivative of $$f(x)=x^3+\cos(x)$$. Using the rule for the derivative of sums, we find that $$f’(x)=\dfrac{\mathrm{d}}{\mathrm{d}x}[x^3] + \dfrac{\mathrm{d}}{\mathrm{d}x}[\cos(x)]$$. Using the rule for the derivative of powers and for the $$\cos$$ function, we find that $$f’(x) = 3x^2 - \sin(x)$$.

• Let’s try a harder example: let’s find the derivative of $$f(x) = \sin(2 x^2) + 1$$. First, let’s define $$u(x)=\sin(x) + 1$$ and $$v(x) = 2x^2$$. Using the rule for sums, we find that $$u’(x)=\dfrac{\mathrm{d}}{\mathrm{d}x}[\sin(x)] + \dfrac{\mathrm{d}}{\mathrm{d}x}[1]$$. Since the derivative of the $$\sin$$ function is $$\cos$$, and the derivative of constants is 0, we find that $$u’(x)=\cos(x)$$. Next, using the product rule, we find that $$v’(x)=2\dfrac{\mathrm{d}}{\mathrm{d}x}[x^2] + \dfrac{\mathrm{d}}{\mathrm{d}x}[2]\,x^2$$. Since the derivative of a constant is 0, the second term cancels out. And since the power rule tells us that the derivative of $$x^2$$ is $$2x$$, we find that $$v’(x)=4x$$. Lastly, using the chain rule, since $$f(x)=u(v(x))$$, we find that $$f’(x)=u’(v(x))\,v’(x)=\cos(2x^2)\,4x$$.
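
• If you ever want to double-check a derivative like this one, a symbolic math library such as SymPy can do the differentiation for you (a quick sketch, assuming SymPy is installed):
import sympy as sp

x = sp.symbols('x')
f = sp.sin(2 * x**2) + 1
print(sp.diff(f, x))   # 4*x*cos(2*x**2), matching the result above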

### The chain rule

• The chain rule is easier to remember using Leibniz’s notation:

• If $$f(x)=g(h(x))$$ and $$y=h(x)$$, then:

$\dfrac{\mathrm{d}f}{\mathrm{d}x} = \dfrac{\mathrm{d}f}{\mathrm{d}y} \dfrac{\mathrm{d}y}{\mathrm{d}x}$
• Indeed, $$\dfrac{\mathrm{d}f}{\mathrm{d}y} = f’(y) = f’(h(x))$$ and $$\dfrac{\mathrm{d}y}{\mathrm{d}x}=h’(x)$$.

• It is possible to chain many functions. For example, if $$f(x)=g(h(i(x)))$$, and we define $$y=i(x)$$ and $$z=h(y)$$, then $$\dfrac{\mathrm{d}f}{\mathrm{d}x} = \dfrac{\mathrm{d}f}{\mathrm{d}z} \dfrac{\mathrm{d}z}{\mathrm{d}y} \dfrac{\mathrm{d}y}{\mathrm{d}x}$$. Using Lagrange’s notation, we get $$f’(x)=g’(z)\,h’(y)\,i’(x)=g’(h(i(x)))\,h’(i(x))\,i’(x)$$.

• The chain rule is crucial in Deep Learning, as a neural network is basically a long composition of functions. For example, a 3-layer dense neural network corresponds to the following function: $$f(\mathbf{x})=\operatorname{Dense}_3(\operatorname{Dense}_2(\operatorname{Dense}_1(\mathbf{x})))$$ (in this example, $$\operatorname{Dense}_3$$ is the output layer).

## Derivatives and optimization

• When trying to optimize a function $$f(x)$$, we look for the values of $$x$$ that minimize (or maximize) the function.

• It is important to note that when a function reaches a minimum or maximum, assuming it is differentiable at that point, the derivative will necessarily be equal to 0. For example, if you plot a function $$f$$ together with its derivative $$f’$$, you will notice that whenever $$f$$ reaches a maximum or minimum, $$f’$$ is equal to 0 at that point.

• So one way to optimize a function is to differentiate it and analytically find all the values for which the derivative is 0, then determine which of these values optimize the function (if any). For example, consider the function $$f(x)=\dfrac{1}{4}x^4 - x^2 + \dfrac{1}{2}$$. Using the derivative rules (specifically, the sum rule, the product rule, the power rule and the constant rule), we find that $$f’(x)=x^3 - 2x$$. We look for the values of $$x$$ for which $$f’(x)=0$$, so $$x^3-2x=0$$, and therefore $$x(x^2-2)=0$$. So $$x=0$$, or $$x=\sqrt2$$ or $$x=-\sqrt2$$. As you can see on the graph of $$f(x)$$, these 3 values correspond to local extrema: two global minima $$f\left(\sqrt2\right)=f\left(-\sqrt2\right)=-\dfrac{1}{2}$$ and one local maximum $$f(0)=\dfrac{1}{2}$$.
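
• Here is a short SymPy sketch (an optional check, not part of the original argument) that recovers these critical points and the corresponding values of $$f$$:
import sympy as sp

x = sp.symbols('x')
f = x**4 / 4 - x**2 + sp.Rational(1, 2)
f_prime = sp.diff(f, x)                    # x**3 - 2*x
critical_points = sp.solve(f_prime, x)     # [0, -sqrt(2), sqrt(2)]
print([(c, f.subs(x, c)) for c in critical_points])   # f(0)=1/2, f(±sqrt(2))=-1/2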

• If a function has a local extremum at a point $$x_\mathrm{A}$$ and is differentiable at that point, then $$f’(x_\mathrm{A})=0$$. However, the reverse is not always true. For example, consider $$f(x)=x^3$$. Its derivative is $$f’(x)=3x^2$$, which is equal to 0 at $$x_\mathrm{A}=0$$. Yet, this point is not an extremum, as you can see on the following diagram. It’s just a single point where the slope is 0.

• So in short, you can optimize a function by analytically working out the points at which the derivative is 0, and then investigating only these points. It’s a beautifully elegant solution, but it requires a lot of work, and it’s not always easy, or even possible. For neural networks, it’s practically impossible.
• Another option to optimize a function is to perform Gradient Descent (we will consider minimizing the function, but the process would be almost identical if we tried to maximize a function instead): start at a random point $$x_0$$, then use the function’s derivative to determine the slope at that point, and move a little bit in the downwards direction, then repeat the process until you reach a local minimum, and cross your fingers in the hope that this happens to be the global minimum.
• At each iteration, the step size is proportional to the slope, so the process naturally slows down as it approaches a local minimum. Each step is also proportional to the learning rate: a parameter of the Gradient Descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a hyperparameter).
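• Here is a minimal Gradient Descent sketch for the function $$f(x)=\dfrac{1}{4}x^4 - x^2 + \dfrac{1}{2}$$ from above (the starting point and learning rate are arbitrary illustrative choices):
def f_prime(x):
    return x**3 - 2 * x     # derivative of x**4/4 - x**2 + 1/2

x = 0.25                    # starting point (chosen arbitrarily)
learning_rate = 0.1         # hyperparameter of Gradient Descent
for step in range(100):
    x = x - learning_rate * f_prime(x)   # move a little bit downhill

print(x)   # ≈ 1.414, i.e., close to the minimum at sqrt(2)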

## Higher order derivatives

• What happens if we try to differentiate the function $$f\prime(x)$$? Well, we get the so-called second order derivative, noted $$f\prime\prime(x)$$, or $$\dfrac{\mathrm{d}^2f}{\mathrm{d}x^2}$$. If we repeat the process by differentiating $$f\prime\prime(x)$$, we get the third-order derivative $$f\prime\prime\prime(x)$$, or $$\dfrac{\mathrm{d}^3f}{\mathrm{d}x^3}$$. And we could go on to get higher order derivatives.
• What’s the intuition behind second order derivatives? Well, since the (first order) derivative represents the instantaneous rate of change of $$f$$ at each point, the second order derivative represents the instantaneous rate of change of the rate of change itself, in other words, you can think of it as the acceleration of the curve: if $$f\prime\prime(x) < 0$$, then the curve is accelerating “downwards”, if $$f\prime\prime(x) > 0$$ then the curve is accelerating “upwards”, and if $$f\prime\prime(x) = 0$$, then the curve is locally a straight line. Note that a curve could be going upwards (i.e., $$f\prime(x)>0$$) but also be accelerating downwards (i.e., $$f\prime\prime(x) < 0$$): for example, imagine the path of a stone thrown upwards, as it is being slowed down by gravity (which constantly accelerates the stone downwards).
• Deep Learning generally only uses first order derivatives, but you will sometimes run into some optimization algorithms or cost functions based on second order derivatives.

## Partial derivatives

• Up to now, we have only considered functions with a single variable $$x$$. What happens when there are multiple variables? For example, let’s start with a simple function with 2 variables: $$f(x,y)=\sin(xy)$$. If we plot this function, using $$z=f(x,y)$$, we get the following 3D graph. I also plotted some point $$\mathrm{A}$$ on the surface, along with two lines I will describe shortly.

• If you were to stand on this surface at point $$\mathrm{A}$$ and walk along the $$x$$ axis towards the right (increasing $$x$$), your path would go down quite steeply (along the dashed blue line). The slope along this axis would be negative. However, if you were to walk along the $$y$$ axis, towards the back (increasing $$y$$), then your path would almost be flat (along the solid red line), at least locally: the slope along that axis, at point $$\mathrm{A}$$, would be very slightly positive.
• As you can see, a single number is no longer sufficient to describe the slope of the function at a given point. We need one slope for the $$x$$ axis, and one slope for the $$y$$ axis. One slope for each variable. To find the slope along the $$x$$ axis, called the partial derivative of $$f$$ with regards to $$x$$, and noted $$\dfrac{\partial f}{\partial x}$$ (with curly $$\partial$$), we can differentiate $$f(x,y)$$ with regards to $$x$$ while treating all other variables (in this case just $$y$$) as constants:
$\dfrac{\partial f}{\partial x} = \underset{\epsilon \to 0}\lim\dfrac{f(x+\epsilon, y) - f(x,y)}{\epsilon}$
• If you use the derivative rules listed earlier (in this example you would just need the product rule and the chain rule), making sure to treat $$y$$ as a constant, then you will find:
$\dfrac{\partial f}{\partial x} = y\cos(xy)$
• Similarly, the partial derivative of $$f$$ with regards to $$y$$ is defined as:
$\dfrac{\partial f}{\partial y} = \underset{\epsilon \to 0}\lim\dfrac{f(x, y+\epsilon) - f(x,y)}{\epsilon}$
• All variables except for $$y$$ are treated like constants (just $$x$$ in this example). Using the derivative rules, we get:
$\dfrac{\partial f}{\partial y} = x\cos(xy)$
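• Both partial derivatives are easy to verify with SymPy (a quick sketch):
import sympy as sp

x, y = sp.symbols('x y')
f = sp.sin(x * y)
print(sp.diff(f, x))   # y*cos(x*y)
print(sp.diff(f, y))   # x*cos(x*y)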
• We now have equations to compute the slope along the $$x$$ axis and along the $$y$$ axis. But what about the other directions? If you were standing on the surface at point $$\mathrm{A}$$, you could decide to walk in any direction you choose, not just along the $$x$$ or $$y$$ axes. What would the slope be then? Shouldn’t we compute the slope along every possible direction?

#### Facts about the normal density

• If $$X \sim \mathcal{N}(\mu,\sigma^2)$$, then $$Z = \frac{X -\mu}{\sigma}$$ follows the standard normal distribution.
• If $$Z \sim \phi$$, i.e., Z is a random variable that follows the standard normal distribution, then $$X = \mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)$$.
• The PDF of a general normal distribution in terms of the PDF of a standard normal $$\phi(\cdot)$$ is,
$\frac{1}{\sigma} \phi\left(\frac{x - \mu}{\sigma}\right)$
• Approximately $$68\%$$, $$95\%$$ and $$99.7\%$$ of the normal density lies within $$1$$, $$2$$ and $$3$$ standard deviations from the mean, respectively. $$-1.28$$, $$-1.645$$, $$-1.96$$ and $$-2.33$$ are the $$10^{th}$$, $$5^{th}$$, $$2.5^{th}$$ and $$1^{st}$$ percentiles of the standard normal distribution respectively.
• By symmetry, $$1.28$$, $$1.645$$, $$1.96$$ and $$2.33$$ are the $$90^{th}$$, $$95^{th}$$, $$97.5^{th}$$ and $$99^{th}$$ percentiles of the standard normal distribution respectively.
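• These values are easy to reproduce with SciPy (a quick sketch, assuming SciPy is installed):
from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(-1))            # ≈ 0.6827, mass within 1 standard deviation of the mean
print(norm.ppf([0.10, 0.05, 0.025, 0.01]))   # ≈ [-1.28, -1.645, -1.96, -2.33]
print(norm.ppf([0.90, 0.95, 0.975, 0.99]))   # ≈ [ 1.28,  1.645,  1.96,  2.33]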

#### Other properties

• The normal distribution is symmetric and peaked around its mean (therefore the mean, median and mode are all equal).
• A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?).
• Sums of jointly normally distributed random variables are again normally distributed, even if the variables are dependent (what is the mean and variance?).
• Sample means of normally distributed random variables are again normally distributed (with what mean and variance?).
• The square of a standard normal random variable follows what is called the chi-squared distribution.
• The exponential of a normally distributed random variable follows what is called the log-normal distribution.
• As we will see later, many random variables, properly normalized, limit to a normal distribution.

### Poisson distribution

• A Poisson random variable counts the number of events occurring in a fixed interval of time or space, given that these events occur with an average rate $$\lambda$$.
• This distribution can be used to model events such as:
• The number of meteor showers in a year.
• The number of goals in a soccer match.
• The number of patients arriving in an emergency room between 10 and 11 PM.
• The number of laser photons hitting a detector in a particular time interval.
• The number of customers arriving in a store (or say, the number of page-views on a website).
• A Poisson random variable thus models a discrete distribution.
• Both the mean and variance of this distribution are $$\lambda$$.
• Note that $$\lambda$$ ranges from $$0$$ to $$\infty$$.

#### PMF

• The PMF of the Poisson distribution is given by,
$P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\text{ for }x=0,1,\ldots$

#### CDF

• The CDF of the Poisson distribution is given by,
$\frac{\Gamma(\lfloor k+1\rfloor, \lambda)}{\lfloor k\rfloor !}\text{, or }e^{-\lambda} \sum_{i=0}^{\lfloor k\rfloor} \frac{\lambda^{i}}{i!}\text{, or }Q(\lfloor k+1\rfloor, \lambda)$ $(\text{for }k \geq 0,\text{ where }\Gamma(x, y)\text{ is the upper incomplete gamma function, }\lfloor k\rfloor\text{ is the floor function, and }Q\text{ is the regularized gamma function})$

#### Use-cases for the Poisson distribution

• Modeling count data, i.e., $$\frac{\text{number of events}}{\text{time}}$$ data. Examples include radioactive decay, survival data, contingency tables etc.
• Approximating binomials when $$n$$ is large and $$p$$ is small.

#### Poisson derivation

• Let $$h$$ be very small.
• Now, if we assume that…
• Prob. of an event in an interval of length $$h$$ is $$\lambda h$$ while the prob. of more than one event is negligible.
• Whether or not an event occurs in one small interval does not impact whether or not an event occurs in another small interval
• … then, the number of events per unit time is Poisson with mean $$\lambda$$.

#### Rates and Poisson random variables

• Poisson random variables are used to model rates.
• $$X \sim \operatorname{Poisson}(\lambda t)$$ where,
• $$\lambda = E\left[\frac{X}{t}\right]$$ is the expected count per unit of time.
• $$t$$ is the total monitoring time.

#### Poisson approximation to the binomial

• A binomial random variable is the sum of $$n$$ independent Bernoulli random variables with parameter $$p$$. It is frequently used to model the number of successes in a specified number of identical binary experiments, such as the number of heads in five coin tosses.
• When $$n$$ is large and $$p$$ is small (say, with $$np < 10$$), the Poisson distribution is an accurate approximation to the binomial distribution.
• Formally, if $$X \sim \mbox{Binomial}(n, p)$$ with $$n$$ large and $$p$$ small, then $$X$$ is approximately $$\mbox{Poisson}(\lambda)$$ with $$\lambda = n p$$.

#### Example

• The number of people that show up at a bus stop is Poisson with a mean of $$2.5$$ per hour.
• If watching the bus stop for 4 hours, what is the probability that $$3$$ or fewer people show up for the whole time?
ppois(3, lambda = 2.5 * 4) # Returns 0.01034


#### Example: Poisson approximation to the binomial

• If we flip a coin with success probability $$0.01$$ five hundred times, what’s the probability of $$2$$ or fewer successes?
pbinom(2, size=500, prob=0.01) # Returns 0.1234
ppois(2, lambda=500 * 0.01) # Returns 0.1247
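
• The same computations can be reproduced in Python with SciPy (a sketch equivalent to the R snippets above):
from scipy.stats import binom, poisson

print(poisson.cdf(3, mu=2.5 * 4))     # ≈ 0.01034 (bus stop example)
print(binom.cdf(2, n=500, p=0.01))    # ≈ 0.1234 (exact binomial)
print(poisson.cdf(2, mu=500 * 0.01))  # ≈ 0.1247 (Poisson approximation)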


### Uniform distribution

• The uniform distribution (or rectangular distribution) is a continuous distribution such that all intervals of equal length on the distribution’s support have equal probability. For example, this distribution might be used to model people’s full birth dates, where it is assumed that all times in the calendar year are equally likely.
• The distribution describes an experiment where there is an arbitrary outcome that lies between certain bounds.
• The bounds are defined by the parameters, $$a$$ and $$b$$, which are the minimum and maximum values. The interval can either be closed (e.g., $$[a, b]$$) or open (e.g., $$(a, b)$$).
• Therefore, the distribution is often abbreviated $$U(a, b)$$, where $$U$$ stands for uniform distribution.

#### PDF

• The PDF of the continuous uniform distribution is given by,
$f(x)=\left\{\begin{array}{ll} \frac{1}{b-a} & \text { for } a \leq x \leq b \\ 0 & \text { for } x<a \text { or } x>b \end{array}\right.$

#### CDF

• The CDF of the continuous uniform distribution is given by,
$F(x)=\left\{\begin{array}{ll}0 & \text { for } x<a \\ \frac{x-a}{b-a} & \text { for } x \in[a, b] \\ 1 & \text { for } x>b\end{array}\right.$

### Geometric distribution

• A geometric random variable counts the number of trials that are required to observe a single success, where each trial is independent and has success probability $$p$$. A geometric random variable thus models a discrete distribution.
• For example, this distribution can be used to model the number of times a die must be rolled in order for a six to be observed.

### Student’s t-distribution

• A Student’s t-distribution (or simply the t-distribution), is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

### Chi-squared distribution

• A chi-squared random variable with $$k$$ degrees of freedom is the sum of $$k$$ independent and identically distributed squared standard normal random variables. A chi-squared random variable thus models a continuous distribution.
• It is often used in hypothesis testing and in the construction of confidence intervals.

### Exponential distribution

• The exponential distribution is the continuous analogue of the geometric distribution. It is often used to model waiting times.

### F distribution

• The F-distribution (also known as the Fisher–Snedecor distribution), is a continuous distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.

### Gamma distribution

• The gamma distribution is a general family of continuous probability distributions. The exponential and chi-squared distributions are special cases of the gamma distribution.

### Beta distribution

• The beta distribution is a general family of continuous probability distributions bound between $$0$$ and $$1$$. The beta distribution is frequently used as a conjugate prior distribution in Bayesian statistics.

## Frequentist inference

• Frequentist inference is the process of determining properties of an underlying distribution via the observation of data.

### Point estimation

• One of the main goals of statistics is to estimate unknown parameters. To approximate these parameters, we choose an estimator, which is simply any function of randomly sampled observations. To illustrate this idea, let’s consider the problem of estimating the value of $$\pi$$. To do so, we can uniformly drop samples on a square containing an inscribed circle. Notice that the value of $$\pi$$ can be expressed as a ratio of the areas,
$\begin{array}{l} S_{\text {circle}}=\pi r^{2} \\ S_{\text {square}}=4 r^{2} \end{array} \Longrightarrow \pi=4 \frac{S_{\text {circle}}}{S_{\text {square}}}$
• We can estimate this ratio with our samples. Let $$m$$ be the number of samples within our circle and $$n$$ the total number of samples dropped. We define our estimator $$\hat{\pi}$$ as:
$\hat{\pi}=4 \frac{m}{n}$
• It can be shown that this estimator has the desirable properties of being unbiased and consistent.
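• Here is a small NumPy sketch of this estimator (the number of samples is an arbitrary choice):
import numpy as np

rng = np.random.default_rng(42)
n = 100_000                                 # total number of samples dropped
points = rng.uniform(-1, 1, size=(n, 2))    # uniform samples on the square enclosing the unit circle
m = np.sum(points[:, 0]**2 + points[:, 1]**2 <= 1)   # number of samples inside the circle
pi_hat = 4 * m / n
print(pi_hat)   # ≈ 3.14, and it gets closer to pi as n grows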

### Confidence interval

• In contrast to point estimators, confidence intervals estimate a parameter by specifying a range of possible values. Such an interval is associated with a confidence level, which is the probability that the procedure used to generate the interval will produce an interval containing the true parameter.

### The Bootstrap

• Much of frequentist inference centers on the use of “good” estimators. The precise distributions of these estimators, however, can often be difficult to derive analytically. The computational technique known as the Bootstrap provides a convenient way to estimate properties of an estimator via resampling.
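• As a minimal sketch (with made-up data), here is how the Bootstrap can estimate the standard error of a sample mean:
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=50)   # pretend this is the observed data

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)   # resample with replacement
    boot_means[i] = resample.mean()

print(boot_means.std())   # bootstrap estimate of the standard error of the sample mean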

## Bayesian inference

• Bayesian inference techniques specify how one should update one’s beliefs upon observing data.

### Bayes’ Theorem

• Suppose that on your most recent visit to the doctor’s office, you decide to get tested for a rare disease. If you are unlucky enough to receive a positive result, the logical next question is, “Given the test result, what is the probability that I actually have this disease?” (Medical tests are, after all, not perfectly accurate.) Bayes’ Theorem tells us exactly how to compute this probability:
$P(\text{Disease}\mid +)=\frac{P(+\mid \text{Disease}) \, P(\text{Disease})}{P(+)}$
• As the equation indicates, the posterior probability of having the disease given that the test was positive depends on the prior probability of the disease $$P(\text{Disease})$$. Think of this as the incidence of the disease in the general population.
• The posterior probability also depends on the test accuracy: How often does the test correctly report a negative result for a healthy patient, and how often does it report a positive result for someone with the disease?
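• As a concrete illustration (with made-up numbers): suppose the disease affects 1% of the population, the test detects it 99% of the time, and it returns a false positive for 5% of healthy patients:
p_disease = 0.01              # prior: incidence of the disease (assumed for illustration)
p_pos_given_disease = 0.99    # sensitivity of the test (assumed)
p_pos_given_healthy = 0.05    # false positive rate (assumed)

# total probability of a positive result (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)   # ≈ 0.167: only about a 17% chance of disease despite the positive test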

### Likelihood Function

• In statistics, the likelihood function has a very precise definition:
$L(\theta \mid x)=P(x \mid \theta)$
• The concept of likelihood plays a fundamental role in both Bayesian and frequentist statistics. To read more, refer to the section on likelihood vs. probability in our CS229 notes.

### Prior to Posterior

• At the core of Bayesian statistics is the idea that prior beliefs should be updated as new data is acquired. Consider a possibly biased coin that comes up heads with probability $$p$$ (a value that would be unknown in practice).
• As we acquire data in the form of coin tosses, we update the posterior distribution on $$p$$, which represents our best guess about the likely values for the bias of the coin. This updated distribution then serves as the prior for future coin tosses.
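• A minimal sketch of this prior-to-posterior update for coin tosses, using a Beta prior on $$p$$ (the Beta distribution introduced earlier is the conjugate prior here; the prior parameters and data below are arbitrary):
from scipy.stats import beta

a, b = 1, 1             # Beta(1, 1) prior, i.e., uniform over p
heads, tails = 7, 3     # observed data: 10 tosses, 7 heads

# conjugate update: the posterior is Beta(a + heads, b + tails)
posterior = beta(a + heads, b + tails)
print(posterior.mean())            # ≈ 0.667, posterior estimate of the coin's bias
print(posterior.interval(0.95))    # 95% credible interval for p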

## Regression Analysis

• Linear regression is an approach for modeling the linear relationship between two variables.

### Ordinary Least Squares

• The ordinary least squares (OLS) approach to regression allows us to estimate the parameters of a linear model.
• The goal of this method is to determine the linear model that minimizes the sum of the squared errors between the observations in a dataset and those predicted by the model.
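• Here is a small NumPy sketch of OLS on made-up data, using the least-squares solver to recover the slope and intercept:
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=100)   # noisy line: true slope 2.5, true intercept 1.0

X = np.column_stack([np.ones_like(x), x])              # design matrix with an intercept column
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # ≈ [1.0, 2.5]: estimated intercept and slope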

### Correlation

• Correlation is a measure of the linear relationship between two variables. It is defined for a sample as follows, and takes values between $$-1$$ and $$+1$$ inclusive:
$r=\frac{s_{x y}}{\sqrt{s_{x x}} \sqrt{s_{y y}}}$
• $$s_{x y}, s_{x x}, s_{y y}$$ are defined as:
$\begin{aligned} s_{x y} &=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) \\ s_{x x} &=\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} \\ s_{y y} &=\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \end{aligned}$
• It can also be understood as the cosine of the angle between the two centered data vectors, i.e., the vectors whose components are $$x_{i}-\bar{x}$$ and $$y_{i}-\bar{y}$$.
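• These quantities are straightforward to compute with NumPy; np.corrcoef gives the same result as the formula above (a quick sketch with made-up data):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean())**2)
s_yy = np.sum((y - y.mean())**2)
print(s_xy / (np.sqrt(s_xx) * np.sqrt(s_yy)))   # correlation via the formula above
print(np.corrcoef(x, y)[0, 1])                  # same value via NumPy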

### Analysis of Variance

• Analysis of Variance (ANOVA) is a statistical method for testing whether groups of data have the same mean. ANOVA generalizes the t-test to two or more groups by comparing the sum of square error within and between groups.
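• A one-way ANOVA can be run with SciPy's f_oneway (a quick sketch with made-up groups):
from scipy.stats import f_oneway

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7]
group_c = [5.0, 5.1, 4.8, 5.2, 4.9]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value suggests the group means are not all equal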

## Trigonometric ratios

| $$\text{Angle}^{\circ}$$ | $$0^{\circ}$$ | $$30^{\circ}$$ | $$45^{\circ}$$ | $$60^{\circ}$$ | $$90^{\circ}$$ |
| --- | --- | --- | --- | --- | --- |
| $$\text{Angle}^{c}$$ | $$0^{c}$$ | $${\pi/6}^{c}$$ | $${\pi/4}^{c}$$ | $${\pi/3}^{c}$$ | $${\pi/2}^{c}$$ |
| $$\sin \theta$$ | $$0$$ | $$\frac{1}{2}$$ | $$\frac{1}{\sqrt{2}}$$ | $$\frac{\sqrt{3}}{2}$$ | $$1$$ |
| $$\cos \theta$$ | $$1$$ | $$\frac{\sqrt{3}}{2}$$ | $$\frac{1}{\sqrt{2}}$$ | $$\frac{1}{2}$$ | $$0$$ |
| $$\tan \theta$$ | $$0$$ | $$\frac{1}{\sqrt{3}}$$ | $$1$$ | $$\sqrt{3}$$ | $$\text{N/A}$$ |
| $$\operatorname{cosec} \theta$$ | $$\text{N/A}$$ | $$2$$ | $$\sqrt{2}$$ | $$\frac{2}{\sqrt{3}}$$ | $$1$$ |
| $$\sec \theta$$ | $$1$$ | $$\frac{2}{\sqrt{3}}$$ | $$\sqrt{2}$$ | $$2$$ | $$\text{N/A}$$ |
| $$\cot \theta$$ | $$\text{N/A}$$ | $$\sqrt{3}$$ | $$1$$ | $$\frac{1}{\sqrt{3}}$$ | $$0$$ |
• $$\text{N/A}$$ = not defined.

## Graphical view of sin and cos

• Shown below is a graphical view of how $$\cos (\theta)$$ and $$\sin (\theta)$$ vary as the angle goes from $$0^{\circ}$$ to $$360^{\circ}$$ (or equivalently, $$0^{c}$$ to $$2\pi^{c}$$).
• Note that the below diagram shows a unit circle (with radius = $$1$$).

## Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledMathTutorial,
title   = {Math Tutorial},
author  = {Chadha, Aman},
journal = {Distilled AI},
year    = {2020},
note    = {\url{https://aman.ai}}
}