Deep Learning

Stanford CS 230

Matt Deitke

Last updated: February 15, 2020

Acknowledgements

This set of notes follows the lectures from Stanford's graduate-level course CS230: Deep Learning. The course was taught by Andrew Ng and Kian Katanforoosh. The course website is cs230.stanford.edu.


Contents

  • 1 Introduction
    • 1.1 Deep learning
      • 1.1.1 What is a neural network?
      • 1.1.2 Supervised learning with neural networks
    • 1.2 Logistic regression as a neural network
      • 1.2.1 Notation
      • 1.2.2 Binary classification
      • 1.2.3 Logistic Regression
      • 1.2.4 Logistic regression cost function
      • 1.2.5 Gradient descent
      • 1.2.6 Computation graphs
      • 1.2.7 Derivatives with a computational graph
      • 1.2.8 Logistic regression gradient descent
      • 1.2.9 Gradient descent on m examples
    • 1.3 Python and vectorization
      • 1.3.1 Vectorization
      • 1.3.2 Vectorizing logistic regression
      • 1.3.3 Vectorizing logistic regression's gradient computation
      • 1.3.4 Broadcasting example
      • 1.3.5 A note on NumPy vectors
  • 2 Neural networks
    • 2.1 Deep learning intuition
      • 2.1.1 Day'n'night classification
      • 2.1.2 Face verification
      • 2.1.3 Face recognition
      • 2.1.4 Art generation (neural style transfer)
      • 2.1.5 Trigger word detection
    • 2.2 Shallow neural networks
      • 2.2.1 Neural networks overview
      • 2.2.2 Neural network representations
      • 2.2.3 Computing a neural network's output
      • 2.2.4 Vectorizing across multiple examples
      • 2.2.5 Activation functions
      • 2.2.6 Why use non-linear activation functions?
      • 2.2.7 Derivatives of activation functions
      • 2.2.8 Gradient descent for neural networks
      • 2.2.9 Random Initialization
    • 2.3 Deep neural networks
      • 2.3.1 Forward propagation in a deep network
      • 2.3.2 Getting the matrix dimensions right
      • 2.3.3 Why deep representations?
      • 2.3.4 Forward and backward propagation
      • 2.3.5 Parameters vs hyperparameters
      • 2.3.6 Connection to the brain
  • 3 Optimizing and structuring a neural network
    • 3.1 Full-cycle deep learning projects
    • 3.2 Setting up a machine learning application
      • 3.2.1 Training, development, and testing sets
      • 3.2.2 Bias and variance
      • 3.2.3 Combatting bias and variance
    • 3.3 Regularizing a neural network
      • 3.3.1 Regularization
      • 3.3.2 Why regularization reduces overfitting
      • 3.3.3 Dropout regularization
      • 3.3.4 Other regularization methods
    • 3.4 Setting up an optimization problem
      • 3.4.1 Normalizing inputs
      • 3.4.2 Data vanishing and exploding gradients
      • 3.4.3 Weight initialization for deep neural networks
      • 3.4.4 Gradient numerical approximations
      • 3.4.5 Gradient checking
    • 3.5 Optimization algorithms
      • 3.5.1 Mini-batch gradient descent
      • 3.5.2 Exponentially weighted averages
      • 3.5.3 Bias correction for exponentially weighted averages
      • 3.5.4 Gradient descent with momentum
      • 3.5.5 RMSprop
      • 3.5.6 Adam optimization algorithm
      • 3.5.7 Learning rate decay
      • 3.5.8 The problem with local optima
    • 3.6 Hyperparameter Tuning
      • 3.6.1 Tuning process
      • 3.6.2 Using an appropriate scale to pick hyperparameters
      • 3.6.3 Hyperparameter tuning in practice: pandas vs caviar
    • 3.7 Batch normalization
      • 3.7.1 Normalizing activations in a network
      • 3.7.2 Fitting batch norm in a neural network
      • 3.7.3 Batch norm at test time
    • 3.8 Multi-class classification
      • 3.8.1 Softmax regression
      • 3.8.2 Training a softmax classifier
  • 4 Applied deep learning
    • 4.1 Orthogonalization
    • 4.2 Setting up goals
      • 4.2.1 Single number evaluation metric
      • 4.2.2 Train/dev/test distributions
    • 4.3 Comparing to human-level performance
      • 4.3.1 Avoidable bias
      • 4.3.2 Understanding human-level performance
      • 4.3.3 Surpassing human-level performance
    • 4.4 Error analysis
      • 4.4.1 Cleaning up incorrectly labeled data
    • 4.5 Mismatched training and dev set
      • 4.5.1 Bias and variance with mismatched data distributions
      • 4.5.2 Addressing data mismatch
    • 4.6 Learning from multiple tasks
      • 4.6.1 Transfer learning
      • 4.6.2 Multi-task learning
  • 5 Convolutional neural networks
    • 5.1 Edge detection
    • 5.2 Padding
    • 5.3 Strided convolutions
    • 5.4 Cross-correlation vs convolution
    • 5.5 Convolutions over volume
    • 5.6 One layer convolution network
    • 5.7 Pooling layers
    • 5.8 Why we use convolutions
    • 5.9 Classic networks
      • 5.9.1 LeNet-5
      • 5.9.2 AlexNet
      • 5.9.3 VGG-16
      • 5.9.4 ResNet
      • 5.9.5 1 × 1 convolution
      • 5.9.6 Inception network
    • 5.10 Competitions and benchmarks
  • 6 Detection algorithms
    • 6.1 Object localization
    • 6.2 Landmark detection
    • 6.3 Object detection
    • 6.4 Sliding windows with convolution
    • 6.5 Bounding box predictions
    • 6.6 Intersection over Union (IoU)
    • 6.7 Non-max suppression
    • 6.8 Anchor boxes
    • 6.9 YOLO algorithm
  • 7 Sequence models
    • 7.1 Recurrent neural networks
      • 7.1.1 Backpropagation through time
      • 7.1.2 Variational RNNs
      • 7.1.3 Language modeling
      • 7.1.4 Sampling novel sequences
      • 7.1.5 Vanishing gradients with RNNs
      • 7.1.6 GRU: Gated recurrent unit
      • 7.1.7 LSTM: Long short-term memory
      • 7.1.8 BRNNs: Bidirectional RNNs
      • 7.1.9 Deep RNNs
    • 7.2 Word embedding
    • 7.3 Word representation
      • 7.3.1 Using word embeddings
      • 7.3.2 Properties of word embeddings

Chapter 1

Introduction

The gist of deep learning, and the algorithms behind it, has been around for decades. However, as we started to add data to neural networks, they started to perform much better than traditional machine learning algorithms. With advances in GPU computing and the amount of data we have available, training larger neural networks is easier than ever before. With more data, larger neural networks have been shown to outperform all other machine learning algorithms.

Figure 1.0.1: With more data, neural networks have performed better than traditional machine learning. (Plot of performance vs. amount of data: large neural networks outperform medium and small neural networks, which in turn outperform traditional ML, with the gap widening as data grows.)

Artificial intelligence can be broken down into a few subfields that include deep learning (DL), machine learning (ML), probabilistic graphical models (PGM), planning agents, search algorithms, knowledge representation (KR), and game theory. The only subfields that have dramatically improved in performance are deep learning and machine learning. We can see the rise of artificial intelligence in many industries today; however, there is still a long way to go. Simply using neural networks does not make a company an artificial intelligence company. The best AI companies are good at strategic data acquisition, putting data together, spotting automation opportunities, and writing job descriptions around technologies at the forefront of the field.

Figure 1.0.2: The performance of deep learning and machine learning algorithms has been exploding in the last few years, in comparison to other branches of artificial intelligence. (Six panels of performance vs. time: (a) DL and ML, (b) PGM, (c) planning agents, (d) search algorithms, (e) knowledge representation, (f) game theory.)

1.1 Deep learning

1.1.1 What is a neural network?

The aim of a neural network is to learn representations that best predict the output y, given a set of features as the input x.

Figure 1.1.1: Simplest possible neural network, where a single "neuron" maps the size of a home (x) to a prediction of its price (y).

Figure 1.1.1 represents the simplest possible neural network. The neuron is the part of the neural network that tries to learn a function that maps x → y. Here, we only have one neuron, which is why it is the simplest case. We can make more complicated neural networks by stacking neurons. Consider the case where we not only have the size but also the number of bedrooms, zip code, and wealth. We can represent the extra inputs inside of a neural network as shown in Figure 1.1.2.

Neural networks work well because we only need to feed the algorithm supervised data, without specifying what all the intermediate values may be, as we did in Figure 1.1.2 with family size, walkability, and schooling. Instead of connecting together only a few inputs, we can connect all of the inputs together. Layers with all inputs connected to the output are known as fully-connected layers. The general structure of a neural network with fully-connected layers can be seen in Figure 1.1.3.

Figure 1.1.2: In a neural network with more neurons, the intermediate connections between inputs may represent their own values (inputs: size, bedrooms, zip code, wealth; intermediate values: family size, walkability, schooling; output: price y).

Figure 1.1.3: Neural network with fully-connected layers (input layer x_1, ..., x_4; one hidden layer; output layer y).

1.1.2 Supervised learning with neural networks

Almost all of the hype around machine learning has been centered around supervised learning. A supervised learning algorithm takes in a set of features and outputs a number or a set of numbers. Some examples can be seen in Table 1.1.

Input (x)            Output (y)                Application
Home features        Price                     Real estate
Ad, user info        Click on ad? (0/1)        Online advertising
Image                Object (1, ..., 1000)     Photo tagging
Audio                Text transcript           Speech recognition
English              Chinese                   Machine translation
Image, radar info    Position of other cars    Autonomous driving

Table 1.1: Examples of supervised learning

As we have already seen, we can change the connections between layers to form different types of neural networks. Indeed, changing the structure of layers has led to many of the breakthroughs in deep learning, and we have formulated tasks for which it is beneficial not to use a neural network with fully-connected layers. For example, while for real estate and online advertising we may use a neural network with fully-connected layers, photo tagging may use convolutional neural networks (CNNs), sequenced data may use recurrent neural networks (RNNs), and for something more complicated like autonomous driving, we may use some custom hybrid neural network. Figure 1.1.4 shows the basic structure of different neural networks, although we will go more in-depth into how each works in this text.

Figure 1.1.4: Different types of neural network architectures

Supervised learning can be done on both structured and unstructured data. Structured data may come in the form of database entries, while unstructured data may be audio, image clips, or text. Historically, unstructured data like image recognition has often been a harder problem for computers to solve, while humans have little difficulty solving these types of problems. Neural networks have given computers a lot of help when interpreting unstructured data.

Figure 1.1.5: Structured vs unstructured data

1.2 Logistic regression as a neural network

1.2.1 Notation

Each training set will be composed of m training examples. We may denote m_train to be the number of training examples in the training set, and m_test to be the number of examples in the testing set. The ith input instance will be defined as (x^(i), y^(i)), where x^(i) ∈ R^{n_x} and each y^(i) ∈ {0, 1}. To make notation more concise, we will use X to denote the matrix formed by

X = [x^(1)  x^(2)  ...  x^(m)],    (1.2.1)

where X ∈ R^{n_x × m}. Similarly, we define Y as

Y = [y^(1)  y^(2)  ...  y^(m)],    (1.2.2)

where Y ∈ R^{1 × m}.

1.2.2 Binary classification

A binary classifier takes some input and classifies that input into one of two categories, typically yes (1) or no (0). For example, an image may be classified as a cat or non-cat. We will use y to denote the output as either a 0 or 1.

Figure 1.2.1: Colored images can be represented with the addition of red, green, and blue channels.

Each color can be represented using the colors red, green, and blue. Each pixel on a colored image can be represented as a tuple, storing the amount of red, green, and blue inside of each pixel. For images, we say that the red, green, and blue channels correspond to the amount of that color in each pixel, as shown in Figure 1.2.1. For an image of size n × m, we will define our input as a single column vector x that stacks the red values on the blue values on the green values, as shown:

x = [red values, blue values, green values]^T.    (1.2.3)

We will use n_x to denote the input size of our vector. In this case, our input size is

n_x = 3nm,    (1.2.4)

since each channel is size (n × m) and there are 3 channels.
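To make this concrete, here is a minimal NumPy sketch of flattening an image into the column vector x (the 64 × 64 size is just an assumption for illustration; NumPy's default flattening interleaves channels rather than stacking them, which is fine as long as the ordering is consistent):

    import numpy as np

    # Hypothetical n x m RGB image with pixel values in [0, 255].
    n, m = 64, 64
    image = np.random.randint(0, 256, size=(n, m, 3))

    # Flatten into a single (n_x, 1) column vector, n_x = 3nm.
    x = image.reshape(-1, 1)
    assert x.shape == (3 * n * m, 1)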

1.2.3 Logistic Regression

Given an input feature x ∈ R^{n_x}, such as the pixels of an image, we would like to find an algorithm that makes a prediction ŷ, where

ŷ = P(y = 1 | x).    (1.2.5)

Inside of our algorithm we will define weights w ∈ R^{n_x} and a bias term b ∈ R, where our output is

ŷ = σ(w^T x + b).    (1.2.6)

The sigmoid function σ is used to transform the input so that 0 ≤ ŷ ≤ 1, where

σ(z) = 1 / (1 + e^{−z}).    (1.2.7)

We will denote the inner part of the sigmoid function as

z = w^T x + b.    (1.2.8)
Figure 1.2.2: Sigmoid function σ(z), which rises from 0 to 1 with σ(0) = 0.5.
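As a small sketch (assuming NumPy, with w and x both n_x × 1 column vectors), Equations (1.2.6)-(1.2.8) translate directly:

    import numpy as np

    def sigmoid(z):
        # Equation (1.2.7): maps any real number into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, b, x):
        # Equations (1.2.6) and (1.2.8): y_hat = sigma(w^T x + b).
        z = np.dot(w.T, x) + b
        return sigmoid(z)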

1.2.4 Logistic regression cost function

Logistic regression is a supervised learning algorithm where we train on the training examples {(x^(1), y^(1)), ..., (x^(m), y^(m))} and would like ŷ^(i) ≈ y^(i). While many machine learning algorithms use a squared error cost function, logistic regression does not work well when using squared error because many local optima arise. Instead, we will use the loss function

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ)).    (1.2.9)

Intuitively, consider the following two scenarios. When y = 1, we will end up with

L(ŷ, 1) = −log ŷ,    (1.2.10)

which is shown in Figure 1.2.3a. The second scenario occurs when y = 0. In that case, we will have

L(ŷ, 0) = −log(1 − ŷ),    (1.2.11)

which is shown in Figure 1.2.3b. When a small ŷ is inputted, we will have a small loss, and when ŷ approaches 1, our estimate will be way off, producing a large cost.

The loss function is a measure of how well the classifier is doing on a single example. The cost function is defined as the average loss

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)).    (1.2.12)
Figure 1.2.3: A binary classifier's loss function based on whether y = 1 or y = 0: (a) −log ŷ when y = 1; (b) −log(1 − ŷ) when y = 0.
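A minimal sketch of Equations (1.2.9) and (1.2.12), assuming the predictions A and labels Y are 1 × m NumPy arrays; the small eps guard against log(0) is a practical addition, not part of the notes:

    import numpy as np

    def cost(A, Y, eps=1e-12):
        # Equation (1.2.9) applied elementwise, then averaged (1.2.12).
        losses = -(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps))
        return losses.mean()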

1.2.5 Gradient descent

For many logistic regression problems, our cost function will look bowl-shaped, which is known as a convex function, shown in Figure 1.2.4. With convex functions, there is a clear minimum of the function, and we would like to find the values of w and b that will put us at the minimum.

Figure 1.2.4: An example of what our cost function might look like with the inputs w and b.

For logistic regression, we generally start by initializing our weights to be zero. Gradient descent steps in the direction of steepest descent. If we start at some point that is not the minimum, we will move in the direction of the minimum, until we converge near the minimum. We can define the gradient descent algorithm as

w = w − α ∂J(w, b)/∂w    (1.2.13)
b = b − α ∂J(w, b)/∂b,    (1.2.14)

where we first calculate both of the partial derivative terms and then store the results in the variables on the left-hand side.
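A sketch of the update loop; gradients is a hypothetical helper returning the two partial derivatives (its logistic-regression form is derived in Sections 1.2.8-1.2.9):

    def gradient_descent(w, b, X, Y, gradients, alpha=0.01, iterations=1000):
        for _ in range(iterations):
            dw, db = gradients(w, b, X, Y)   # partial derivatives of J
            w = w - alpha * dw               # Equation (1.2.13)
            b = b - alpha * db               # Equation (1.2.14)
        return w, b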

1.2.6 Computation graphs

Consider the function

y(a, b, c) = 3(a + bc).    (1.2.15)

Computing the output requires three steps: first compute bc, then add that result to a, and then multiply the sum by 3. We can represent this function as the computational graph shown in Figure 1.2.5. Each forward step in the computation graph produces a number that gets closer to the final result.

Figure 1.2.5: Computation graph for 3(a + bc): inputs a, b, c feed u = bc, then v = a + u, then multiplication by 3.

1.2.7 Derivatives with a computational graph

Figure 1.2.6: Computation graph for 3(a + bc) with values a = 5, b = 3, c = 2, giving u = bc = 6, v = a + u = 11, and J = 3v = 33.

Consider the updated computational graph in Figure 1.2.6. Now, suppose that we want to find how J changes with respect to v, that is, ∂J/∂v at v = 11. Since we know that J = 3v, we can take the derivative of the function, giving us

∂J/∂v = 3,    (1.2.16)

which in this case does not depend on the input at all. Now if we wish to find ∂J/∂a, we can use the chain rule to write

∂J/∂a = (∂J/∂v) · (∂v/∂a).    (1.2.17)

Since we already calculated the value of ∂J/∂v = 3, we only need to find ∂v/∂a. Since v = a + u, differentiating both sides with respect to a will give us

∂v/∂a = 1.    (1.2.18)

Now, using Equation 1.2.17,

∂J/∂a = 3.    (1.2.19)

To find ∂J/∂u, we would use a process similar to how we found ∂J/∂a, and we will again end up with ∂J/∂u = 3. To find ∂J/∂b, we need

∂J/∂b = (∂J/∂v) · (∂v/∂u) · (∂u/∂b)    (1.2.20)
      = 3 · (∂u/∂b).    (1.2.21)

Since u = bc, differentiating both sides with respect to b gives us ∂u/∂b = c, where c in this case is 2, so

∂J/∂b = 6.    (1.2.22)

A similar process would be carried out to find ∂J/∂c = 9.
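The hand-derived values can be sanity-checked numerically with finite differences; a sketch follows, using the values from Figure 1.2.6:

    def J(a, b, c):
        return 3 * (a + b * c)

    a, b, c, h = 5.0, 3.0, 2.0, 1e-6
    print((J(a + h, b, c) - J(a, b, c)) / h)   # ~3 = dJ/da
    print((J(a, b + h, c) - J(a, b, c)) / h)   # ~6 = dJ/db
    print((J(a, b, c + h) - J(a, b, c)) / h)   # ~9 = dJ/dc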

1.2.8 Logistic regression gradient descent

With our logistic regression computational graph defined in Figure 1.2.7, we can again work backward, calculating the changes in L with respect to each of the variables. Differentiating both sides of our loss function with respect to a gives us

∂L(a, y)/∂a = −y/a + (1 − y)/(1 − a).    (1.2.23)

We also find that

∂a/∂z = a(1 − a),    (1.2.24)

so

∂L/∂z = a − y.    (1.2.25)
Figure 1.2.7: Computation graph for logistic regression: inputs x_1, w_1, x_2, w_2, b feed z = w_1 x_1 + w_2 x_2 + b, then a = σ(z), then L(a, y).

Now, backpropagating towards the inputs (writing L_v as shorthand for ∂L/∂v), we have

L_{x_1} = (a − y) · w_1    (1.2.26)
L_{x_2} = (a − y) · w_2    (1.2.27)
L_{w_1} = (a − y) · x_1    (1.2.28)
L_{w_2} = (a − y) · x_2    (1.2.29)
L_b = a − y.    (1.2.30)

In general, we can show

L_{x_n} = (a − y) · w_n    (1.2.31)
L_{w_n} = (a − y) · x_n.    (1.2.32)

1.2.9 Gradient descent on m examples

Recall that when we have m input instances to train on,

J(w, b) = (1/m) Σ_{i=1}^{m} L(a^(i), y^(i)),    (1.2.33)

where a^(i) = ŷ^(i) = σ(w^T x^(i) + b). If we differentiate J with respect to w_1, we will have

∂J(w, b)/∂w_1 = (1/m) Σ_{i=1}^{m} ∂L(a^(i), y^(i))/∂w_1.    (1.2.34)

In Equation 1.2.28, we calculated L_{w_1} to be (a − y) · x_1, so

∂J(w, b)/∂w_1 = (1/m) Σ_{i=1}^{m} (a^(i) − y^(i)) · x_1^(i).    (1.2.35)
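A sketch of Equations (1.2.33)-(1.2.35) with an explicit loop over the m examples (X is n_x × m, Y is 1 × m); the next section removes this loop via vectorization:

    import numpy as np

    def gradients_loop(w, b, X, Y):
        n_x, m = X.shape
        dw, db = np.zeros((n_x, 1)), 0.0
        for i in range(m):
            x_i = X[:, i:i+1]                                    # i-th column
            a_i = 1.0 / (1.0 + np.exp(-(np.dot(w.T, x_i) + b)))
            dz_i = a_i.item() - Y[0, i]                          # Equation (1.2.25)
            dw += dz_i * x_i                                     # Equation (1.2.28)
            db += dz_i                                           # Equation (1.2.30)
        return dw / m, db / m                                    # averages, Equation (1.2.35)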

1.3 Python and vectorization

1.3.1 Vectorization

The aim of vectorization is to remove explicit for-loops in code. With deep learning, datasets can be massive, and it is vital to understand how vectorization works in order to improve algorithmic efficiency. Recall that we set z = w^T x + b. One implementation would be to compute the sum

z = Σ_{i=1}^{n_x} w_i x_i + b    (1.3.1)

with an explicit loop, which would end up being very slow. In contrast, we can use NumPy, whose vectorized operations are implemented in C, to speed up the calculation, where the command will be

z = np.dot(w, x) + b.    (1.3.2)

Using built-in NumPy functions enables parallelism much better than writing out explicit for-loops. When using neural networks, whenever possible, avoid explicit for-loops. A few more examples of calculations that can be vectorized include the matrix-vector product Ax, which is written as

Ax = np.dot(A, x).    (1.3.3)

To apply some operation, such as exponentiating, to each item in a vector v, write

v = np.exp(v).    (1.3.4)
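A quick sketch of the speed difference (the exact numbers depend on the machine):

    import time
    import numpy as np

    n = 1_000_000
    w, x = np.random.randn(n), np.random.randn(n)

    start = time.time()
    z_loop = sum(w[i] * x[i] for i in range(n))   # explicit loop
    print("loop:", time.time() - start)

    start = time.time()
    z_vec = np.dot(w, x)                          # vectorized
    print("dot: ", time.time() - start)           # typically orders of magnitude faster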

1.3.2 Vectorizing logistic regression

For logistic regression, we need to find a^(i) = σ(z^(i)), m times, where z^(i) = w^T x^(i) + b. Instead of running explicit for-loops over each instance, we would like a way to vectorize the process of finding every value of a at once. Recall that we set

X = [x^(1)  x^(2)  ...  x^(m)],    (1.3.5)

where X ∈ R^{n_x × m}. We also know that there is a weight associated with each input feature, so w^T ∈ R^{1 × n_x}. Carrying out the product gives

w^T X = [w^T x^(1)  w^T x^(2)  ···  w^T x^(m)].    (1.3.6)

Broadcasting b across the columns, taking the sum of w^T X and b gives us

w^T X + b = [w^T x^(1) + b  w^T x^(2) + b  ···  w^T x^(m) + b].    (1.3.7)

To implement this in NumPy, call

Z = np.dot(w.T, X) + b,    (1.3.8)

where b ∈ R and Python will expand it to be a 1 × m matrix. The concept of expanding a number into a matrix in Python is known as broadcasting.

Similar to how X stores the instances x^(i), we will define the matrix

A = [a^(1)  a^(2)  ···  a^(m)].    (1.3.9)

With Z being a 1 × m matrix, we can apply the sigmoid function to each element of Z by calling

A = 1 / (1 + np.exp(−Z)).    (1.3.10)

1.3.3 Vectorizing logistic regression's gradient computation

Recall from earlier that the change in b for a single example, which we will denote as db, was equal to dz. We then want db to be the average of all of these nudges over the training examples. Instead of initializing db = 0, looping over and adding every instance of dz^(i), and dividing the result by m, we can instead compute

db = (1/m) Σ_{i=1}^{m} dz^(i)    (1.3.11)
   = np.sum(dZ) / m,    (1.3.12)

where dZ is the row matrix formed by the dz^(i) elements. To update the weights, recall dw = (1/m) Σ_{i=1}^{m} x^(i) dz^(i), which can be expressed in vectorized form as

dw = np.dot(X, dZ.T) / m.    (1.3.13)
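Putting Sections 1.3.2 and 1.3.3 together, a sketch of one fully vectorized gradient descent step (X is n_x × m, Y is 1 × m, w is n_x × 1):

    import numpy as np

    def train_step(w, b, X, Y, alpha=0.01):
        m = X.shape[1]
        Z = np.dot(w.T, X) + b            # Equation (1.3.8)
        A = 1.0 / (1.0 + np.exp(-Z))      # Equation (1.3.10)
        dZ = A - Y                        # dz = a - y for every example
        dw = np.dot(X, dZ.T) / m          # Equation (1.3.13)
        db = np.sum(dZ) / m               # Equation (1.3.12)
        return w - alpha * dw, b - alpha * db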

1.3.4 Broadcasting example

Consider the matrix

A = [56   0    4   68]
    [1    102  52  8 ]
    [2    135  99  1 ]    (1.3.14)

and suppose we want to find the percentage each element contributes to the sum of its column. For example, A_{11} = 56/(56 + 1 + 2). To calculate the sum of each column, we would type

s = A.sum(axis = 0).    (1.3.15)

Axis 0 refers to the vertical columns inside of a matrix, while axis 1 refers to the horizontal rows in a matrix. Calling the method in Equation 1.3.15 will produce a vector of 4 elements; however, we would like to convert from R^4 → R^{1×4}. To make this conversion, call

s = s.reshape(1, 4).    (1.3.16)

Now we can compute A/s in order to get the percentage each element contributes to the sum of its column. Dividing the matrix A ∈ R^{3×4} by the row vector s ∈ R^{1×4} will divide each of the three elements in a column by the value in the corresponding column of s. Since their sizes are different, this is known as broadcasting. A few other examples of broadcasting include adding a vector and a real number,

[1, 2, 3, 4]^T + 100 = [101, 102, 103, 104]^T,    (1.3.17)

and adding a matrix in R^{m×n} with a row vector in R^{1×n},

[1 2 3]                     [101 202 303]
[4 5 6] + [100 200 300]  =  [104 205 306].    (1.3.18)

Notice, when applying an operation between a matrix in R^{m×n} and a vector in R^{1×n}, the vector will expand to a matrix in R^{m×n}, where each row has the same elements. Similarly, when combining a matrix in R^{m×n} with a column vector in R^{m×1}, the vector will expand to be in R^{m×n}, where each column is identical. An example of a column vector expanding is

[1 2 3]   [100]   [101 102 103]
[4 5 6] + [200] = [204 205 206].    (1.3.19)
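The column-percentage computation from this section, as a runnable sketch:

    import numpy as np

    A = np.array([[56, 0,   4,  68],
                  [1,  102, 52, 8],
                  [2,  135, 99, 1]])

    s = A.sum(axis=0).reshape(1, 4)   # column sums, Equations (1.3.15)-(1.3.16)
    percentages = 100 * A / s         # broadcasting divides each entry by its column sum
    print(percentages[0, 0])          # 56 / (56 + 1 + 2) * 100, about 94.9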

1.3.5 A note on NumPy vectors

Consider the example

a = np.random.randn(5).    (1.3.20)

Calling a.shape will return the shape (5,), which is considered a rank-one array: it is neither a row vector nor a column vector. If we call a.T, nothing changes, and the array will still have the shape (5,). Instead of using rank-one arrays, it is recommended to use proper matrices when dealing with neural networks. If we wanted a matrix of size R^{5×1} instead of the rank-one array in Equation 1.3.20, we would call

a = np.random.randn(5, 1).    (1.3.21)

One good tip when dealing with a matrix of an unknown size is to first assert that it is the size we want, by calling

assert(a.shape == (5, 1)).    (1.3.22)
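A short sketch of the pitfall:

    import numpy as np

    a = np.random.randn(5)         # rank-one array, shape (5,)
    print(a.T.shape)               # still (5,): the transpose does nothing
    print(np.dot(a, a).shape)      # () -- a scalar, easy to misuse

    b = np.random.randn(5, 1)      # a proper 5 x 1 column vector
    print(b.T.shape)               # (1, 5)
    print(np.dot(b, b.T).shape)    # (5, 5) outer product, as expected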

Chapter 2

Neural networks

2.1 Deep learning intuition

A model for deep learning is based on an architecture and parameters. The architecture is the algorithmic design we choose, like logistic regression, linear regression, or shallow neural networks. The parameters are the weights and biases of the model, which takes in an instance as input and attempts to classify it correctly based on the parameters.

Figure 2.1.1: Where the model fits in and what can be adjusted. The model (architecture + parameters) maps an input to an output, from which a loss and gradients are computed. Things that can change: the activation function, optimizer, hyperparameters, loss function, input, output, and architecture.

Consider if we wanted to use a multi-class classifier that not only predicts if an image is a cat or not a cat, but instead predicts from the image whether the image is a cat, dog, or giraffe. Recall that when using a binary classifier, our weights were

w = [w_1, w_2, ..., w_{n_x}]^T,    (2.1.1)
and our labels were single numbers, y ∈ {0, 1}. Now, let us add two more columns to the weights, which will be the weights for the dog classifier and the giraffe classifier:

w = [w_cats  w_dogs  w_giraffes],    (2.1.2)

where each column contains the weights of one animal's classifier. Our weights are now in R^{n_x × 3}.

To update our labels, we have a few options. First, we could let y ∈ {0, 1, 2}, where 0 corresponds to a cat image, 1 corresponds to a dog image, and 2 corresponds to a giraffe image. The second option is one-hot encoding, where we will have an array with each index mapping to an animal. With one-hot encoding, only one item in each labeled array will have a value of 1 (the picture is of that animal), while the rest are 0 (the picture is not of that animal). For example, if we have an image of a cat, our label would be

  cat  dog  giraffe
[  1    0     0   ].    (2.1.3)

While both of these options would work well in most cases, if we have a picture of both a cat and a dog in the same image, our classifier will not work. Instead, we will use multi-hot encoding, which works similarly to one-hot encoding, with the difference being that multiple values can take on the value 1. So now if we have a picture of both a cat and a dog at the same time, we can encode our label as

  cat  dog  giraffe
[  1    1     0   ].    (2.1.4)
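A sketch of the two encodings, using the class order (cat, dog, giraffe) from Equations (2.1.3) and (2.1.4):

    import numpy as np

    classes = ["cat", "dog", "giraffe"]

    def encode(present):
        # present: names of the animals appearing in the image.
        # One name gives a one-hot label; several give a multi-hot label.
        label = np.zeros(len(classes))
        for name in present:
            label[classes.index(name)] = 1
        return label

    print(encode(["cat"]))          # [1. 0. 0.], Equation (2.1.3)
    print(encode(["cat", "dog"]))   # [1. 1. 0.], Equation (2.1.4)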

The jth neuron in the ith layer will be labeled a^[i]_j, as shown in Figure 2.1.2.
Figure 2.1.2: Notation for the course: inputs x^(i)_1, ..., x^(i)_4 feed hidden neurons a^[1]_1, ..., a^[1]_5, which feed the output neuron a^[2]_1 = ŷ^(i).

As the layers increase, the activation neurons start to look at more complex items. For example, if we are working with a face classifier, the first layer might only detect edges, the second layer might put some of the edges together to find ears and eyes, and the third layer might detect the entire face. The data extracted from the neurons of a layer is known as an encoding.

2.1.1 Day'n'night classification

Our goal is to create an image classifier that predicts from a photo taken outside whether it was taken "during the day" (0) or "during the night" (1). For the cat example, we needed about 10,000 images to create a good classifier; we estimate this problem is about as difficult, so we will need about 10,000 images to classify everything correctly. We will split the data up into both a training and a testing set. In this case, about 80% of the data will be in the training set. In cases with more training data, we would give less of a percentage to the testing set and more of a percentage to the training set. We also need to make sure that the data is not split randomly, but rather split proportionally to the group the data is in (80% of day examples and 80% of night examples go in the training set).

For our input, we would like to work with images that have the lowest possible quality while still achieving good results. One clever way to choose the lowest possible quality is to find the lowest quality at which humans can still perfectly identify the images, and then use that quality. After comparing to human performance, we find that 64 × 64 × 3 pixels will work well as our image size.

Our output will be 0 for day and 1 for night. We will also use the sigmoid function as our last activation function because it maps a number in R to be between 0 and 1. A shallow neural network, one with only one hidden layer, should work pretty well with this example because it is a fairly straightforward task. For our loss function, we will use the logistic loss

L(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ).    (2.1.5)

2.1.2 Face verification

Our goal is to find a way for a school to use face verification for validating student IDs in facilities such as dining halls, gyms, and pools. The data we will need is a picture of every student labeled with their name. The input will be the data from the cameras as well as who the person is claiming to be. In general, faces have more detail than the night sky, so we will need the input to be a larger size. Using the human test we used last time, we find a solid resolution to be 412 × 412 × 3 pixels. We will make the output 1 if it is the correct person at the camera and 0 if it is not.

To compare the image we have of a person in our database with the image we have from a camera, we want to find a function to call on both of the images in order to determine if the person in the images is the same. We could directly compare the differences in pixels; however, that presents severe problems if the background lighting is different, the person grows facial hair, or the ID photo is outdated. Instead, we will use a deep network for our function, as illustrated in Figure 2.1.3.

Figure 2.1.3: The architecture we will use to classify images will be deep networks: each image is mapped to an encoding vector (e.g., 0.931, 0.433, ..., 0.039), and the two encodings are compared; a small distance means the same person.

We will set the distance threshold to be some number, and for any distance less than that number we will classify with y = 1. Because the school will not have tons of data on the pictures of students, we will use a public face dataset, where we want images of the same people to have similar encodings and images of different people to have different encodings. To train our network, we will feed it triplets. Each instance will have an anchor (the actual person), a positive example (the same person, minimize encoding distance), and a negative example (a different person, maximize encoding distance).

Figure 2.1.4: Ordered triplet input example: anchor, positive (minimize distance), negative (maximize distance).

For our loss function, we can use

L = ‖Enc(A) − Enc(P)‖ − ‖Enc(A) − Enc(N)‖ + α,    (2.1.6)

where ‖·‖ represents the L2 norm and Enc represents the encoding. The loss function presented makes sense because we want ‖Enc(A) − Enc(P)‖ to be minimized and ‖Enc(A) − Enc(N)‖ to be maximized. To minimize a quantity that we want maximized, we use its negative, which gives us the −‖Enc(A) − Enc(N)‖ term. The α term, called the margin, is used as a kickstart for our encoding function, to avoid the case where every weight in the deep network is zero and we end up with a perfect loss. It is common to keep loss functions positive, so we generally train on the maximum of the loss function and 0.
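A sketch of the clipped loss in Equation (2.1.6); the encodings are assumed to already be NumPy vectors produced by the deep network, and the margin value 0.2 is an arbitrary choice here:

    import numpy as np

    def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
        # Equation (2.1.6), kept positive by clipping at 0.
        pos = np.linalg.norm(enc_a - enc_p)   # want this small
        neg = np.linalg.norm(enc_a - enc_n)   # want this large
        return max(pos - neg + alpha, 0.0)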

Figure 2.1.5: Updated model for face verification: the inputs are the anchor, positive, and negative images, and the outputs are their encodings Enc(A), Enc(P), Enc(N), from which the loss and gradients are computed.

2.1.3 Face recognition

In our school, we now want to use face identification to recognize students in facilities. Instead of just verifying who the person is given their ID, we need to identify the person out of many people. We will reuse a lot of the details from the face verification example, but instead of inputting 3 images and outputting 3 encodings, we will input one image and output the encoding of that image. Then, we will compare the encoding with our database, which contains encodings for every student, and use a k-nearest neighbors algorithm to make comparisons. To go over the entire database, the runtime complexity will be O(n).

For another example, suppose we have thousands of pictures on our phone of 20 different people and we want to group pictures of the same people into folders. Using the encoding of each image, we could run a k-means clustering algorithm to find groups within the vector space that correspond to the same people.

2.1.4 Art generation (neural style transfer)

Given a picture, our goal is to make it beautiful. Suppose we have any data that we want. Our input will contain both content and the style we consider beautiful. Then, once we input a new content image, we would like to generate a styled image based on the content.

Figure 2.1.6: Input for art generation: (a) content image, (b) style image.

The architecture we will use is first based on a model that understands images very well. The existing ImageNet models, for instance, are perfectly suitable for an image understanding task. After the image is forward propagated through a few layers of the network, we will get information about the content in the image by looking at the neurons. We will call this information Content_C. Then, giving the style image to the deep network, we will use the Gram matrix to extract Style_S from the style image. More about the Gram matrix will be discussed later in the course.

The loss function we can define as

L = ‖Content_C − Content_G‖ + ‖Style_S − Style_G‖,    (2.1.7)

where G denotes the generated image.

2.1.5 Trigger word detection

When given 10 seconds of audio speech, we want to detect when the word "active" has been said. To create a model, we will need a lot of 10-second audio clips with as much variety in the voices as possible. We can break the input down into three segments: when "active" is being said, when a word that is not "active" is being said, and when nothing is being said. The sample rate of the input would be chosen similarly to how we found the resolution of images, determining the minimum viable amount that humans have no trouble with.

Figure 2.1.7: Art generation architecture: a pretrained deep network is used to compute the loss, and over many iterations the pixels of the generated image are updated using the gradients.

Figure 2.1.8: Input for the trigger word detection model. Green represents the time frame in which "active" is being said, red represents the time frames in which non-"active" words are being said, and black represents the time frames in which nothing is being said.

The output will be a sequence of 0s and then a 1 after the word "active" is said. This output is generally easier to debug and works better for continuous models. The last activation we will use is a sigmoid, and the architecture we will use should be a recurrent neural network. We can generate data for our model by collecting positive words, negative words, and background noise, and then adding different combinations of the data together to form 10-second clips.

2.2 Shallow neural networks

Figure 2.2.1: Shallow neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3 feed hidden neurons a^[1]_1, a^[1]_2, a^[1]_3, which feed the output neuron a^[2]_1 = ŷ^(i).

2.2.1 Neural networks overview

In each layer of a neural network, we will calculate the linear activation and then map the linear activation to be between 0 and 1. To calculate the linear activation in the ith layer,

z^[i] = W^[i] x + b^[i].    (2.2.1)
Figure 2.2.2: Each neuron in an activation layer represents two computations: the linear weights computation plus the bias (z = w^T x + b), and then squeezing that number between 0 and 1 (a = σ(z)).

To calculate the non-linear activation in the ith layer, we would use the sigmoid function

a^[i] = σ(z^[i]).    (2.2.2)

Only once we get to the final layer would we calculate the loss.

2.2.2 Neural network representations

The hidden layer refers to values that are not seen in the training set. To denote the input layer x, we may also use the notation a^[0]. When we count neural network layers, we do not include the input layer in that count; however, we do include the output layer. We will also refer to the nodes in the hidden layers and output layer as neurons.

2.2.3 Computing a neural network's output

In the shallow neural network pictured in Figure 2.2.5, for the hidden layer we must calculate

z^[1]_1 = w^[1]T_1 x + b^[1]_1,    a^[1]_1 = σ(z^[1]_1)
z^[1]_2 = w^[1]T_2 x + b^[1]_2,    a^[1]_2 = σ(z^[1]_2)
z^[1]_3 = w^[1]T_3 x + b^[1]_3,    a^[1]_3 = σ(z^[1]_3).    (2.2.3)

Implementing the assignments in Equation 2.2.3 would require an inefficient for-loop. Instead, we will vectorize the process as follows:

z^[1] = [— w^[1]T_1 —]       [x_1]   [b^[1]_1]   [w^[1]T_1 x + b^[1]_1]
        [— w^[1]T_2 —]   ·   [x_2] + [b^[1]_2] = [w^[1]T_2 x + b^[1]_2]
        [— w^[1]T_3 —]       [x_3]   [b^[1]_3]   [w^[1]T_3 x + b^[1]_3].    (2.2.4)

The dimensions of the weight matrix will be the number of neurons in the hidden layer by the number of inputs. The bias vector will be a column vector with the same number of rows as the weight matrix. We will call the weight matrix W^[1] and the bias vector b^[1]. For our activation, it follows that

a^[1] = [a^[1]_1, a^[1]_2, a^[1]_3]^T = σ(z^[1]).    (2.2.5)

Putting both the values of z and a together, we now only have to calculate

z^[1] = W^[1] x + b^[1],    a^[1] = σ(z^[1]).    (2.2.6)

Instead of writing x, we will write in its place a^[0], which gives us the assignments

z^[1] = W^[1] a^[0] + b^[1],    a^[1] = σ(z^[1]).    (2.2.7)

We prefer the notation a^[0] in this case because it will make deeper layers more natural to reference. Now for the second layer, we can use a similar process to show

z^[2] = W^[2] a^[1] + b^[2],    a^[2] = σ(z^[2]).    (2.2.8)

From Figure 2.2.5, we see that there is only 1 neuron in activation layer 2, with 3 inputs. This implies that the 1 × 3 weight matrix is multiplied by the 3 × 1 input matrix and added to the 1 × 1 bias term.

2.2.4 Vectorizing across multiple examples

Previously, we have only calculated the values from a single input x. For m training examples, we would be required to run a loop over every single example in order to calculate each corresponding a^[2](i) value, where the index i denotes the ith training example. The corresponding for-loop is shown in Algorithm 1.

Algorithm 1 Forward propagation

for i ∈ {1, ..., m} do
    z^[1](i) = W^[1] a^[0](i) + b^[1]
    a^[1](i) = σ(z^[1](i))
    z^[2](i) = W^[2] a^[1](i) + b^[2]
    a^[2](i) = σ(z^[2](i))
end for

Earlier, we defined the matrix X to store all of the input instances as

X = [x^(1)  ...  x^(m)],    (2.2.9)

where X ∈ R^{n_x × m}. We can instead define

Z^[1] = W^[1] X + b^[1],    (2.2.10)

where we broadcast the column vector b to be a matrix with m columns. It follows that

A^[1] = σ(Z^[1]).    (2.2.11)

Now, the columns of Z and A will be

Z^[1] = [z^[1](1)  ...  z^[1](m)],    A^[1] = [a^[1](1)  ...  a^[1](m)].    (2.2.12)

Horizontally, the columns correspond to different training examples, and vertically, the rows correspond to different neurons in the hidden layer. The complete vectorization across multiple examples will be:

Z^[1] = W^[1] A^[0] + b^[1]
A^[1] = σ(Z^[1])
Z^[2] = W^[2] A^[1] + b^[2]
A^[2] = σ(Z^[2]).    (2.2.13)
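A sketch of Equation (2.2.13) in NumPy, with the shapes used in the text (columns are examples):

    import numpy as np

    def sigmoid(Z):
        return 1.0 / (1.0 + np.exp(-Z))

    def forward(X, W1, b1, W2, b2):
        Z1 = np.dot(W1, X) + b1    # b1 broadcasts over the m columns
        A1 = sigmoid(Z1)
        Z2 = np.dot(W2, A1) + b2
        A2 = sigmoid(Z2)
        return A2

    # Example shapes: 3 inputs, 4 hidden units, 1 output, m = 5 examples.
    X = np.random.randn(3, 5)
    W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
    W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))
    print(forward(X, W1, b1, W2, b2).shape)   # (1, 5)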
2.2.5 Activation functions

Previously, when forward propagating, we used σ as our activation function, although that is just one choice. Recall that the graph of

σ(z) = 1 / (1 + e^{−z})    (2.2.14)

in Figure 2.2.3 was used to map z from any number in R to be between 0 and 1.

Figure 2.2.3: Sigmoid function.

Let us denote our activation function as g, so that

a^[i] = g(z^[i]).    (2.2.15)

Figure 2.2.4: Hyperbolic tangent function as an activation function.

An alternative to the traditional sigmoid function is a sigmoid-like function with a range from −1 to 1, which can be thought of as dragging the bottom of the sigmoid from the line at 0 down to the line at −1. Working through the calculations, we would find that this function is the hyperbolic tangent function, so our new activation function could be

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).    (2.2.16)

Generally, the hyperbolic tangent function tends to outperform the sigmoid function because it has the effect of centering the activations at 0. For the last layer, however, it is still common to use the sigmoid function to give a better interpretation of the data from a probabilistic perspective. Since different activation layers may have different activation functions, we will refer to the activation function in the ith layer as g^[i].
.

As we approach larger and smaller numbers for both the sigmoid and hyperbolic tangent functions, it is clear that the rate of change decreases drastically, approaching no rate of change at the asymptotes.

Figure 2.2.5: Different layers in the neural network may have different activation functions (here, tanh in the hidden layer and σ in the output layer).

The rectified linear unit (ReLU) activation function is often used to solve this problem at the infinities. The ReLU function is defined as

ReLU(z) = max(0, z),    (2.2.17)

which gives the plot shown in Figure 2.2.6. The rate of change of the ReLU function will be 1 if z > 0 and 0 if z < 0. It is quite improbable that z = 0 exactly, but in that case our derivative is undefined and we should manually set the result to either 0 or 1.

Figure 2.2.6: ReLU activation function.

For binary classification problems, where y ∈ {0, 1}, it is common practice to use the sigmoid activation function in the last layer and the ReLU activation function everywhere else. We prefer the ReLU activation function in practice because it is often much easier and faster to handle its derivatives.

The last activation function is the leaky ReLU, which behaves similarly to the ReLU function, with the exception being how negative numbers are handled. Instead of making all negative numbers 0, the leaky ReLU uses some scaling factor multiplied by z. The most common scaling factor is 0.01, which corresponds to the leaky ReLU function

leaky ReLU(z) = max(0.01z, z).    (2.2.18)

In practice, it is often hard to determine which activation function should be used. It is quite common to test different combinations of activation functions in order to find which one works the best.
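A sketch of the four activation functions discussed:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))     # Equation (2.2.14)

    def tanh(z):
        return np.tanh(z)                   # Equation (2.2.16)

    def relu(z):
        return np.maximum(0, z)             # Equation (2.2.17)

    def leaky_relu(z, scale=0.01):
        return np.maximum(scale * z, z)     # Equation (2.2.18)

    z = np.array([-2.0, 0.0, 2.0])
    print(relu(z))          # [0. 0. 2.]
    print(leaky_relu(z))    # [-0.02  0.    2.  ]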

Figure 2.2.7: Leaky ReLU activation function.

2.2.6 Why use non-linear activation functions?

Consider the shallow neural network pictured in Figure 2.2.8.

Figure 2.2.8: A shallow network with a linear activation function in each hidden neuron and σ at the output.

If we used a linear activation function, that is,

g^[i](z^[i]) = z^[i],    (2.2.19)

then we have

a^[1] = z^[1] = W^[1] x + b^[1]
a^[2] = z^[2] = W^[2] a^[1] + b^[2].    (2.2.20)

Substituting W^[1] x + b^[1] in for a^[1] in the second equation, we have

a^[2] = W^[2] (W^[1] x + b^[1]) + b^[2].    (2.2.21)

Expanding out the matrix products, we have

a^[2] = (W^[2] W^[1]) x + (W^[2] b^[1] + b^[2]),    (2.2.22)

which is a linear function of x. Since this is a linear hidden layer, stacking deeper and deeper sets of linear weights on top of each other will never change our function from being linear, so the extra weights end up being useless. When using linear activation functions throughout, the entire network behaves just like a logistic regression model.

In the case of linear regression, where ŷ ∈ R, such as when housing prices are being predicted, we may use a linear function in the output layer of the neural network in order to map onto the set of real numbers, but other than that case, linear activation functions should be avoided.

2.2.7 Derivatives of activation functions

With a sigmoid activation function

g(z) = 1 / (1 + e^{−z}),    (2.2.23)

we can write the derivative in terms of the original function using some clever algebraic manipulation that follows:

∂σ/∂z = e^{−z} / (1 + e^{−z})^2    (2.2.24)
      = (e^{−z} / (1 + e^{−z})) · (1 / (1 + e^{−z}))    (2.2.25)
      = ((1 + e^{−z} − 1) / (1 + e^{−z})) · (1 / (1 + e^{−z}))    (2.2.26)
      = ((1 + e^{−z}) / (1 + e^{−z}) − 1 / (1 + e^{−z})) · (1 / (1 + e^{−z}))    (2.2.27)
      = (1 − σ) · σ.    (2.2.28)

When using the hyperbolic tangent function as our activation function, we have

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).    (2.2.29)

The derivative of the hyperbolic tangent function can also be written in terms of its input:

∂tanh(z)/∂z = 1 − tanh^2(z).    (2.2.30)

For the ReLU function

g(z) = max(0, z),    (2.2.31)

the derivative is

g′(z) = { 1 if z > 0, 0 if z < 0 }.    (2.2.32)

In the rare case where we have g′(z = 0), the derivative is technically undefined, although in practice we choose a value of either 0 or 1. The choice tends to be very insignificant.

For the leaky ReLU function

g(z) = max(0.01z, z),    (2.2.33)

the derivative is

g′(z) = { 1 if z > 0, 0.01 if z < 0 }.    (2.2.34)

We will treat the case where z = 0 for the leaky ReLU similarly to the case where z = 0 for the ReLU, in which we choose either 0.01 or 1.
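A sketch of the derivatives above; the z = 0 cases are mapped arbitrarily, as discussed:

    import numpy as np

    def dsigmoid(a):
        # Equation (2.2.28), in terms of a = sigmoid(z).
        return a * (1 - a)

    def dtanh(a):
        # Equation (2.2.30), in terms of a = tanh(z).
        return 1 - a ** 2

    def drelu(z):
        # Equation (2.2.32); z == 0 mapped to 0 here.
        return (z > 0).astype(float)

    def dleaky_relu(z, scale=0.01):
        # Equation (2.2.34); z == 0 mapped to scale here.
        return np.where(z > 0, 1.0, scale)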

2.2.8 Gradient descent for neural networks

For the shallow neural network with one input layer, one hidden layer, and one output layer, we will have the parameters W^[1] ∈ R^{n^[1] × n^[0]}, b^[1] ∈ R^{n^[1] × 1}, W^[2] ∈ R^{n^[2] × n^[1]}, and b^[2] ∈ R^{n^[2] × 1}. So far we have only seen examples where n^[2] = 1. We are also denoting n^[0] as n_x, and n^[i] as the number of nodes in the ith layer.

For now, assume we are doing binary classification with the cost function

J(W^[1], b^[1], W^[2], b^[2]) = (1/m) Σ_{i=1}^{m} L(ŷ, y),    (2.2.35)

where ŷ = a^[2] in this case, and our loss function is

L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ).    (2.2.36)

We then make updates to our parameters with the gradient descent algorithm shown in Algorithm 2.

Figure 2.2.9: Each neuron can be broken into two parts: the linear activation z and then the non-linear activation a.
Algorithm 2 Gradient descent

repeat
    Compute the predictions ŷ^(i) for i ∈ {1, ..., m}
    Compute the derivatives of the parameters dW^[ℓ], db^[ℓ] with respect to J
    Update the weights and biases with W^[ℓ] = W^[ℓ] − α dW^[ℓ] and b^[ℓ] = b^[ℓ] − α db^[ℓ]
until the cost function J converges
Figure 2.2.10: Computation graph for logistic regression: inputs x, w, b feed z = w^T x + b, then a = σ(z), then L(a, y).

Recall that when we were using logistic regression, we had the computational graph in Figure 2.2.10, where the red arrows represent forward propagation and the blue arrows represent backward propagation. We previously calculated

L_a = −y/a + (1 − y)/(1 − a)    (2.2.37)
L_z = L_b = a − y    (2.2.38)
L_x = w(a − y)    (2.2.39)
L_w = x(a − y).    (2.2.40)

In the two-layer computational graph pictured in Figure 2.2.11, it is clear that a^[2], z^[2], W^[2], and b^[2] will have similar rates of change of the loss function to a, z, w, and b when using logistic regression, so we have the derivatives

L_{a^[2]} = −y/a^[2] + (1 − y)/(1 − a^[2])    (2.2.41)
L_{z^[2]} = a^[2] − y    (2.2.42)
L_{b^[2]} = a^[2] − y    (2.2.43)
L_{W^[2]} = (a^[2] − y) · a^[1]T.    (2.2.44)
Figure 2.2.11: Two-layer neural network computational graph: (W^[1], x, b^[1]) feed z^[1] = W^[1] x + b^[1], then a^[1] = σ(z^[1]); (W^[2], b^[2]) feed z^[2] = W^[2] a^[1] + b^[2], then a^[2] = σ(z^[2]), then L(a^[2], y).

We can calculate L_{a^[1]} from the chain rule:

∂L/∂a^[1] = (∂L/∂z^[2]) · (∂z^[2]/∂a^[1])    (2.2.45)
          = (a^[2] − y) · W^[2]T.    (2.2.46)

Going backward to L_{z^[1]}, we can use the result from L_{a^[1]} together with ∂a^[1]/∂z^[1] = a^[1](1 − a^[1]):

∂L/∂z^[1] = (∂L/∂a^[1]) · (∂a^[1]/∂z^[1])    (2.2.47)
          = ((a^[2] − y) · W^[2]T) ∗ (a^[1] · (1 − a^[1])).    (2.2.48)

Now we can get back to the weight and bias terms in the first layer, where the rate of change with respect to W^[1] is

∂L/∂W^[1] = (∂L/∂z^[1]) · (∂z^[1]/∂W^[1])    (2.2.49)
          = W^[2]T(a^[2] − y) ∗ a^[1](1 − a^[1]) · x^T,    (2.2.50)

and the rate of change with respect to b^[1] is

∂L/∂b^[1] = (∂L/∂z^[1]) · (∂z^[1]/∂b^[1])    (2.2.51)
          = W^[2]T(a^[2] − y) ∗ a^[1](1 − a^[1]).    (2.2.52)

What we have written is for the case of one input example; however, we can extend to all the input examples by using vectorization, where

L_{Z^[2]} = A^[2] − Y.    (2.2.53)

For L_{W^[2]}, we will need to consider the cost function inside of our rate of change, so we get

L_{W^[2]} = (1/m) L_{Z^[2]} A^[1]T.    (2.2.54)

We will write L_{b^[2]} using the NumPy notation

L_{b^[2]} = (1/m) · np.sum(L_{Z^[2]}, axis = 1, keepdims = True).    (2.2.55)

We will have L_{Z^[1]} ∈ R^{n^[1] × m}, where

L_{Z^[1]} = W^[2]T L_{Z^[2]} ∗ g^[1]′(Z^[1]),    (2.2.56)

where g^[1] is the activation function in layer 1 and ∗ is the element-wise product operator. Finally, the changes to the weights and biases in layer 1 will be

L_{W^[1]} = (1/m) L_{Z^[1]} X^T    (2.2.57)

and

L_{b^[1]} = (1/m) · np.sum(L_{Z^[1]}, axis = 1, keepdims = True).    (2.2.58)
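A sketch of Equations (2.2.53)-(2.2.58) for the sigmoid-activated shallow network, so g^[1]′(Z^[1]) = A^[1] ∗ (1 − A^[1]); the cached values A1, A2 are assumed to come from a forward pass like the one in Section 2.2.4:

    import numpy as np

    def backward(X, Y, A1, A2, W2):
        m = X.shape[1]
        dZ2 = A2 - Y                                     # Equation (2.2.53)
        dW2 = np.dot(dZ2, A1.T) / m                      # Equation (2.2.54)
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m     # Equation (2.2.55)
        dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)          # Equation (2.2.56)
        dW1 = np.dot(dZ1, X.T) / m                       # Equation (2.2.57)
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m     # Equation (2.2.58)
        return dW1, db1, dW2, db2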
Figure 2.2.12: Two-layer neural network with inputs x_1, x_2, hidden neurons a^[1]_1, a^[1]_2, and output a^[2]_1 = ŷ^(i).

2.2.9 Random Initialization

Consider if we initialized

W^[1] = [0 0]
        [0 0],    (2.2.59)

then a^[1]_1 = a^[1]_2. If two neurons in the same layer are equal, then they are going to have the same rate of change and the same update from the gradient. With the same changes after each iteration, the two neurons will act as if they are the same, which means having two neurons is pointless, since the second adds no additional information. We call this error the symmetry breaking problem.

The solution to the problem is to initialize small non-zero random weights, where we initialize

W^[1] = np.random.randn(2, 2) ∗ 0.01.    (2.2.60)

We use the 0.01 term to make sure that our initial values are very small, since our hyperbolic tangent and sigmoid functions have more meaningful gradients near 0, so our function converges faster.

It turns out that our bias term b does not have a symmetry problem, so it is alright if we initialize it to be the zero column vector

b^[1] = np.zeros((2, 1)).    (2.2.61)

We can initialize the rest of the weights and bias terms in a similar fashion.
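A sketch of the initialization for one layer with n_prev inputs and n_curr units:

    import numpy as np

    def initialize(n_prev, n_curr, scale=0.01):
        # Small random weights break symmetry (Equation 2.2.60);
        # zero biases are fine (Equation 2.2.61).
        W = np.random.randn(n_curr, n_prev) * scale
        b = np.zeros((n_curr, 1))
        return W, b

    W1, b1 = initialize(3, 4)   # e.g., 3 inputs, 4 hidden units
    W2, b2 = initialize(4, 1)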

2.3 Deep neural networks

Figure 2.3.1: 4-layer neural network: inputs x^(i)_1, x^(i)_2, x^(i)_3; hidden layers with 4, 4, and 3 neurons; output a^[4]_1 = ŷ^(i).

For neural networks, we will use L to denote the number of layers and n^[ℓ] to denote the number of nodes in the ℓth layer. For example, in Figure 2.3.1, we have L = 4, n^[1] = n^[2] = 4, and n^[3] = 3. To refer to the activations in each layer, we will use a^[ℓ], and to refer to the weights in each layer we will use w^[ℓ]. Similarly, the linear activation and bias in each layer will be z^[ℓ] and b^[ℓ], respectively. We can refer to the input as ℓ = 0 and the output as ℓ = L.

2.3.1 Forward propagation in a deep network

To propagate a training example forward using the network in Figure 2.3.1, we use a process similar to how we forward propagated shallow neural networks, but now extended to more layers, calculating

z^[1] = w^[1] a^[0] + b^[1],    a^[1] = g^[1](z^[1])
z^[2] = w^[2] a^[1] + b^[2],    a^[2] = g^[2](z^[2])
z^[3] = w^[3] a^[2] + b^[3],    a^[3] = g^[3](z^[3])
z^[4] = w^[4] a^[3] + b^[4],    a^[4] = g^[4](z^[4]).    (2.3.1)

In general, it is easy to see that

z^[ℓ] = w^[ℓ] a^[ℓ−1] + b^[ℓ],    a^[ℓ] = g^[ℓ](z^[ℓ]),    (2.3.2)

for 1 ≤ ℓ ≤ L. For an entire training set, instead of a single training instance, we use

Z^[ℓ] = w^[ℓ] A^[ℓ−1] + b^[ℓ],    A^[ℓ] = g^[ℓ](Z^[ℓ]).    (2.3.3)

While we would prefer to never use for-loops, using a for-loop to iterate over each layer is the best way to forward propagate.
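A sketch of that loop, storing the parameters as a list of (W, b) pairs and assuming sigmoid for every g^[ℓ] for simplicity:

    import numpy as np

    def deep_forward(X, params):
        # params: [(W1, b1), ..., (WL, bL)]; X plays the role of A[0].
        A = X
        for W, b in params:                   # one iteration per layer, Equation (2.3.3)
            Z = np.dot(W, A) + b
            A = 1.0 / (1.0 + np.exp(-Z))      # g = sigmoid, for simplicity
        return A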

2.3.2 Getting the matrix dimensions right

Figure 2.3.2: Network for the matrix dimensions walkthrough: 2 inputs and a hidden layer with 3 neurons.

One of the aspects of deep learning that is easy to make a mistake on is calculating the dimensions of the matrices. In this section, we are going to walk through how we can think about the dimensions and make sure they are correct at each step. Consider the neural network in Figure 2.3.2, where our goal is to form W^[1], b^[1], and x in order to form

z^[1] = w^[1] x + b^[1].    (2.3.4)

We know that input features correspond to the rows of our matrix and different training examples correspond to different columns. So, if each instance has two input features, each instance is in R^{2×1}. Further, since n^[1] = 3, we know that for each instance, z^[1] ∈ R^{3×1}. So, we currently have the dimensions for the equation to be

(3 × 1) = w^[1] (2 × 1) + b^[1].    (2.3.5)

From matrix multiplication (without broadcasting), w^[1] must have 2 columns to match the number of rows of x. We also know that when adding matrices the dimensions must match, so b^[1] must be in the same dimensional space as z^[1] and w^[1] x, which means b^[1] ∈ R^{3×1}. Now, from the properties of matrix multiplication, w^[1] x must have w^[1] ∈ R^{3×2}. The matrix a will always be in the same dimensional space as z, since a only applies element-wise operations to the elements of z. More generally, the sizes for each item in our neural network based on ℓ will be

z^[ℓ], a^[ℓ] ∈ R^{n^[ℓ] × m}    (2.3.6)
w^[ℓ] ∈ R^{n^[ℓ] × n^[ℓ−1]}    (2.3.7)
b^[ℓ] ∈ R^{n^[ℓ] × 1}.    (2.3.8)

The derivatives of each variable will also be in the same dimensional space as the variables, so when we are backward propagating,

dz^[ℓ], da^[ℓ] ∈ R^{n^[ℓ] × m}    (2.3.9)
dw^[ℓ] ∈ R^{n^[ℓ] × n^[ℓ−1]}    (2.3.10)
db^[ℓ] ∈ R^{n^[ℓ] × 1}.    (2.3.11)
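Combining these rules with the assertion tip from Section 1.3.5, a sketch that checks every layer's shapes:

    import numpy as np

    def check_shapes(params, n):
        # n = [n0, n1, ..., nL]; params = [(W1, b1), ..., (WL, bL)].
        for ell, (W, b) in enumerate(params, start=1):
            assert W.shape == (n[ell], n[ell - 1])   # Equation (2.3.7)
            assert b.shape == (n[ell], 1)            # Equation (2.3.8)

    n = [3, 4, 4, 3, 1]   # layer sizes of the network in Figure 2.3.1
    params = [(np.random.randn(n[l], n[l - 1]) * 0.01, np.zeros((n[l], 1)))
              for l in range(1, len(n))]
    check_shapes(params, n)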

2.3.3 Why deep representations?

Figure 2.3.3: Deeper layers in a neural network analyze more complex data (less complex → more complex).

When using a neural network architecture, the deeper nodes look at more complex data. For example, with images, the first layer may detect edges, the second layer may put a few edges together, and the third layer may formulate the understanding of a face. Compounding knowledge throughout multiple layers is analogous to how many people believe the human brain works.

Graphs, in general, are often more useful when multiple layers can be stacked on top of each other. Consider the logic tree in Figure 2.3.4 that represents the function y = x_1 ∨ (x_2 ∧ x_3). With more layers, it is easier to store more information that becomes more complex as the layers grow in depth.

2.3.4 Forward and backward propagation

The forward propagation step on layer ℓ takes in a^[ℓ−1] as input and outputs a^[ℓ], while also caching z^[ℓ] for backpropagation. For backward propagation, we will input da^[ℓ] and output da^[ℓ−1], dW^[ℓ], and db^[ℓ]. From the chain rule, we can derive the outputs of backpropagation to

Figure 2.3.4: Comparing logic trees for y = x_1 ∨ (x_2 ∧ x_3) of different depths: (a) few nodes are required for a deeper tree; (b) exponentially more nodes are required for a shallow tree, since it must test every possible combination of x_1, x_2, and x_3. It is easier for deeper trees to represent the same information as shallow trees with fewer nodes.

Figure 2.3.5: Forward and backward propagation steps for a two-layer neural network. We first forward propagate across the top (a^[0] → a^[1] → a^[2] = ŷ, using W^[ℓ], b^[ℓ]), caching the intermediate z^[ℓ] values, then backward propagate across the bottom (da^[2] → da^[1] → da^[0]) to find the changes dW^[ℓ], db^[ℓ] to the weights.

give us the equations

dz^[ℓ] = da^[ℓ] ∗ g^[ℓ]′(z^[ℓ])
dW^[ℓ] = dz^[ℓ] · a^[ℓ−1]T
db^[ℓ] = dz^[ℓ]
da^[ℓ−1] = W^[ℓ]T · dz^[ℓ].    (2.3.12)

For an entire training set, we would have

dZ^[ℓ] = dA^[ℓ] ∗ g^[ℓ]′(Z^[ℓ])
dW^[ℓ] = (1/m) · dZ^[ℓ] · A^[ℓ−1]T
db^[ℓ] = (1/m) · np.sum(dZ^[ℓ], axis=1, keepdims=True)
dA^[ℓ−1] = W^[ℓ]T · dZ^[ℓ].    (2.3.13)
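As a sketch of how Equation 2.3.13 translates into code, the backward step for a single layer might look like the following in NumPy; the function name and cache layout are assumptions for illustration:

```python
import numpy as np

def layer_backward(dA, cache, g_prime):
    """Backward pass through one layer, following Equation 2.3.13.

    cache holds (A_prev, W, Z) saved during forward propagation, and
    g_prime computes the element-wise derivative g^[l]'(Z).
    """
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                                # dZ[l] = dA[l] * g'(Z[l])
    dW = (1.0 / m) * (dZ @ A_prev.T)                    # dW[l] = 1/m dZ[l] A[l-1]^T
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)  # db[l]
    dA_prev = W.T @ dZ                                  # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db
```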

2.3.5 Parameters vs hyperparameters

For neural network models, our parameters consist strictly of the weights W^[ℓ] and biases b^[ℓ]. However, there are plenty more variables that we have to set in order to optimize our function, such as the learning rate α, the number of gradient descent iterations, the number of hidden layers L, the number of hidden units n^[ℓ], and the type of activation function. These variables that affect our parameters are known as hyperparameters. Many hyperparameters are set on an experimental basis, and much applied work goes into determining which hyperparameters are optimal.

2.3.6 Connection to the brain

Modern deep learning has little resemblance to what we know about how the brain learns. While it is still unknown exactly how a neuron in the brain works, we have a loose idea that each neuron takes some signal and passes it on to the next neuron. Putting many neurons together can then loosely resemble a tree-like structure similar to a neural network. However, there is little evidence that anything like forward and backward propagation occurs, so the connection to the brain is very loose.

Chapter 3

Optimizing and structuring a

neural network

3.1 Full-cycle deep learning projects

For most machine learning projects we have to do the following tasks in order:

  1. find a problem
  2. get data
  3. design a model
  4. train the model
  5. test the model
  6. deploy the project
  7. maintain the project

The model we initially choose tends to be changed repeatedly in the search for the model that performs best. Most machine learning papers tend to focus on steps 2–5, but in this section, we will focus on steps 1, 6, and 7. Good deep learning problems tend to be interesting, with available data, in a field we have domain knowledge in, as well as useful and feasible to create. Feasibility is a massive concern when choosing projects, especially in industry. To discuss finding data, let us use the example of a trigger word speech detection system, such as Alexa. It may be a good idea to first use a small set of data in order to quickly start building out a model and finding out the difficulties in the problem at hand; even a couple hundred instances may initially work well. One way to collect new data would be to go around and ask different people to speak into a recording device, where they say either nothing, "Alexa," or some other words.

As we train the models iteratively, it is a good idea to document each model that we set up and the results it produced. That way, it will be easy to look back on older models when making tweaks to a current system.

Figure 3.1.1: Breaking a problem involving trigger word detection into two parts: voice activity detection (VAD) and a large neural network for trigger word detection, mapping audio to ŷ.

For our system involving trigger word detection, it may be a good idea to break the problem into two parts: one that detects if a word is spoken, and one that detects if the spoken word was the trigger word. In modern literature, the word speaking detector may be referred to as voice activity detection (VAD). When using VAD we primarily have two options to choose from when selecting a model. We can use a non-machine-learning-based approach that detects if the volume is greater than some threshold, or we can use a small neural network that is trained to detect human speech. Since the non-machine-learning option may only take a few minutes to implement, it may be best to choose that option first in order to minimize cost and move quickly. Another downside to the neural network approach is that new accents may be introduced in production that the model has never heard before, which could cause abnormal results.

3.2 Setting up a machine learning application

3.2.1 Training, development, and testing sets

Applied machine learning involves choosing many hyperparameters for a model, such as the number of layers, hidden units, learning rates, and activation functions. Most choices of hyperparameters for one subset of problems, such as natural language processing, tend not to work very well on other machine learning problems, like computer vision. Because it is often difficult to determine which hyperparameters work best, we tend to check our choice of hyperparameters against held-out data in order to determine its effectiveness.

Figure 3.2.1: A data set is broken into three parts: training (train), development (dev), and testing (test).

We tend to break our data into three sets: training, cross-validation or development, and testing.^1 The training data set is used to change the weights and biases of our model, the cross-validation set is used to determine if our model has overfit the data and whether our hyperparameters generalize well to data we have not seen, and the testing set is used to estimate how accurate our model will be once deployed. In the past, the best practice for breaking the data up was to give the training set ∼60% of the data, the development set ∼20% of the data, and the testing set ∼20% of the data. Now, however, our datasets have grown large enough that breaking up the data based on percentages no longer makes sense. Today, we may break up a data set with 1 million examples into a development set with 10 thousand examples and a testing set with 10 thousand examples. It is important to distribute different types of data evenly between the testing and development sets.

3.2.2 Bias and variance

Figure 3.2.2: Different types of variance and bias: high bias (underfitting), "just right," and high variance (overfitting).

When we talk about the variance for a particular model, we are referring to the amount of overfitting in our model. When there is a lot of overfitting, we say the model has high variance. Similarly, when the model is underfitting, we say there is low variance. We use the term bias to describe how well our model fits the data. If our model performs well on both the training and development sets, we say that the model has low bias, and if the model performs badly on both the training and development sets, we say the model has high bias.

^1 Many people may only use a training and testing set, where the testing set acts as if it were a development set.

We often determine variance based on the training set and development set error rates.

Train set error    1%              15%          15%                          0.5%
Dev set error      11%             16%          30%                          1%
Diagnosis          high variance   high bias    high bias, high variance     low bias, low variance

Table 3.1: How we may denote high and low variance and bias when given the training and development set error rates. Here, we assume that the human error is approximately 0%.

Figure 3.2.3: An example of a model having both high variance and high bias.

3.2.3 Combatting bias and variance

With a model, we first check for high bias and then high variance. When a model has high bias, the network is not fitting the training data well. It may be a good idea to test out a larger neural network, run gradient descent longer, or change neural network architectures. When a model has high variance, the development set error will be high. We can improve a high variance problem by introducing new data, regularization, or changing the neural network architecture.

Before neural networks became widely adopted, we often had to decide whether to improve the bias or the variance, but not both. Having to choose one or the other came to be known as the bias-variance tradeoff. However, with larger neural networks we have been able to independently improve both the bias and variance of a network.

3.3 Regularizing a neural network

3.3.1 Regularization

Regularization is a way to improve a high variance model. To show why regularization may be useful, let us consider logistic regression, where our goal was to find

min_{w,b} J(w, b)    (3.3.1)

where w ∈ R^{n_x}, b ∈ R, and

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)).    (3.3.2)

When using regularization we will redefine our cost function to be

J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/2m) ||w||²₂,    (3.3.3)

where we are using the L2 norm defined as

||w||²₂ = Σ_{j=1}^{n_x} w_j² = wᵀw.    (3.3.4)

When using the L2 norm in our regularization term, we say that we are using L2 regularization. The other main type of regularization is L1 regularization, where our regularization term is

(λ/2m) Σ_{j=1}^{n_x} |w_j| = (λ/2m) ||w||₁.    (3.3.5)

L1 regularization tends to result in w being a sparse vector, which is a vector with a lot of 0s and convenient for calculations. L2 regularization tends to be used much more often with neural networks. The term λ is known as the regularization parameter, another hyperparameter often set during the validation process of training the network.

For neural networks, we want to find

min_{W^[1], b^[1], ..., W^[L], b^[L]} J(W^[1], b^[1], ..., W^[L], b^[L]),    (3.3.6)

where our cost function is also defined as

J(W^[1], b^[1], ..., W^[L], b^[L]) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)).    (3.3.7)

The regularization involving multiple bias and weight terms is

J(W^[1], b^[1], ..., W^[L], b^[L]) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/2m) Σ_{ℓ=1}^{L} ||W^[ℓ]||²_F,    (3.3.8)

where the squared norm of a matrix is known as the Frobenius norm, defined as

||W^[ℓ]||²_F = Σ_{i=1}^{n^[ℓ−1]} Σ_{j=1}^{n^[ℓ]} (w^[ℓ]_{ij})².    (3.3.9)

Before we added the regularization term to our cost function, we updated the weights as

W^[ℓ] = W^[ℓ] − α dW^[ℓ],    (3.3.10)

where dW^[ℓ] is the change in the cost function with respect to W. When adding the regularization term, our gradient descent calculation must also include the changes of the regularization term with respect to the weights and biases. When our bias changes, the regularization term does not change, so the gradient descent term for the bias remains the same. However, when our weights change, the derivative of our regularization term turns out to be

d/dW^[ℓ] [(λ/2m) ||W^[ℓ]||²_F] = (λ/m) W^[ℓ].    (3.3.11)

While it may seem strange that the derivative of the Frobenius norm results in a matrix (since the Frobenius norm produces a scalar value), we are really just looking at how the Frobenius norm changes due to changes in the entire matrix, not just one element, which is why the result ends up as a matrix. Now we can rewrite our weight gradient descent expression as

W^[ℓ] = W^[ℓ] − α (dW^[ℓ] + (λ/m) W^[ℓ]).    (3.3.12)

Distributing the α term, we have

W^[ℓ] = W^[ℓ] − α dW^[ℓ] − α (λ/m) W^[ℓ].    (3.3.13)

Notice that with the terms

W^[ℓ] − α (λ/m) W^[ℓ]    (3.3.14)

in our gradient descent function, the addition of the regularization term will always decrease the values of the weights, since α, λ, and m are all nonnegative. We sometimes refer to regularization that strictly decreases the weights as weight decay.
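A minimal sketch of this update in code, assuming dW already holds the unregularized gradient (the function name is illustrative):

```python
def weight_decay_update(W, dW, alpha, lam, m):
    """One gradient step with L2 regularization (Equation 3.3.13).

    The -alpha * (lam / m) * W term shrinks the weights every step,
    which is why this update is called weight decay.
    """
    return W - alpha * dW - alpha * (lam / m) * W
```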

3.3.2 Why regularization reduces overfitting

With the L2 regularization of the cost function

J = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/2m) Σ_{ℓ=1}^{L} ||W^[ℓ]||²_F,    (3.3.15)

making λ a large number will make the term

(1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i))    (3.3.16)

negligible in comparison. In the scenario where λ is very large, we must make the weights very small in order to minimize the cost. If the weights are near zero, they will not have much effect on the neural network, and our neural network will effectively become much simpler. In practice, however, all of the weights become much smaller, not just a select few as shown in Figure 3.3.1. But with all the weights near 0, the model will still become simpler.

Figure 3.3.1: If the weights connected to a neuron are near 0, then the neuron will not make a significant contribution to the hypothesis. With fewer effective neurons in the neural network, the model becomes simpler.

With our linear activation calculation of

z^[ℓ] = W^[ℓ] a^[ℓ−1] + b^[ℓ],    (3.3.17)

if we make W^[ℓ] really small, then z^[ℓ] will end up really small (assuming b^[ℓ] does not make a significant contribution). If we are using the hyperbolic tangent function or the sigmoid function as our activation function, then when z is small we will pretty much have a linear activation function. Recall that with a linear activation function, the depth of our neural network would not matter and the entire model could just be represented with a linear function.

Figure 3.3.2: The hyperbolic tangent function is nearly linear for values of z near 0.
Figure 3.3.3: A 4-layer neural network with 3 hidden layers, each with 4 neurons.

3.3.3 Dropout regularization

Consider the neural network in Figure 3.3.3, which has a total of 12 hidden neurons. For each iteration of gradient descent that we train our neural network on, dropout regularization visits each hidden neuron and randomly chooses whether or not that neuron will be removed from the neural network. The probability that each hidden neuron remains is a hyperparameter that we can set. With a neural network that has a differing number of neurons in each hidden layer, we can make the keep_prob value a hyperparameter in each layer. If a neuron is removed, then all the weights and biases connected to the neuron are also removed. When testing our neural network or once it is deployed, we will no longer use dropout regularization, primarily because we do not want randomized predictions. In computer vision, the input to the neural network is very large, and many people in computer vision use dropout regularly.

To implement dropout regularization, let us set

d^[ℓ] = np.random.rand(a^[ℓ].shape[0], a^[ℓ].shape[1]) < keep_prob,    (3.3.18)

where keep_prob is the probability that we keep a neuron and d^[ℓ] is the same shape as a^[ℓ]. d^[ℓ] will now store a matrix of True or False values. We can then take

a^[ℓ] = a^[ℓ] ∗ d^[ℓ]    (3.3.19)

in order to zero out the values in the matrix where d^[ℓ] is 0 and keep the values where d^[ℓ] is 1. We also want to make sure that z^[ℓ] is not reduced from its expected value without dropout regularization, which can be done by taking

a^[ℓ] = a^[ℓ] / keep_prob.    (3.3.20)

Consider, for example, if we had 100 hidden neurons in a layer and we set keep_prob = 0.8. On average, 20 neurons would be dropped and a^[ℓ] would be reduced by 20%. In order to mitigate this loss, we divide by our keep_prob value of 0.8. Dividing by the hyperparameter keep_prob is known as the inverted dropout technique.

Figure 3.3.4: A potential simplified neural network containing only the hidden neurons that are not removed by dropout regularization.

With dropout regularization, any feature could be removed randomly from the neural network on any given iteration. Because of this, the network learns not to rely on any given feature too much, which regularizes the model. In general, when using dropout regularization, the weights shrink in size.
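Putting Equations 3.3.18 through 3.3.20 together, a minimal inverted-dropout sketch for one layer could look like this (the function name is an assumption):

```python
import numpy as np

def inverted_dropout(a, keep_prob):
    """Apply inverted dropout to one layer's activations during training."""
    d = np.random.rand(a.shape[0], a.shape[1]) < keep_prob  # boolean keep-mask
    a = a * d               # zero out the dropped neurons
    a = a / keep_prob       # scale up so the expected value of a is unchanged
    return a, d             # the mask d is also applied to da during backprop

# At test time we skip dropout entirely and use the activations as-is.
```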

3.3.4 Other regularization methods

While we know that adding more data to our model could help reduce overfitting, we do not always have easy access to new data. Instead, one way to add more data is to augment the existing data that we have, which is known as data augmentation. If we were to augment images, for example, we might reflect, crop, distort, and change the colors of our input data, as shown in Figure 3.3.5.

Figure 3.3.5: Data augmentation on images

Early stopping is another way to regularize our model. With early stopping, we decrease the number of iterations when running gradient descent in hopes of finding a more general model. When using early stopping, the primary downside is that we are no longer trying to fully optimize the cost function, and it is often hard to determine when it is a good time to stop iterating.

3.4 Setting up an optimization problem

3.4.1 Normalizing inputs

Figure 3.4.1: Normal distribution of data vertically centered at 5 with a variance of 2.25.

To normalize a dataset, we must do two things: center the mean at the origin in each direction and set the variance to be 1. Consider the two-dimensional data in Figure 3.4.1. Centering the data in each direction requires us to calculate each direction's mean

μ = (1/m) Σ_{i=1}^{m} x^(i)    (3.4.1)

and then subtract the mean from each point, which can be done with

x = x − μ.    (3.4.2)
The result after centering the data in each direction at 0 is shown in Figure 3.4.2.

Figure 3.4.2: Centering the mean of the data in each direction at 0.

The variance is defined as the square of the standard deviation. In general, the variance is

σ² = (1/m) Σ_{i=1}^{m} (x^(i) − μ)²,    (3.4.3)

but since we have set μ = 0, we can simplify the variance to

σ² = (1/m) Σ_{i=1}^{m} (x^(i))².    (3.4.4)

To normalize the variance, we must set the variance of the entire dataset to be 1, which can be done by dividing the entire dataset by the standard deviation:

x = x/σ.    (3.4.5)

Figure 3.4.3 shows the normalized data after making the variance of the data 1.

Figure 3.4.3: Normalized data after centering the mean at 0 and making the variance 1.

It is important when using a training set and testing set that both sets are normalized with the same μ and σ² in order to best simulate how the model will perform when deployed.

Figure 3.4.4: Potential contour plots of the cost function for (a) normalized and (b) unnormalized data.

Normalizing the data often makes it quicker for gradient descent to converge. Considering the two-dimensional contour plots of a potential cost function in Figure 3.4.4, our goal with normalizing is to make it easier to find the minimum of the cost function. If we have a set of features with a range of (−1000, 1000) and another set of features with a range of (−1, 1), it is often quite hard to pick a learning rate that will work well for both features.
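A small sketch of this normalization in NumPy, where X has shape (n_x, m) as elsewhere in these notes; the function names are illustrative. Note that the dev and test sets reuse the training set's μ and σ²:

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute mu and sigma^2 per feature from the training set only."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.mean((X_train - mu) ** 2, axis=1, keepdims=True)
    return mu, sigma2

def normalize(X, mu, sigma2):
    """Apply the training-set statistics to any split (train, dev, or test)."""
    return (X - mu) / np.sqrt(sigma2)
```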

Figure 3.4.5: A 7-layer neural network with two units per hidden layer.

3.4.2 Data vanishing and exploding gradients

Consider the neural network in Figure 3.4.5, where we will assume the input is non-negative and ignore the bias term for now. Here, we will exclusively use ReLU activations and initialize all of our weights to be of the form

W^[ℓ] = [γ 0; 0 γ],    (3.4.6)

where γ > 0. In the case presented, the ReLU activation function will behave like a linear activation function and we will have

W^[7] W^[6] W^[5] W^[4] W^[3] W^[2] W^[1] a^[0] = ŷ.    (3.4.7)

The multiplication of all of the W^[ℓ] will simplify to

Π_{ℓ=1}^{7} W^[ℓ] = [γ 0; 0 γ]^7 = [γ^7 0; 0 γ^7].    (3.4.8)

When we initialize γ < 1 we will end up multiplying our input by an exponentially small number, and when we initialize γ > 1 the multiplication will be exponentially large. Consider, for example, if we set γ = 0.5, then our input will be scaled down by a factor of γ^7 = 0.0078125. Similarly, if we set γ = 1.5, then the input will be scaled up by γ^7 = 17.0859375. When the prediction of our model is a huge number, the gradients in back propagation will explode, and when our prediction is near zero, the gradients will have a negligible impact.

3.4.3 Weight initialization for deep neural networks

For each neuron, we calculate the linear activation to be

z = w₁x₁ + w₂x₂ + ··· + w_n x_n.    (3.4.9)

For the moment, we will ignore the bias term. Notice that when n ≫ 0, we would like the weights to be small in order to keep z from exploding. Similarly, when n is near 0, making the weights larger makes the most sense in order to have the input make a contribution. One solution is to initialize our weights to have a variance of

σ² = 1/n.    (3.4.10)

We previously calculated the variance of a data set centered at zero to be

σ² = (1/m) Σ_{i=1}^{m} (x^(i))².    (3.4.11)

Starting with a random normally distributed data set with a variance of σ² = 1, we can change the variance to ζ by multiplying each instance by √ζ. We can show that this is correct by carrying out the multiplication on each instance and substituting in the original variance of 1, which gives us

σ² = (1/m) Σ_{i=1}^{m} (x^(i) √ζ)²    (3.4.12)
   = ζ [(1/m) Σ_{i=1}^{m} (x^(i))²]    (3.4.13)
   = ζ.    (3.4.14)

So, if we wanted to make the variance 1/n, we would initialize our weights to be

W^[ℓ] = np.random.randn(shape) ∗ np.sqrt(1/n^[ℓ−1]).    (3.4.15)

The modern best practice tends to set the variance of weights used with the ReLU function to be

σ² = 2/n^[ℓ−1],    (3.4.16)

which is known as He initialization. With a hyperbolic tangent function we generally stick with

σ² = 1/n^[ℓ−1],    (3.4.17)

which is known as Xavier initialization. Occasionally, some people may also set the variance to be

σ² = 2/(n^[ℓ−1] + n^[ℓ]).    (3.4.18)
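As a sketch, He initialization for a whole network might be implemented as follows; the layer sizes and names are illustrative, and swapping the 2.0 for 1.0 gives Xavier initialization:

```python
import numpy as np

def initialize_he(layer_sizes, seed=0):
    """He initialization: each W[l] has variance 2 / n[l-1], suited to ReLU."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n = layer_sizes[l - 1], layer_sizes[l]
        params["W" + str(l)] = rng.standard_normal((n, n_prev)) * np.sqrt(2.0 / n_prev)
        params["b" + str(l)] = np.zeros((n, 1))
    return params
```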

3.4.4 Gradient numerical approximations

Numerical approximations of our gradient can be useful when debugging a network and testing if the forward and backward propagation steps are working correctly. From the definition of the derivative, we know that the one-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ)) / ε    (3.4.19)

will give us the slope of the line tangent to f at θ. We could also use the two-sided limit

f′(θ) = lim_{ε→0} (f(θ + ε) − f(θ − ε)) / (2ε)    (3.4.20)

to calculate the derivative, which will produce an equivalent result. The two-sided limit shown in Figure 3.4.6 works best for numerical approximations because it gets a more accurate picture of the slope on both sides.

Figure 3.4.6: Two-sided limit derivative definition

The error when using the two-sided limit is O(ε²) and the error when using a one-sided limit is O(ε). Although it is twice as computationally expensive, it is often worth using the two-sided limit.


3.4.5 Gradient checking

Using the numerical approximations of the derivative, our goal is to check if our calculations implemented in back propagation are correct. To simplify calculations, we will store each entry of all of our parameters W^[1], b^[1], ..., W^[L], b^[L] in a single large vector denoted θ. Similarly, we will store dW^[1], db^[1], ..., dW^[L], db^[L] in a vector dθ. We can now write our cost function as

J(θ) = J(θ₁, θ₂, θ₃, ...).    (3.4.21)

For each element i of our vector θ, we can approximate the derivative with

dθ̃ᵢ = (J(θ₁, θ₂, ..., θᵢ + ε, ...) − J(θ₁, θ₂, ..., θᵢ − ε, ...)) / (2ε).    (3.4.22)

If our calculations are correct, then dθ̃ should approximately equal dθ. We can check our error rate E by taking the normalized Euclidean distance between the vectors:

E = ||dθ̃ − dθ||₂ / (||dθ̃||₂ + ||dθ||₂).    (3.4.23)

In practice, if we set ε = 10⁻⁷, then with E < 10⁻⁷ we can generally confirm that our calculations are correct, and with E > 10⁻³ we should double-check our calculations. For values in between, it is harder to determine correctness. For debugging, it is best to first check for bugs where the differences between dθ̃ᵢ and dθᵢ are the greatest.

We can run gradient checking with L2 regularization by making sure to include our regularization term in the cost function. However, because of the randomness of dropout regularization, we cannot use gradient checking with dropout on.
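A minimal sketch of the whole procedure, assuming the cost J is available as a function of the flattened parameter vector θ (the function and argument names are illustrative):

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Compare analytic gradients dtheta against the two-sided numerical
    approximation of Equations 3.4.22 and 3.4.23."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(dtheta_approx - dtheta)
    denom = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / denom  # values around 1e-7 suggest the gradients are correct

# Quick self-test: J(theta) = sum(theta^2) has gradient 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
```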

3.5 Optimization algorithms

3.5.1 Mini-batch gradient descent

Deep learning works best when there is a lot of data and we can quickly optimize our model. We have previously been using gradient descent on all of our training examples with our vectorized input

X = [x^(1) x^(2) ··· x^(m)],    (3.5.1)

where X ∈ R^{n_x × m}, and our vectorized output

Y = [y^(1) y^(2) ··· y^(m)],    (3.5.2)

where Y ∈ R^{1 × m}.

The gradient descent we have previously been using is known as batch gradient descent. If m ≫ 0, then it may take a while to compute each iteration of gradient descent. For now, suppose that our training set is of size m = 5,000,000. Instead of using all m training examples for each iteration of gradient descent, let us break the input into groups of 1000, which we will call mini-batches. For example, the first mini-batch will be

X^{1} = [x^(1) x^(2) ··· x^(1000)]
Y^{1} = [y^(1) y^(2) ··· y^(1000)],    (3.5.3)

where we denote the ith mini-batch with superscript {i}. In total, we will have 5000 mini-batches. Next, we make a pass through the training set as described in Algorithm 3. Each pass through the training set using any form of gradient descent is known as an epoch.

With batch gradient descent, our cost will ideally always be decreasing. That is not the case with mini-batch gradient descent, because each mini-batch does not reflect the entire training set and our gradient may not step in the perfect direction that minimizes our cost.


Algorithm 3 A single pass through the training set using mini-batch gradient descent.

for t ∈ {1, ..., 5000} do
    Forward propagation on X^{t}
    Compute the cost J^{t}
    Backprop to compute gradients
    Update the weights and biases
end for

Figure 3.5.1: How the cost function could decrease based on the choice of gradient descent: (a) batch gradient descent (cost vs. epoch), (b) mini-batch gradient descent (cost vs. mini-batch iteration t).

When setting the hyperparameter that represents the size of the mini-batch, there are two extreme cases. First, if we set the mini-batch size equal to m, then we just have our old method of batch gradient descent. If we set the mini-batch size to 1, then each example is a mini-batch; we call this stochastic gradient descent. With stochastic gradient descent, we cannot use vectorization, and the cost will not typically converge to a single value; instead, it will jump around near the minimum. In practice, we often choose a mini-batch size in between the two extremes in order to preserve vectorization and make faster progress.

The modern best practice in the field is to use batch gradient descent when m ≤ 2000; otherwise, we typically use a mini-batch size of 2⁶, 2⁷, 2⁸, or 2⁹ in order to store data most efficiently.
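A minimal sketch of how the mini-batches might be built, keeping the (n_x, m) column convention used throughout these notes (the function name is an assumption):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=256, seed=0):
    """Shuffle the m example columns of X and Y, then split into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, t:t + batch_size], Y_shuf[:, t:t + batch_size])
        for t in range(0, m, batch_size)  # the last batch may be smaller
    ]
```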

3.5.2 Exponentially weighted averages

Figure 3.5.2: Data set used to calculate the exponentially weighted average.

To describe other optimization algorithms that may work better than gradient descent, we first need to introduce the concept of exponentially weighted averages. Consider the dataset in Figure 3.5.2, where

θ = [θ₁, θ₂, ..., θ₁₀₀]ᵀ = [1124, 1184, ..., 1546]ᵀ.    (3.5.4)

To calculate the exponentially weighted average, we start by initializing v₀ = 0 and recursively defining

v_t = 0.9 v_{t−1} + 0.1 θ_t    (3.5.5)

for t ∈ {1, ..., 100}. After computing all of the v_t points, we can plot the joined data as shown in red in Figure 3.5.3.

Figure 3.5.3: Exponentially weighted average as a joined plot (red).

In general, we can set

v_t = β v_{t−1} + (1 − β) θ_t,    (3.5.6)

where 0 ≤ β < 1. The choice of β sets the value of v_t to be approximately the average over the last

1/(1 − β)    (3.5.7)

data points. For example, when we set β = 0.9, each value of v_t closely mirrored the average of the previous 10 instances. Additional changes to the value of β are shown in Figure 3.5.4. In statistics, the exponentially weighted average may also be referred to as the exponentially weighted moving average.

Figure 3.5.4: β = 0.01 (green), β = 0.6 (purple), β = 0.95 (orange).

To better understand exponentially weighted averages, let us set β = 0.9 and look at the sequence

v₁₀₀ = 0.9 v₉₉ + 0.1 θ₁₀₀
v₉₉ = 0.9 v₉₈ + 0.1 θ₉₉
v₉₈ = 0.9 v₉₇ + 0.1 θ₉₈
⋮    (3.5.8)

Using substitution, we can write

v₁₀₀ = 0.9 (0.9 (0.9 (v₉₇) + 0.1 θ₉₈) + 0.1 θ₉₉) + 0.1 θ₁₀₀    (3.5.9)
     = 0.1 θ₁₀₀ + (0.1 × 0.9) θ₉₉ + (0.1 × 0.9²) θ₉₈ + (0.1 × 0.9³) θ₉₇ + ···    (3.5.10)
     = Σ_{t=1}^{100} (0.1 × 0.9^{100−t}) θ_t.    (3.5.11)

Visually, the weight contributed by each θ_t term decreases exponentially as we move back from θ₁₀₀, as shown in Figure 3.5.5.

Figure 3.5.5: v₁₀₀ is the sum of the element-wise products between the exponentially decaying scale factors and the data.

The time constant τ is the time it takes for an exponentially decaying plot to reach 1/e of its maximum. From our decay equation, we have

β^{t−τ} = 1/e.    (3.5.12)

Solving for τ, we find the time constant to be

τ = t + log_β(e),    (3.5.13)

where −log_β(e) represents the number of data points we go back in order to capture 1 − 1/e of the value of v_t. The approximation in Equation 3.5.7 is a simplified version of −log_β(e) and is often used because it gives a close enough answer to the true time constant in our domain for β. The differences between −log_β(e) and 1/(1 − β) are shown in Figure 3.5.6.

Figure 3.5.6: Approximation of the time constant τ = 1/(1 − β) (red) and the real time constant τ = −log_β(e) (blue).

We generally choose to use this definition of a weighted average instead of truly calculating

the average in a particular window because it is easier to compute and requires less memory

storage.

Figure 3.5.7: Data for bias correction analysis, where the red line is the exponentially weighted average with β = 0.5.

3.5.3 Bias correction for exponentially weighted averages

Notice that at the beginning of the plot in Figure 3.5.7, the weighted average starts off with fairly small values and grows relatively slowly. Let us denote ε = 1 − β. Because of how we defined our exponentially weighted average, each plot will start with

v₀ = 0
v₁ = θ₁ ε
v₂ = β θ₁ ε + θ₂ ε
v₃ = β² θ₁ ε + β θ₂ ε + θ₃ ε.    (3.5.14)

We can more easily interpret this by plugging in a few values of β and analyzing the results, which is done in Table 3.2.

β     0.1                             0.5                            0.9
v₁    0.9 θ₁                          0.5 θ₁                         0.1 θ₁
v₂    0.09 θ₁ + 0.9 θ₂                0.25 θ₁ + 0.5 θ₂               0.09 θ₁ + 0.1 θ₂
v₃    0.009 θ₁ + 0.09 θ₂ + 0.9 θ₃     0.125 θ₁ + 0.25 θ₂ + 0.5 θ₃    0.081 θ₁ + 0.09 θ₂ + 0.1 θ₃

Table 3.2: Calculations of the first three v_t terms.

Notice that when β is large, v_t starts off growing the slowest. If we set each value of θ_n to 1, then we get the total amount by which we are scaling our input. We would expect the scaling factor to be 1 in the general case, because when averaging n numbers that are each 1, the average is 1. However, when β = 0.9 our total scaling factor is 0.1 for v₁, 0.19 for v₂, and 0.271 for v₃, all much lower than 1.

If we instead define our recursive v_t equation as

v_t = (β v_{t−1} + (1 − β) θ_t) / (1 − β^t),    (3.5.15)

then we will scale up the first few instances of the data without affecting the later entries, as shown in Figure 3.5.8. The denominator of Equation 3.5.15 is known as the bias correction term for exponentially weighted averages.

Figure 3.5.8: Weighted average of our data with bias correction (purple), without bias correction (red).

In practice, most people do not actually calculate the bias correction because it does not make much of a difference on large datasets.
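For reference, a small sketch computing the exponentially weighted average of a sequence with optional bias correction (Equation 3.5.15); the names are illustrative:

```python
import numpy as np

def exp_weighted_average(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a 1-D sequence of data points."""
    v, out = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta        # Equation 3.5.6
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)
```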

3.5.4 Gradient descent with momentum

In gradient descent, when we are far from an optimal point in our cost function, the gradients will be larger numbers. We are typically far at the beginning, taking large steps but often not in a great direction. Figure 3.5.9 illustrates a potential path through the contour plot.

Figure 3.5.9: Contour plot of the cost function with the minimum in red and our path from gradient descent in blue.

Ideally, we would like to have a larger learning rate in the horizontal direction and a smaller learning rate in the vertical direction, in order to step more directly towards the minimum. Looking at the directions of the previous iterations of gradient descent, we can get a pretty good feel for the direction of the minimum by taking a weighted average. In our figure, the vertical weighted average would be near 0, while the horizontal weighted average would point towards the center. Using the exponentially weighted average with gradient descent is referred to as gradient descent with momentum. Gradient descent with momentum has two hyperparameters, α and β.

Algorithm 4 Gradient descent with momentum

v_dW, v_db = 0
repeat
    Compute dW, db with respect to J
    v_dW = β v_dW + (1 − β) dW
    v_db = β v_db + (1 − β) db
    W = W − α v_dW
    b = b − α v_db
until the cost function J converges

Consider how a ball rolling down the sides of a bowl will find the minimum by taking a step in the direction of the gradient. Here, the equation

v_dW = β v_dW + (1 − β) dW    (3.5.16)

has a loose analog to the physical system of a ball rolling down the sides of a bowl, where dW represents acceleration, v_dW represents velocity, and β represents friction.

Finally, some people may change the exponentially weighted average term to be

v_dW = β v_dW + dW,    (3.5.17)

where they omit (1 − β) in hopes of avoiding overfitting; however, this would require us to make less intuitive changes to our learning rate α in order for the cost function to converge quickly.

3.5.5 RMSprop

Figure 3.5.10: Plotting the gradient in each direction, denoting the vertical axis as the bias b and the horizontal axis as the weight w.

From Figure 3.5.10, notice that it would be more beneficial to move further in the direction with less oscillation than it would be to continue taking large steps vertically. One way to determine the amount of oscillation is to calculate

S_dW = β₂ S_dW + (1 − β₂) dW²
S_db = β₂ S_db + (1 − β₂) db²,    (3.5.18)

where the square operation is element-wise and β₂ is a new hyperparameter we are introducing that works similarly to β.²

Squaring the rate-of-change term makes it positive and rewards small values while making large values more costly. In our example, the changes in the horizontal direction, which correspond to the changes in the weights, are relatively small, so S_dW will be relatively small. In contrast, S_db will be huge because the vertical changes are massive. Now, if we update the weights and bias with

w = w − α dW/(√S_dW + ε)
b = b − α db/(√S_db + ε),    (3.5.19)

we will end up moving more in the direction that does not oscillate and less in the direction that oscillates a lot. Here, we include ε in order to avoid a division-by-zero error, where ε is commonly set to a small value such as ε = 10⁻⁸. The algorithm described is known as root mean square prop (RMSprop) and is detailed in Algorithm 5.

Algorithm 5 RMSprop

S_dW, S_db = 0
repeat
    Compute dW, db with respect to J
    S_dW = β₂ S_dW + (1 − β₂) dW²
    S_db = β₂ S_db + (1 − β₂) db²
    w = w − α dW/(√S_dW + ε)
    b = b − α db/(√S_db + ε)
until the cost function J converges

² We introduce the new hyperparameter β₂ primarily to prepare for the next section, which combines RMSprop and momentum.


3.5.6 Adam optimization algorithm

The Adam optimization algorithm combines both RMSprop and momentum and has been shown to work well across plenty of optimization problems. The complete algorithm is shown in Algorithm 6. Although we did not use bias correction with momentum or RMSprop, it is most common to use bias correction with the Adam algorithm.

Algorithm 6 Adam optimization

S_dW, S_db, v_dW, v_db = 0, t = 0
repeat
    t = t + 1
    Compute dW, db with respect to J
    v_dW = β₁ v_dW + (1 − β₁) dW,   v_db = β₁ v_db + (1 − β₁) db
    S_dW = β₂ S_dW + (1 − β₂) dW²,   S_db = β₂ S_db + (1 − β₂) db²
    v_dW = v_dW / (1 − β₁ᵗ),   v_db = v_db / (1 − β₁ᵗ)
    S_dW = S_dW / (1 − β₂ᵗ),   S_db = S_db / (1 − β₂ᵗ)
    w = w − α v_dW/(√S_dW + ε)
    b = b − α v_db/(√S_db + ε)
until the cost function J converges

From the Adam optimization paper (Kingma et al.), the default values for the hyperparameters are

β₁ = 0.9    (3.5.20)
β₂ = 0.999    (3.5.21)
ε = 10⁻⁸.    (3.5.22)

The learning rate α does not have a default value and still needs to be tuned. Adam stands for adaptive moment estimation, and the terms β₁ and β₂ control the first and second moments, respectively.
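A minimal sketch of a single Adam step for one parameter array, following Algorithm 6; the function signature is an assumption, and in a full network this would be applied to every W^[ℓ] and b^[ℓ]:

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t counts the update steps starting at 1."""
    v = beta1 * v + (1 - beta1) * dw           # momentum (first moment)
    s = beta2 * s + (1 - beta2) * dw ** 2      # RMSprop (second moment)
    v_hat = v / (1 - beta1 ** t)               # bias-corrected moments
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```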

3.5.7 Learning rate decay

Figure 3.5.11: Gradient descent may take large steps when near the minimum, bouncing around and never quite converging.

Learning rate decay refers to reducing the size of the learning rate as the number of iterations increases. Consider the example in Figure 3.5.11, where the steps taken near the center appear to never quite reach the minimum and stay there. Instead, if we took smaller steps when closer to the minimum, we could more easily settle on the values gradient descent is converging towards. One way to make α decay is by setting

α = α₀ / (1 + decay_rate × epoch_num),    (3.5.23)

where α₀ is the initial learning rate, decay_rate is a hyperparameter that we must set, and epoch_num is the number of times we have iterated through the entire training set. A few other ways to decay α include exponential decay

α = α₀ (0.95^{epoch_num}),    (3.5.24)

or

α = k α₀ / √epoch_num.    (3.5.25)

Occasionally, people may use a discrete staircase to change the learning rate, in which case the value changes after a set number of iterations, or we can go in and manually change the decay rate.

3.5.8 The problem with local optima

Back in the day, deep learning practitioners feared that their algorithms would find some local minimum and get trapped inside it, due to the gradient being 0 at a local optimum. In higher dimensional spaces, however, it is quite difficult for local optima to form, because it would require every single direction to have a minimum at the same point. In general, we are more likely to see saddle points and plateaus. Plateaus tend to be the biggest problem because they slow down the rate of gradient descent and could possibly fool us into believing that we have found a global minimum if we stop running gradient descent early.

3.6 Hyperparameter Tuning

3.6.1 Tuning process

With deep neural networks, there are quite a few hyperparameters that need to be tuned, including the learning rate, momentum term, number of layers, number of hidden units, learning rate decay, and mini-batch size. The learning rate tends to be the hyperparameter that can vary the most and is often the hardest to tune. The next set of hyperparameters that can make a significant difference include the momentum term, the number of hidden units, and the mini-batch size. The hyperparameters for the number of layers and the learning rate decay may make a difference, but not to the degree of the other hyperparameters mentioned.

When testing out choices for hyperparameters, it is often best to start with a reasonable range of values that could potentially work for each hyperparameter. Then, rather than changing one hyperparameter at a time in a grid-like way, it is best to randomly choose values within each hyperparameter's range, which allows us to test many more values. For example, in Figure 3.6.1, randomly choosing hyperparameter 1 each time allows us to test 25 different values, while a grid choice would only allow for 5 different values.

If we find that a section of the choices for our hyperparameters is producing good results, we can zoom in to that section and focus more on the hyperparameters there, narrowing down the range of values. This is known as coarse to fine and is shown in Figure 3.6.2.

3.6.2 Using an appropriate scale to pick hyperparameters

Randomly testing hyperparameter values tends to work out well for things like the number of hidden units in each layer and the number of total layers. In each of these cases, we may only be choosing from around 50 different integer values. For the learning rate, however, the range might be from 0.0001 to 1; if we randomly choose values on a linear scale, then 90% of the values we choose will be between 0.1 and 1. Instead of a linear scale, we may use a log scale to divide up the values.

To choose α from the logarithmic scale in Figure 3.6.3, we could first choose a random variable r between 0 and −4, and then take

α = 10^r.    (3.6.1)

Figure 3.6.1: Ways we can go about choosing different hyperparameters: (a) grid choice of hyperparameters, (b) random choice of hyperparameter values.

Figure 3.6.2: Coarse to fine, where we focus on the region of the choices for hyperparameters that has produced the most promising results.

Another problem that may arise is sampling the values of β, which are often between 0.9 and 0.999. If we instead look for values of 1 − β, we will be searching for values between 0.1 and 0.001. Now, using the logarithmic scale just introduced, we can pick a random variable r between −3 and −1 and set

β = 1 − 10^r.    (3.6.2)
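Both sampling schemes are one-liners; a brief sketch using the ranges from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning rate: r uniform in [-4, 0] gives alpha log-uniform in [1e-4, 1].
r = rng.uniform(-4, 0)
alpha = 10 ** r

# Momentum term: r uniform in [-3, -1] gives beta in [0.9, 0.999].
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```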

3.6.3 Hyperparameters tuning in practice: pandas vs caviar

For models that we do not have immense GPU power to run, we may only be able to train a single model over a certain amount of time. During this training period, we can proactively adjust the hyperparameters as the model is training, attempting to improve the process at each change. Here, we are taking a careful approach to make sure we get everything right on this model as best we can, since we will only be testing a single model at a time. This is known as babysitting one model and is considered the pandas approach, because pandas take careful care of their children. In contrast, if we have multiple GPUs to run our model across, we may run a bunch of tests in parallel and use the hyperparameters that give us the best results, which is known as the caviar approach.

Figure 3.6.3: The bin sizes for random variables on a linear scale (top) versus a logarithmic scale (bottom).

3.7 Batch normalization

3.7.1 Normalizing activations in a network

Previously, we have been normalizing our inputs a^[0] to more easily train W^[1] and b^[1]. Batch normalization takes this a step further and normalizes a^[ℓ] in order to better train W^[ℓ+1] and b^[ℓ+1]. In practice, people are more likely to normalize z^[ℓ] instead of a^[ℓ], so that is what we will focus on.

To calculate batch normalization on layer ℓ, we take the batch of values {z^[ℓ](1), ..., z^[ℓ](m)}, where

μ = (1/m) Σ_{i=1}^{m} z^[ℓ](i)    (3.7.1)

and

σ² = (1/m) Σ_{i=1}^{m} (z^[ℓ](i) − μ)².    (3.7.2)

Just in case σ² turns out to be 0, we will introduce a hyperparameter ε and set the normalization of z to be

z_norm^[ℓ](i) = (z^[ℓ](i) − μ) / √(σ² + ε),    (3.7.3)

where ε is some small number like 10⁻⁸. We may not always want the mean to be 0 and the variance to be 1, so instead we will set

z̃^[ℓ](i) = γ z_norm^[ℓ](i) + β,    (3.7.4)

where both the new standard deviation γ and the new mean β can be learned by the model and are updated after each gradient descent iteration.
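A minimal sketch of this training-time batch norm transformation for one layer, where z has shape (n^[ℓ], m); the function name is illustrative:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z over the mini-batch, then rescale with the learned
    gamma and beta (Equations 3.7.1-3.7.4)."""
    mu = np.mean(z, axis=1, keepdims=True)
    sigma2 = np.var(z, axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(sigma2 + eps)
    return gamma * z_norm + beta
```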

3.7.2 Fitting batch norm in a neural network

To add batch normalization to our model, each neuron now has to calculate the linear activation, then the batch normalization of the linear activation, and then the non-linear activation. For each layer ℓ, the batch normalization step has the parameters β^[ℓ] and γ^[ℓ], as shown in Figure 3.7.1. For the entire neural network, our parameters will be W^[ℓ], b^[ℓ], β^[ℓ], and γ^[ℓ] for ℓ ∈ {1, ..., L}, with our original gradient descent step updating to

W^[ℓ] = W^[ℓ] − α dW^[ℓ]
b^[ℓ] = b^[ℓ] − α db^[ℓ]
β^[ℓ] = β^[ℓ] − α dβ^[ℓ]
γ^[ℓ] = γ^[ℓ] − α dγ^[ℓ].    (3.7.5)
Figure 3.7.1: With batch normalization we now have three calculations at each node: the linear activation z^[ℓ], the batch normalization z̃^[ℓ] (using β^[ℓ], γ^[ℓ]), and the non-linear activation a^[ℓ].

In our current process, we have

z^[ℓ] = W^[ℓ] a^[ℓ−1] + b^[ℓ],    (3.7.6)

which is then normalized with

z̃^[ℓ] = γ^[ℓ] ((z^[ℓ] − μ) / √(σ² + ε)) + β^[ℓ]    (3.7.7)
      = γ^[ℓ] ((W^[ℓ] a^[ℓ−1] + b^[ℓ] − μ) / √(σ² + ε)) + β^[ℓ]    (3.7.8)
      = γ^[ℓ] ((W^[ℓ] a^[ℓ−1] − μ) / √(σ² + ε)) + β^[ℓ] + b^[ℓ] (γ^[ℓ] / √(σ² + ε)).    (3.7.9)

Here, notice that the bias term is not really adding any new information, and we can combine the β^[ℓ] term with the b^[ℓ] term into a single term. For simplicity, we will choose to remove the term b^[ℓ] and keep β^[ℓ], giving us the updated parameters γ^[ℓ], W^[ℓ], and β^[ℓ].

Currently, we have been using batch normalization with the entire training set, but if we were to break the data up into mini-batches, we would go through a similar process using batch normalization on each mini-batch. The complete algorithm for gradient descent with batch normalization is described in Algorithm 7.

Algorithm 7 Gradient descent with batch normalization

for t = 1 to the number of mini-batches do
    Forward propagate on X^{t}
    In each hidden layer, update z^[ℓ] with batch normalization
    Backward propagate to compute dW^[ℓ], dβ^[ℓ], dγ^[ℓ]
    Update the parameters with W^[ℓ] = W^[ℓ] − α dW^[ℓ], ...
end for

3.7.3 Batch norm at test time

Recall that for each mini-batch, we calculated

μ = (1/m) Σ_{i=1}^{m} z^(i)
σ² = (1/m) Σ_{i=1}^{m} (z^(i) − μ)²
z_norm^(i) = (z^(i) − μ) / √(σ² + ε)
z̃^(i) = γ z_norm^(i) + β,    (3.7.10)

which relies on having an entire batch of data, which we do not have at test time. In practice, we estimate μ and σ² with an exponentially weighted average across mini-batches. For example, if we had three mini-batches, then for each layer ℓ we would find μ^{1}[ℓ], μ^{2}[ℓ], μ^{3}[ℓ] and σ²^{1}[ℓ], σ²^{2}[ℓ], σ²^{3}[ℓ] and calculate the exponentially weighted average of both μ and σ². At test time, we would compute

z_norm^(i) = (z^(i) − μ) / √(σ² + ε),    (3.7.11)

where μ and σ² come from the latest calculation of our exponentially weighted average.

3.8 Multi-class classification

3.8.1 Softmax regression

Currently, we have only been dealing with binary classification problems, but it would be much more useful to be able to handle multi-class classification.

Figure 3.8.1: Multi-class classification with the labels 0 → cat, 1 → dog, 2 → lion, and 3 → zebra.

We will denote C to be the number of classes, so in Figure 3.8.1, C = 4. The number of classes corresponds to the number of neurons in the last layer of the neural network, so

n^[L] = C,    (3.8.1)

where each neuron corresponds to the probability of each class, as shown in Figure 3.8.2.

From the laws of probability, we know that the sum of the final layer must be 1. We will start off, as usual, calculating the linear activation in the final layer, but we must change the activation function. For our activation function, let us first calculate the element-wise exponentiation of each linear activation

t = e^{z^[L]},    (3.8.2)

where t is a temporary variable and t ∈ R^{C×1}. Next, we will normalize the vector t, which gives our activation function

a^[L] = e^{z^[L]} / Σ_{j=1}^{C} t_j,    (3.8.3)
Figure 3.8.2: Each neuron in the final layer corresponds to the probability of a class: a^[L]_1 = P(class = 0 | X), a^[L]_2 = P(class = 1 | X), a^[L]_3 = P(class = 2 | X), a^[L]_4 = P(class = 3 | X).

where a^[L] ∈ R^{C×1}. Equation 3.8.3 is again an element-wise function, where the ith element is

a^[L]_i = t_i / Σ_{j=1}^{C} t_j.    (3.8.4)

As a concrete example, suppose we have

z^[L] = [5, 2, −1, 3]ᵀ.    (3.8.5)

We will first exponentiate each element,

t = e^{z^[L]} = [e⁵, e², e⁻¹, e³]ᵀ ≈ [148.4, 7.4, 0.4, 20.1]ᵀ,    (3.8.6)

and calculate

Σ_{j=1}^{4} t_j ≈ 176.3.    (3.8.7)

Next, we will divide each element in t by the sum of all the values in t, which gives us the probability of each class:

a^[L] = [e⁵/176.3, e²/176.3, e⁻¹/176.3, e³/176.3]ᵀ ≈ [0.842, 0.042, 0.002, 0.114]ᵀ.    (3.8.8)

The activation function we have been using on the final layer of our neural network is known as the softmax activation function. The name softmax is in contrast to a hardmax classifier, which would output 1 for the max and 0 for everything else. In our example, the hardmax classifier would output

a^[L] = [1, 0, 0, 0]ᵀ.    (3.8.9)
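The worked example above is easy to reproduce; a short sketch (the max subtraction is a standard numerical-stability trick that does not change the output, and is an addition to the formula in the text):

```python
import numpy as np

def softmax(z):
    """Softmax activation (Equation 3.8.3)."""
    t = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return t / np.sum(t)

z = np.array([5.0, 2.0, -1.0, 3.0])
print(softmax(z))  # approximately [0.842, 0.042, 0.002, 0.114]
```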

3.8.2 Training a softmax classifier

As an example, suppose our neural network outputs

ŷ = [0.3, 0.2, 0.1, 0.4]ᵀ,    (3.8.10)

where the actual labels are

y = [0, 1, 0, 0]ᵀ.    (3.8.11)

The loss function we will use for a softmax classifier is defined as

L(ŷ, y) = −Σ_{j=1}^{C} y_j log ŷ_j,    (3.8.12)

which leaves only the prediction for the positive label contributing to the loss, so we can simplify our loss to

L(ŷ, y) = −y₂ log ŷ₂ = −log ŷ₂.    (3.8.13)

In order to minimize our loss, we want to make ŷ₂ as close to 1 as possible. If ŷ₂ is near 0⁺, then the loss approaches ∞. For a multi-class classifier, we use the same definition of the cost function as before, with

J = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)).    (3.8.14)

We can calculate the backward propagation step through a softmax classifier with

∂J/∂z^[L] = ŷ − y.    (3.8.15)

Chapter 4

Applied deep learning

4.1 Orthogonalization

We will use the term orthogonalization to refer to changes in one hyperparameter not influencing the rest of the hyperparameters. For instance, in a self-driving car, we would not want changes in the acceleration to have any effect on the steering. In machine learning, we primarily have to fit the training set well, fit the dev set well, fit the test set well, and then maintain the model on real-world examples. If one of these steps is not working well, then we would like to adjust a set of hyperparameters that has little impact on the other steps.

4.2 Setting up goals

4.2.1 Single number evaluation metric

Using a binary cat classification example, we will say that precision refers to the percentage of actual cats among the images our model recognized as cats. Recall refers to the percentage of actual cats that are correctly labeled.

Classifier    Precision    Recall
A             95%          90%
B             98%          85%

Table 4.1: Example of precision and recall for two different classifiers

Suppose our classifiers give us the precision and recall data in Table 4.1, where one classifier gives us better precision, while the other gives us better recall. With two evaluation metrics, determining the best classifier is hard to do. Instead, we can combine precision and recall into a single metric known as the F1 Score, where

F1 Score = 2 / (1/P + 1/R),    (4.2.1)

where P represents precision and R represents recall. This way of combining the two is known as the harmonic mean. Now, using the F1 Score, we can simply say that classifier A is the better classifier.

When choosing metrics, we typically first want to define a metric and then have a completely separate process that deals with optimizing that metric. It is often best to quickly choose an evaluation metric and then change that metric in the future if there is a significant difference in how data performs on the development set versus the testing set versus the real world.

Classifier    Precision    Recall    F1 Score
A             95%          90%       92.4%
B             98%          85%       91.0%

Table 4.2: Calculating the F1 Score for each classifier
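The numbers in Table 4.2 follow directly from Equation 4.2.1; a quick check:

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall (Equation 4.2.1)."""
    return 2 / (1 / p + 1 / r)

print(f1_score(0.95, 0.90))  # ~0.924, classifier A
print(f1_score(0.98, 0.85))  # ~0.910, classifier B
```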

4.2.2 Train/dev/test distributions

For our development and test sets, we would like instances to come from the same distribution of our data. In particular, we would like each of these sets to model the data we expect to receive in the future as well as possible. The size of each set may vary. Our test set should be large enough to give us high confidence if our model performs well on it.

4.3 Comparing to human-level performance

Figure 4.3.1: How our model's accuracy could behave over time, relative to human-level performance and Bayes optimal error.

In cases such as image recognition or speech detection, we tend to be able to create models that rapidly approach human-level performance, but then the rate at which the accuracy improves slows down. One reason for the plateau after human-level performance is achieved is that human-level and Bayes optimal error are not too far apart on tasks like recognizing images. When our model performs worse than human-level performance, we still have ways to improve, such as gathering more training data and manual error analysis. Our model will never surpass the Bayes optimal error, which is the best error rate we could possibly achieve, as shown in Figure 4.3.1. It will not always be 100%, because in scenarios where images are blurred, for example, it may be impossible to extract information about what is in the image.

4.3.1 Avoidable bias

We will hardly ever have an actual number for the Bayes optimal error. Instead, we use human-level error to approximate it. In Table 4.3, we have two tasks that have the same training and development errors but differ in human error. In task A, where our training error is quite far from the human error, we likely want to focus on reducing the bias in our model. In contrast, when our training error is close to the human error, as in task B, it may be best to focus on reducing the variance in the model. We say that the difference between our approximation of the Bayes optimal error and our training error is the avoidable bias, and that the difference between the training error and the development error is the variance.

| Task | A | B |
| --- | --- | --- |
| Human Error | 1% | 7.5% |
| Training Error | 8% | 8% |
| Development Error | 10% | 10% |

Table 4.3: Example error rates for two tasks with different human error rates.
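The split above amounts to two subtractions; a small sketch, with error rates taken from Table 4.3:

```python
def diagnose(human_error, train_error, dev_error):
    """Approximate avoidable bias and variance, using human error as a Bayes proxy."""
    return train_error - human_error, dev_error - train_error

print(diagnose(0.01, 0.08, 0.10))   # task A: (0.07, 0.02) -> focus on bias
print(diagnose(0.075, 0.08, 0.10))  # task B: ~(0.005, 0.02) -> focus on variance
```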

4.3.2 Understanding human-level performance

Suppose we have the following medical imaging classification error rates:

- Typical human: 3%
- Typical doctor: 1%
- Experienced doctor: 0.7%
- Team of experienced doctors: 0.5%

and we want to determine a human-level error. Since human-level error is often used as an approximation for the Bayes optimal error, it is best in this case to go with the 0.5% error rate from the team of experienced doctors. For deployment purposes, however, we may have an application that just needs to beat a typical doctor, in which case it may be best to use the typical doctor's error rate of 1% as human-level.

4.3.3 Surpassing human-level performance

If we have the error rates

- Team of humans: 0.5%
- One human: 1%
- Training error: 0.3%
- Dev error: 0.4%

where we have already surpassed our estimate for the Bayes optimal error, then it is hard to determine what we should focus on improving next. There are, nonetheless, many problems where machine learning far outperforms human-level performance, such as online advertising, product recommendations, transit time prediction, and loan approval.

4.4 Error analysis

Suppose that we are working on a cat classifier that has achieved a 10% error rate, with a Bayes optimal error rate of 2%. Further, suppose we want to see whether better recognition of dogs could improve our cat classifier, since it sometimes mislabels a dog as a cat. Before attempting to make our cat recognizer do better on dogs, it is likely best to find out how many images are misclassified in our development set and how many of those images are dogs. If we have around 100 misclassified images in total and only 5 of those images contain dogs, then the ceiling on improvement from recognizing dogs is a 9.5% error rate. If instead 50 of the misclassified images in our development set contain dogs, then improving dog classification would be more worth our time, since the ceiling for our development set error rate would be 5%.

We can also look for multiple categories of misclassified images in parallel. For example, for our cat classifier, we may want to check for dogs, great cats (lions, panthers), and blurry images being misclassified. To check for multiple categories, it may be best to create a table or spreadsheet like the one shown in Table 4.4.

| Image | Dog | Great cats | Blurry | Comments |
| --- | --- | --- | --- | --- |
| 1 | ✓ | | | Pitbull |
| 2 | | | ✓ | |
| 3 | | ✓ | ✓ | Rainy day at zoo |
| Total % | 8% | 43% | 61% | |

Table 4.4: Using a table to represent our misclassified images

4.4.1 Cleaning up incorrectly labeled data

With a large enough training set, it may not matter much if data is randomly labeled

incorrectly. However, for smaller data sets or if data is systematically labeled incorrectly it

may be in our best interest to manually change t h e labels. In our development set, we may

update our table to include a column for incorrectly labeled instances as shown in Table

4.5.

| Image | Dog | Great cats | Blurry | Incorrectly labeled | Comments |
| --- | --- | --- | --- | --- | --- |
| 98 | | | | ✓ | Cat was in background |
| 99 | | | | ✓ | |
| 100 | | | | ✓ | Cat drawing |
| Total % | 8% | 43% | 61% | 6% | |

Table 4.5: Adding an incorrectly labeled category to our table of misclassified images

We tend to only adjust the labels in our development set when the changes could produce significantly better results. For example, if our development set error is 10% and the error due to incorrect labels is 0.6%, then it may be most worthwhile to focus on the other 9.4% first. If we later get the development set error down to 2%, while still having 0.6% error due to incorrect labels, then it becomes worthwhile to correct the labels. If we go in and adjust the development set labels for the misclassified images, then it may also be worthwhile to look for incorrectly labeled images among the images our network classified as correct.

4.5 Mismatched training and dev set

Figure 4.5.1: Training data (200,000 images) could vary significantly from the data our model sees once deployed (10,000 images uploaded by users).

After developing a model, we may test it in the real world and find that the data uploaded by users varies significantly from the data we trained our model on. Consider the data in Figure 4.5.1: we have 200,000 images in our training set and 10,000 images from users, each distributed differently. We also know that we want to optimize performance on the data uploaded by users as best as possible, although 10,000 images will not work well as an entire training set.

One way we could adjust our sets would be to combine the training and uploaded data into a single large array, shuffle it, and break the data up into training, development, and test sets. Suppose we set our training set size to 205,000 and our development and test set sizes each to 2,500. If we broke the data up randomly, we would expect only around 119 images in each of the development and test sets to come from the data uploaded by users.

Instead, since our primary goal is to optimize for the data uploaded by users, we could again use sizes of 205,000 for the training set and 2,500 for both the development and test sets, but this time make up the development and test sets entirely from the uploaded user data. While the data is not split evenly, we tend to prefer this distribution because it focuses us on optimizing the right goal.

4.5.1 Bias and variance with mismatched data distributions

When the types of data are distributed differently between our training set and development set, we may see a model with a low training error and a relatively high development error even without having a variance problem. Instead of directly comparing results against the development set, we introduce a training-development set, which has the same distribution as the training set, although we do not actually train on it. Now we can look at the training set error, the training-development set error, and the development set error, where we determine the variance from the difference between the training set error and the training-development set error. If we have both a low training set error and a low training-development set error while our development set error is high, then we say we have a data mismatch problem.

4.5.2 Addressing data mismatch

With a data mismatch problem, it may first be beneficial to look at where the errors in the development set are coming from. If we find, for example, that our errors are due to blurry images, then we may want more training examples of blurry images. Instead of actually going out and collecting more data, we could use artificial data synthesis, which in this case would take crisp images and apply a blurring filter to them.

As another example, if we are building a trigger word detector and have a collection of low-noise voice recordings, but find that users' data once our model is deployed tends to have a lot of noise, then we can artificially add car noise. One thing to keep in mind is that if we have far more voice recordings than car noise clips, adding the same car noise to every audio clip may cause our model to overfit to that car noise.

4.6 Learning from multiple tasks

4.6.1 Transfer learning

Transfer learning is the process of taking a neural network that has been trained for one task and using it as a starting point for training on another. For instance, if we have trained an image classification system, we may be able to transfer the low-level features in the neural network, which typically detect edges and curves, and use them for radiology diagnosis.

When changing the last layer of a neural network, as shown in Figure 4.6.1, we should initialize the new weight and bias terms randomly. We are also able to add more layers to the network if necessary, and are not restricted to simply swapping out the last layer when using transfer learning.

Figure 4.6.1: Disconnecting the end of a neural network and transferring each trained weight and bias to another, similar problem.

In our image-classification-to-radiology-diagnosis example, we say that the image classification data is the pre-training data, while the radiology diagnosis data is the fine-tuning data.

We will often change the number of layers that are retrained in a network based on the amount of fine-tuning data we have available. With more fine-tuning data, it is beneficial to retrain more layers, while with only a little fine-tuning data we rely more heavily on our pre-training data. It is also common to freeze the representations of earlier layers in the network and only learn the disconnected, or latter, part of the model. With plenty of data, we may use the pre-trained model merely as our initialization and retrain the entire network.
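As a rough sketch of this workflow, the snippet below freezes a pre-trained network and swaps in a new last layer. It uses PyTorch and torchvision purely for illustration (neither is used elsewhere in these notes), and num_classes is a hypothetical label count for the fine-tuning task:

```python
import torch.nn as nn
from torchvision import models

num_classes = 3  # hypothetical number of labels in the fine-tuning task

# Pre-trained network; its early layers already detect edges and curves.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained representations...
for param in model.parameters():
    param.requires_grad = False

# ...and swap in a new, randomly initialized last layer for the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```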

4.6.2 Multi-task learning

For multi-task learning, suppose that we are building a self-driving car and need to classify pedestrians, cars, stop signs, and traffic lights. Instead of building four separate neural networks for these tasks, we can use a single neural network with multi-hot encoding, where our outputs are labeled with

y^{(i)} = [pedestrian?, car?, stop sign?, traffic light?]^T (4.6.1)

and the output layer a^{[L]} has four units. If some of our labels do not have all the information filled in, for instance if it is unknown whether there is a stop sign or traffic light in the image, then we can have our loss function skip the sum over those values. Multi-task learning is different from softmax regression because each image can have multiple labels.

Multi-task learning works best with tasks that share low-level features, relatively similar

data, and when we can use a large enough neural network.
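A minimal sketch of a multi-task loss that skips unknown labels, assuming (as an illustration, not the course's convention) that missing entries are stored as NaN:

```python
import numpy as np

def multitask_loss(y_hat, y):
    """Sum logistic losses over the (4, m) labels that are actually known."""
    known = ~np.isnan(y)                     # skip entries with missing labels
    yh, yt = y_hat[known], y[known]
    return -np.sum(yt * np.log(yh) + (1 - yt) * np.log(1 - yh))

y_hat = np.array([[0.9], [0.2], [0.7], [0.1]])    # one training example
y = np.array([[1.0], [0.0], [np.nan], [np.nan]])  # stop sign / light unknown
print(multitask_loss(y_hat, y))                   # ~0.33
```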

Chapter 5

Convolutional neural networks

With our current approach to image classification, where we take raw pixels as our input, we will quickly find that larger images, or images with different aspect ratios, make our neural network impractical. Consider an image of size 1000 × 1000 × 3, which would make our input 3 million values. With such a large input, most models are too computationally expensive to run and overfitting is prone to occur. Instead, we will introduce and use convolutional neural networks in this chapter.

5.1 Edge detection

From the field of image processing, we have long been able to apply filters to images. We represent different filters as matrices or tensors. Applying a filter to a set of pixels can give us properties of an image, including where the edges are. When we apply a filter to a matrix of the same size, the result is the sum of the element-wise product between the filter and the matrix.

With images, the matrix that represents the pixels is usually larger than the filter. In cases where our matrix is larger than the filter, we can apply a convolution operation, which results in a matrix containing the result of every possible placement of the complete filter over the image. It is typical to denote the convolution operator as ∗, which is not to be confused with the element-wise operator we defined with the same notation earlier.

We will denote filters to be of size f × f and our image matrix to be of size n × n. After convolving a filter across our image matrix, we will end up with a matrix of size (n − f + 1) × (n − f + 1). An example of a convolution can be found in Figure 5.1.1.

Figure 5.1.1: A convolution over a matrix takes the sum of the element-wise product between a filter (here a 3 × 3 edge detector) and each same-sized patch of the matrix. The value from the convolution between the filter and the upper left-hand corner of the matrix corresponds to the top left entry of the result.

With different filters, we can find different properties of our image. When looking at results after applying an edge detection filter, gray represents no edge detected, black represents positive values from our filter, and white represents negative values from our filter, as shown in Figure 5.1.2. Some of the most common filters used for edge detection are shown in Figure 5.1.3.

Figure 5.1.2: Resulting images after applying a horizontal and a vertical filter: (a) original image, (b) horizontal edge detector, (c) vertical edge detector.

(a) Standard vertical filter: [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
(b) Sobel vertical filter: [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
(c) Scharr vertical filter: [[3, 0, -3], [10, 0, -10], [3, 0, -3]]
(d) Standard horizontal filter: [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
(e) Sobel horizontal filter: [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
(f) Scharr horizontal filter: [[-3, -10, -3], [0, 0, 0], [3, 10, 3]]

Figure 5.1.3: Different types of horizontal and vertical filters that are used for edge detection
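As a sketch of the operation just described (really cross-correlation, as Section 5.4 will note), here is a plain NumPy implementation of an unpadded convolution using the standard vertical filter:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Each output entry is the sum of an element-wise product (no padding)."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
print(conv2d_valid(np.random.rand(6, 6), vertical).shape)  # (4, 4): (n-f+1, n-f+1)
```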

5.2 Padding

If we stack multiple convolution operations back to back, then we will end up with a signifi-

cantly smaller final outputted image than we started with. One way to combat the changing

size would be by adding padding to the image. When we add padding to the image, we

typically add a border of all 0s as shown in Figure 5.2.1.

Figure 5.2.1: Adding zero padding to our image can produce an output of the same size as our input.

We add padding uniformly to each side of the image, and the amount of padding we add is denoted by p. With padding, the output of our convolution will be of size (n + 2p − f + 1) × (n + 2p − f + 1). If we want our output to be equal in size to our input, then we can solve n + 2p − f + 1 = n for p and find

p = (f − 1) / 2. (5.2.1)

When the output size matches the input size, we call the convolution a same convolution, and when we do not add padding, we say that we have a valid convolution. It is also most common for f to be odd, because that allows the filter to have a distinct center pixel.

5.3 Strided convolutions

Figure 5.3.1: Filter placements that run over the side of the image are invalid; we only use placements that fill a complete f × f grid inside the image.

With convolutions, we are also able to take stride lengths larger than 1. Starting from the top left corner, when moving right or down we move the filter s spots at a time. With strides, we only count filter placements that lie entirely inside the boundary of our (padded) image, as shown in Figure 5.3.1. The size of our output will therefore be

⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋. (5.3.1)
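Equation 5.3.1 is easy to check numerically; a small helper, under the notation above:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input (Equation 5.3.1)."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3
print(conv_output_size(6, 3, p=1, s=1))  # 6, a "same" convolution
```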

5.4 Cross-correlation vs convolution

In image processing, the convolution operator actually mirrors the filter horizontally and vertically, as shown in Figure 5.4.1. The method we have been using thus far, where we do not mirror the filter, is known in image processing as cross-correlation. However, it is most common in computer vision to use the word convolution in place of cross-correlation, which is what we will continue to do.

Figure 5.4.1: With a traditional convolution operation, we first flip the filter horizontally and vertically; for example, [[1, 2, 3], [4, 5, 6], [7, 8, 9]] becomes [[9, 8, 7], [6, 5, 4], [3, 2, 1]].

5.5 Convolutions over volume

For colored images, we have three channels (red, green, and blue) and can define a filter for each channel, as shown in Figure 5.5.1. For filters with height and width f, our overall filter will have f × f × 3 parameters. To convolve over volume, we apply the front filter only to the red channel, the second filter only to the green channel, and the third filter only to the blue channel. On each channel, we carry out the convolution process described earlier, and the three results are summed into a single 2D output.

Previously, we saw that each filter may look for a different property of an image. We can also apply multiple filters at once by stacking the resulting matrices from each convolution into a tensor. We will denote the number of filters as n′_c, so our output will be n′_c layers deep, as shown in Figure 5.5.2.

Figure 5.5.1: Convolution of an RGB image.

Figure 5.5.2: Using n′_c filters produces an output that is n′_c layers deep, with each layer corresponding to the result of one convolved filter.

5.6 One layer convolution network

For convolutional neural networks, we treat the matrix created by the convolution, added to a bias term b, as our linear activation z. Then we put each element through a non-linear activation, as shown in Figure 5.6.1. Each entry of the filter tensor is treated as a weight.

Figure 5.6.1: One layer of a convolutional neural network: the input a^{[0]} is convolved with the filters W^{[1]}, a bias b^{[1]} is added to form z^{[1]}, and a ReLU produces a^{[1]}.
For the lth layer of a convolutional neural network, we use the following notation (the sketch after this list evaluates it for an example layer):

- f^{[l]} = filter size
- p^{[l]} = padding
- s^{[l]} = stride
- n_c^{[l]} = number of filters
- Input size: n_h^{[l−1]} × n_w^{[l−1]} × n_c^{[l−1]} (height × width × channels of the input)
- n_h^{[l]} = ⌊(n_h^{[l−1]} + 2p^{[l]} − f^{[l]}) / s^{[l]} + 1⌋
- n_w^{[l]} = ⌊(n_w^{[l−1]} + 2p^{[l]} − f^{[l]}) / s^{[l]} + 1⌋
- Each filter has size f^{[l]} × f^{[l]} × n_c^{[l−1]}
- Output (activation) size: a^{[l]} is n_h^{[l]} × n_w^{[l]} × n_c^{[l]}
- Size of weights: f^{[l]} × f^{[l]} × n_c^{[l−1]} × n_c^{[l]}
- Size of bias: n_c^{[l]}
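A sketch that evaluates these formulas for a hypothetical layer (a 39 × 39 × 3 input with ten 3 × 3 filters, no padding, stride 1):

```python
from math import floor

def conv_layer_shapes(n_h, n_w, n_c_prev, f, p, s, n_c):
    """Shapes implied by the notation above for one convolutional layer."""
    out_h = floor((n_h + 2 * p - f) / s) + 1
    out_w = floor((n_w + 2 * p - f) / s) + 1
    return {"weights": (f, f, n_c_prev, n_c),
            "bias": (n_c,),
            "activation": (out_h, out_w, n_c)}

print(conv_layer_shapes(39, 39, 3, f=3, p=0, s=1, n_c=10))
# {'weights': (3, 3, 3, 10), 'bias': (10,), 'activation': (37, 37, 10)}
```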

5.7 Pooling layers

Pooling layers give us a way to shrink the size of our input while still capturing the essential details. The primary types of pooling are average pooling and max pooling. Pooling layers have both a filter size and a stride, although there are no weights to be learned. Instead, pooling applies a fixed function to each f × f frame of the input it covers, moving with stride s. For 2D max pooling, the output is the maximum number inside the frame; for 2D average pooling, the output is the average of all the numbers inside the frame.

For example, with the 4 × 4 input

5 6 1 2
1 3 2 3
2 9 1 1
1 3 2 1

max pooling outputs [[6, 3], [9, 2]] and average pooling outputs [[3.75, 2.00], [3.75, 1.25]].

Figure 5.7.1: 2D pooling with the hyperparameters f = 2 and s = 2: (a) max pooling, (b) average pooling.

For 3D pooling, the number of channels in the input matches the number of channels in the output. Each channel is pooled independently: for the ith channel of the output, we take the pool of the filter over the ith channel of the input.

Max pooling is used most commonly because it generally produces better results. Some people hypothesize that max pooling performs better because each frame captures whether a certain aspect of the image is present, while average pooling can be diluted by a lot of data.

In the deep learning literature, we combine convolution and pooling steps into a single layer because pooling does not have any representations to be learned.
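A NumPy sketch of 2D max pooling matching Figure 5.7.1; note that there are no weights, only the hyperparameters f and s:

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Take the maximum over each f x f frame, moving with stride s."""
    out = (x.shape[0] - f) // s + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return pooled

x = np.array([[5, 6, 1, 2],
              [1, 3, 2, 3],
              [2, 9, 1, 1],
              [1, 3, 2, 1]])
print(max_pool2d(x))  # [[6. 3.] [9. 2.]], matching Figure 5.7.1a
```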

5.8 Why we use convolutions

As a thought experiment, consider a single-layer neural network with an input of size 32 × 32 × 3 and an output of size 28 × 28 × 6. If we used a fully-connected network, we would have around 14 million representations to learn, whereas if we use a convolutional neural network with a filter size of 5 × 5, we only need to learn 156 representations.
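Both counts are simple arithmetic; the convolutional count follows the lecture's convention of 25 weights plus one bias per filter:

```python
fc_params = (32 * 32 * 3) * (28 * 28 * 6)  # fully connected: 14,450,688 weights
conv_params = (5 * 5 + 1) * 6              # six filters, 25 weights + 1 bias each
print(fc_params, conv_params)              # 14450688 156
```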

With fewer representations, we can take advantage of parameter sharing and sparsity of connections. Parameter sharing allows us to detect features at any location inside the image. Sparsity of connections means that each output depends on only a few input values.

5.9 Classic networks

Since choosing the architecture for a neural network is often an empirical task, we can look at other people's models to make better decisions about our own, primarily by reading their research. We will go over how LeNet-5, AlexNet, VGG-16, ResNet, and Inception work.

5.9.1 LeNet-5

Figure 5.9.1: LeNet-5 architecture (LeCun et al., 1998): 32×32 input → convolutions to C1 (feature maps 6@28×28) → subsampling to S2 (6@14×14) → C3 (16@10×10) → S4 (16@5×5) → full connections to C5 (120) and F6 (84) → Gaussian connections to a 10-way output.

LeNet-5, named after the paper's lead author Yann LeCun, was introduced in 1998 with the primary goal of recognizing characters. LeNet-5 uses average pooling, which was more widely used at the time of publication. The model contains around 60 thousand representations, which is now considered a small amount. Among other things, this architecture established the pattern of shrinking the spatial size while increasing the number of channels as we move deeper into the network, and of following convolutional layers with pooling layers.

5.9.2 AlexNet

Figure 5.9.2: AlexNet architecture (Krizhevsky et al., 2012).

AlexNet looks quite similar to LeNet-5, although it has around 60 million parameters in total and used the ReLU activation function. AlexNet significantly sparked a surge of research in artificial intelligence when published in 2012, because its results far outperformed the state-of-the-art approaches of the time.

5.9.3 VGG-16

Figure 5.9.3: VGG-16 architecture. From a 224×224×3 input, convolution-with-ReLU blocks produce volumes of 224×224×64, 112×112×128, 56×56×256, 28×28×512, 14×14×512, and 7×7×512, with max pooling between them, followed by fully connected layers with ReLU of size 4096×1 and a 1000×1 softmax output.

The VGG-16 architecture was introduced in a 2015 paper by Karen Simonyan and Andrew Zisserman. In it, all convolutions are same convolutions with a filter size of 3 × 3 and a stride of 1, and each max pooling layer has a filter size of 2 × 2 with a stride of 2. This architecture is on the larger side of modern neural networks, with around 138 million representations in total, and it stands out primarily for its simple hyperparameter choices that turn out to work well.

5.9.4 ResNet

A residual network (ResNet) gives us a way to skip connections inside a neural network. With a plain network, our path from a^{[ℓ]} to a^{[ℓ+2]} is what is shown in Figure 5.9.4.

Figure 5.9.4: Plain network: a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → ReLU → a^{[ℓ+2]}.

However, with a ResNet, we use what is known as a residual block, which sets

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]}), (5.9.1)

where a^{[ℓ]} skips ahead over the intermediate connections. The path of a residual block is shown in Figure 5.9.5.

Figure 5.9.5: Residual block: a^{[ℓ]} → linear → ReLU → a^{[ℓ+1]} → linear → ⊕ → ReLU → a^{[ℓ+2]}, where ⊕ adds the shortcut a^{[ℓ]}.

As shown in Figure 5.9.6, residual networks keep working well as networks get very deep, whereas plain networks often suffer from weight decay (weights shrinking toward zero).

Figure 5.9.6: The error rate for plain networks (left) and ResNets (middle, right) as the number of layers in the network changes (He et al., 2015).

The reason why ResNets work with more layers can be seen by expanding the right-hand side of Equation 5.9.1:

a^{[ℓ+2]} = g(z^{[ℓ+2]} + a^{[ℓ]}) (5.9.2)
         = g(W^{[ℓ+2]} a^{[ℓ+1]} + b^{[ℓ+2]} + a^{[ℓ]}). (5.9.3)

Notice that when weight decay occurs in a plain neural network, so that W^{[ℓ+2]} ≈ 0, and we are using a ReLU activation function, we come close to having

a^{[ℓ+2]} = g(a^{[ℓ]}) = a^{[ℓ]}.

Therefore, adding more layers to a ResNet may have no effect on the activation if weight decay occurs, or the weights can add additional information that decreases the error rate.
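A minimal NumPy sketch of the residual block in Equation 5.9.1, using fully connected layers for simplicity rather than convolutions; it also shows the degenerate case where the weights have decayed to zero and the block passes a^{[ℓ]} through unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(z[l+2] + a[l]) with g = ReLU (Equation 5.9.1)."""
    a_l1 = relu(W1 @ a_l + b1)   # a[l+1]
    z_l2 = W2 @ a_l1 + b2        # z[l+2]
    return relu(z_l2 + a_l)      # the skip connection adds a[l] before the ReLU

a = np.random.rand(4)            # a[l] (non-negative, as a ReLU output would be)
W, b = np.zeros((4, 4)), np.zeros(4)
print(residual_block(a, W, b, W, b))  # decayed weights: the block returns a[l]
```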

Figure 5.9.7: How a 34-layer ResNet compares to a 34-layer plain network and VGG-19, which is a modified version of VGG-16 (He et al., 2015).

5.9.5 1 × 1 convolution

While a 1 × 1 convolution would simply scale each value of a 2D input, with 3D inputs we can do much more. For a 1 × 1 convolution, the depth of the filter is equal to the depth of the input, and each output value comes from the dot product between one pixel position across the input channels and the filter. The output always has the same height and width as the input, and with multiple filters we add depth to the output, as shown in Figure 5.9.8. 1 × 1 convolutions may also be referred to as a network in network (Lin et al., 2013).

Figure 5.9.8: 1 × 1 convolution.

5.9.6 Inception network

The primary goal of the inception network is to let the model learn the best hyperparameter choices for each layer of the network. Instead of manually choosing a filter size, the model uses three different filter sizes (1 × 1, 3 × 3, and 5 × 5), and each layer can also learn max pooling. To pick the best hyperparameters, the model concatenates the results from convolving with the different filter sizes and from max pooling into a single output, as shown in Figure 5.9.9a.

As we go deeper into the network, we often run into a large number of channels with relatively small spatial dimensions. Consider, for example, a layer whose input is of size 28 × 28 × 192, with a same convolution that has a 5 × 5 × 192 filter size and 32 filters. In this scenario, our output will be of size 28 × 28 × 32, and each output value needs 5 × 5 × 192 multiplications, which equates to around 120 million in total. Even with modern computers, 120 million multiplications take a substantial amount of time.

Instead, if we first pass the 28 × 28 × 192 input into a 1 × 1 × 192 convolution with 16 filters and then pass that result into a 5 × 5 × 16 same convolution with 32 filters, we end up with a similar result at far lower computational cost: around 12 million total multiplications using the dimension reduction just proposed, which is shown in Figure 5.9.9b. Also notice in the dimension reduction figure that we pass max pooling into a 1 × 1 convolution. This is because a max pooling filter cannot shrink the number of channels; instead, we shrink the number of channels with a 1 × 1 convolution, where the number of filters corresponds to the output channel count.
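The two multiplication counts follow directly, since each output value costs one multiplication per weight in the filter that produced it:

```python
direct = 28 * 28 * 32 * (5 * 5 * 192)         # 5x5x192 conv, 32 filters
bottleneck = (28 * 28 * 16 * 192              # 1x1x192 conv, 16 filters
              + 28 * 28 * 32 * (5 * 5 * 16))  # then 5x5x16 conv, 32 filters
print(direct, bottleneck)                     # 120422400 12443648
```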

Figure 5.9.9: Inception layer (Szegedy et al., 2015). (a) Naïve version: the previous layer feeds 1 × 1, 3 × 3, and 5 × 5 convolutions and 3 × 3 max pooling, whose outputs are concatenated. (b) Dimension reduction: 1 × 1 convolutions are inserted before the 3 × 3 and 5 × 5 convolutions and after the max pooling.
The entire original inception network is shown in Figure 5.9.10; it was named GoogLeNet after the popular LeNet architecture.

Figure 5.9.10: GoogLeNet architecture (Szegedy et al., 2015), built from stacked inception modules (Conv, MaxPool, and DeptConcat blocks), with two auxiliary softmax outputs partway through the network and a final softmax output.
Figure 5.9.11: Meme cited in the original inception paper as the reasoning for the name inception network.

As a fun fact, the inception paper cited the popular meme in Figure 5.9.11, which appeared in the movie Inception, as the reasoning for the network's name.

5.10 Competitions and benchmarks

To beat state-of-the-art models on a specific benchmark in competitions, a few methods are employed that are rarely used in practice, including ensembling and multi-crop at test time. Ensembling is when we run each instance of the test set through a range of models, usually around 3-15, and use either the prediction that occurs most often (classification) or the average prediction (regression). Ensembling often performs better, but it introduces a significant computational cost. The other method employed at test time is multi-crop, which averages the outputs from different crops of the image. One common multi-crop scheme is known as 10-crop and is shown in Figure 5.10.1.

Figure 5.10.1: 10-crop includes mirroring an image and taking different crops of both the image and its mirror.

Chapter 6

Detection algorithms

6.1 Object localization

Figure 6.1.1: Classification vs. localization vs. detection: (a) image classification, (b) classification with localization, (c) detection.

Previously, we have only worked on image classification, which assumes exactly one object is present in an image and determines what that object is. Next, we will go into classification with localization, which also determines where the object is located in the image, as shown in Figure 6.1.1b. We will also go over object detection, which allows us to classify multiple occurrences of the same or different classes in a single image, as shown in Figure 6.1.1c.

With classification with localization, we are going to assume there is only one object in each image. As an example, we will walk through how a self-driving car could localize a pedestrian (class 1), car (class 2), or motorcycle (class 3), or, if no object exists, a background (class 4). We can then change the output of the neural network from a softmax over 4 categories to also include a bounding box with a center point, height, and width. The center of the bounding box will be denoted (b_x, b_y), and the height and width will be b_h and b_w, respectively. For now, we will use square images and label the top left corner as (0, 0) and the bottom right corner as (1, 1). The entire labeling of our image is shown in Figure 6.1.2.

Figure 6.1.2: Our labeling for localization: the image spans (0, 0) in the top left to (1, 1) in the bottom right, and the box has center (b_x, b_y), height b_h, and width b_w.

We will also need to adjust our supervised learning labels to match the output of our model. We define the output as

y = [P_c, b_x, b_y, b_h, b_w, car?, pedestrian?, motorcycle?]^T, (6.1.1)

where P_c is the probability that an object of class 1, 2, or 3 is present. A few labeled examples are shown in Figure 6.1.3.

Figure 6.1.3: Our labels for different images. (a) Car label: y = [1, 0.5, 0.7, 0.2, 0.5, 1, 0, 0]^T. (b) Background label: y = [0, None, None, None, None, None, None, None]^T, where None means the value does not matter.
We will also define our loss function to be a squared error loss:

L(ŷ, y) = Σ_{i=1}^{8} (ŷ_i − y_i)² if y_1 = 1, and (ŷ_1 − y_1)² if y_1 = 0, (6.1.2)

where the 8 comes from the size of our label.

6.2 Landmark detection

Landmark detection works to find specific points in an image. Recently, landmark detection has been used to determine human emotions, determine the orientation of a moving body, and apply Snapchat filters. In order to create a Snapchat face filter, like the one shown in Figure 6.2.1c, we can train a neural network to determine specific landmark locations inside an image. To determine landmark locations on a face, we first need to classify a face and then label each landmark's location in our training data, as shown in Figure 6.2.2, where each label with n landmarks takes the form

y = [face?, ℓ_{1x}, ℓ_{1y}, …, ℓ_{nx}, ℓ_{ny}]^T, (6.2.1)

where (ℓ_{ix}, ℓ_{iy}) is the xy coordinate of the ith landmark. We must also make sure that the ith landmark means the same thing in every image. For example, if the first landmark is the leftmost part of somebody's left eye in one image, then it must also be the leftmost part of the person's left eye in every other image.

Figure 6.2.1: Applications for landmark detection: (a) emotion classification, (b) body tracking, (c) Snapchat face filters.

Figure 6.2.2: Adding a specific numbered landmark to each part of the image we are interested in.

6.3 Object detection

Object detection focuses on locating multiple objects in a given image. A naïve approach to object detection is the sliding windows detection algorithm. We will start with a training set of tightly cropped images, as shown in Figure 6.3.1. Then, we create a window and slide it across our image. At each location, only the pixels that the window covers are passed into a convolutional neural network, in the hope that some window will perfectly capture the location of an object. We then change the size of the window and repeat the sliding process, as shown in Figure 6.3.2.

With the sliding windows detection algorithm just presented, we run into a high computational cost in order to have an accurate detector. Further, improving the algorithm by making the stride length shorter and testing more window sizes only increases the cost.

Figure 6.3.1: Sliding windows detection training set: tightly cropped images labeled 1 (car) or 0 (no car).

6.4 Sliding windows with convolution

To reduce the computational cost of our sliding windows detection algorithm, let us use a convolutional implementation, where the output corresponds to sliding the window across each position. Consider taking a 3 × 3 matrix and convolving it twice with a 2 × 2 filter, which produces a 1 × 1 output, as shown in Figure 6.4.1a. Now consider adding a row and a column to the end of our input matrix, increasing it to a size of 4 × 4, and again convolving twice with a 2 × 2 filter. What we end up with is a 2 × 2 output matrix, where the upper left entry is equivalent to the two convolutions over the 3 × 3 input matrix, as shown in Figure 6.4.1b. Similarly, the top right entry is the result of the two convolutions over the 3 × 3 sub-matrix obtained by dropping the first column and the last row.

Using the method just introduced, we have a way to implement sliding windows with a convolution. Taking this method a step further, we can also adjust the stride by adding a max pooling layer, whose filter size determines the effective stride. Further, if we want to learn a more complex function, we can add 1 × 1 convolutional layers to the end of our network that act like fully connected layers, as shown in Figure 6.4.2.

Figure 6.3.2: The sliding windows detection algorithm chooses different window sizes and slides each window across the entire image. At each slide, the area that the window covers is fed into a convolutional neural network in order to try to classify the image.

Figure 6.4.1: Applying the same model to inputs of different size: (a) a 3 × 3 input, (b) a 4 × 4 input, each convolved twice with an f = 2 filter. The dark portion of (b) is equivalent to the model in (a).

Figure 6.4.2: Convolutional neural network that determines whether a window image is one of four categories: pedestrian, car, motorcycle, or background. Convolution-with-ReLU and max pooling (f = stride) layers are followed by 1 × 1 convolutions that act like fully connected layers and a softmax; the output volume is #(vertical windows) × #(horizontal windows) × #(classes).

6.5 Bounding box predictions

Figure 6.5.1: Using bounding boxes on an image containing cars, with the corresponding labels y = [1, 0.1, 0.5, 1, 1.3, 0, 1, 0]^T and y = [1, 0.4, 0.9, 1.9, 1.2, 0, 1, 0]^T. We are using the labeling scheme from Equation 6.1.1.

Currently, we have only seen guess-and-check algorithms for choosing a window and stride size. Instead, it would be nice to determine object detection with only one pass through a convolutional neural network. Let us consider breaking the image into a grid and treating each grid cell as its own image that we have to label. If an object we are trying to identify has its center point in a grid cell, then we label P_c in that cell as 1. Cells that contain no objects, or only non-center points of objects, have P_c = 0 with None in all of the other values. When labeling images relative to a grid cell's center point, notice that our constraints are 0 ≤ b_x, b_y ≤ 1 and b_h, b_w ≥ 0. With our 3 × 3 grid, the output needs to be 3 × 3 × 8, because each grid cell has its own label with 8 entries. Therefore, we can construct a convolutional neural network, like the one shown in Figure 6.4.2, that takes our image as input and outputs a 3 × 3 × 8 tensor. The way we have set up bounding box prediction is known as the YOLO (You Only Look Once) algorithm, and we will continue discussing it after introducing a few more ideas that play a key role.

6.6 Intersection over Union (IoU)

One common way to assess whether a bounding box prediction is accurate is to take the intersection of the labeled bounding box and the predicted bounding box and divide it by the union of the two boxes, known as IoU. Typically, if IoU ≥ 0.5, we say that our model's bounding box is correct; however, 0.5 is just a conventional choice and is sometimes changed.
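A sketch of IoU for axis-aligned boxes, assuming a hypothetical (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.143, below the usual 0.5 cutoff
```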

6.7 Non-max suppression

When using smaller grid sizes for bounding box prediction, it becomes more likely that neighboring cells will predict boxes referring to the same object, as shown in Figure 6.7.1a. To start, we first remove any boxes that are uncertain about containing a class; typically, we may remove all boxes that have P_c ≤ 0.6.

Figure 6.6.1: Taking the intersection over union of two bounding boxes to assess a prediction's correctness.

Figure 6.7.1: Predicted bounding boxes using a 19 × 19 grid, along with the confidence that a non-background class exists inside each box: (a) before non-max suppression (overlapping boxes with confidences 0.9, 0.8, 0.7 and 0.95, 0.8, 0.45), (b) after non-max suppression (only the 0.9 and 0.95 boxes remain).

With a way to calculate the similarity between two bounding boxes (IoU), we can use it to determine whether two boxes refer to the same object. Starting with the box that has the highest probability P_c that a class exists inside it, we keep that box and then remove any other box of the same class prediction that has an IoU ≥ 0.5 with the box in focus. Thus, we effectively remove the non-maximum boxes from the prediction. The complete algorithm for non-max suppression is shown in Algorithm 8.

Algorithm 8 Non-max suppression

Input: boxes ← list of all bounding boxes
Initialize: finalBoxes ← empty list
Remove any box from boxes with Pc ≤ 0.6
while boxes length > 0 do
    maxBox ← box with max Pc from boxes
    Append maxBox to finalBoxes
    Remove maxBox from boxes
    for each box in boxes do
        if IoU(box, maxBox) ≥ 0.5 then
            Remove box from boxes
        end if
    end for
end while
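A direct Python translation of Algorithm 8, reusing the iou helper sketched in Section 6.6 and assuming each box is a hypothetical (P_c, x1, y1, x2, y2) tuple for a single class:

```python
def non_max_suppression(boxes, p_min=0.6, iou_max=0.5):
    """boxes: list of (p_c, x1, y1, x2, y2) tuples for one class."""
    boxes = [b for b in boxes if b[0] > p_min]       # drop uncertain boxes
    final_boxes = []
    while boxes:
        max_box = max(boxes, key=lambda b: b[0])     # highest-confidence box
        final_boxes.append(max_box)
        boxes = [b for b in boxes if b is not max_box
                 and iou(b[1:], max_box[1:]) < iou_max]
    return final_boxes
```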
6.8 Anchor boxes

Figure 6.8.1: The midpoints of two objects, a pedestrian and a car, are located in the same grid cell.

With our current implementation for labeling each grid cell, we have no way to label two separate classes whose midpoints occur in the same cell. For example, Figure 6.8.1 shows the center points of a pedestrian and a car occurring in the center grid cell. We will start by defining a set number of distinct anchor boxes, as shown in Figure 6.8.2. The sizes of the anchor boxes can be determined by a k-means clustering algorithm that looks for groupings of bounding box shapes within each class.

Figure 6.8.2: Defining multiple distinct anchor boxes: (a) anchor box 1, (b) anchor box 2.
Now, for each label in our training data, we will use

y = [P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]^T, (6.8.1)

where the first eight entries correspond to anchor box 1 and the last eight entries to anchor box 2.

Then, for each grid cell, we use the IoU to determine which anchor box is closest to each labeled bounding box. In Figure 6.8.1, the pedestrian would be assigned to anchor box 1 and the car to anchor box 2. We then fill in the output with the labels from each bounding box. With our 3 × 3 grid, the output size will be 3 × 3 × 16, or 3 × 3 × (2 × 8), since there are two anchor boxes and each anchor box has 8 entries. When using two anchor boxes, our prediction gives us 2 bounding boxes in each grid cell.

We could always define more anchor boxes in practice, although with a larger grid size it is less likely for collisions in anchor boxes to occur. If more objects share the same grid cell than we have anchor boxes, or if two labeled bounding boxes have the same shape, our algorithm will not perform well.

6.9 YOLO algorithm

The YOLO (You Only Look Once) algorithm was introduced in 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. The algorithm combines bounding box predictions with anchor boxes and non-max suppression. Needing only one pass through the neural network, YOLO beat the running time of all previous state-of-the-art object detection algorithms.

Chapter 7

Sequence models

Sequence models allow us to work with variable-length inputs or variable-length outputs, which greatly expands the set of problems we are able to tackle. For example, sequence models are used for speech recognition, where given a variable-length audio clip we transcribe the spoken words; music generation, where we can input a set of data and output a sequence of music; sentiment classification, where we take in a review sentence and output a star rating; DNA sequence analysis, where we take a DNA sequence and label the part of the sequence corresponding to, say, a protein; machine (or language) translation, where we are given a sentence in one language and must convert it into another; video activity recognition, where we are given a set of frames and want to output the activity; and named entity recognition, where we are given a piece of text and want to find the names within it.

As an example, suppose we would like to build a sequence model for named entity recognition with the input

x: Harry Potter and Hermione Granger invented a new spell. (7.0.1)

We will denote the tth word of the input as x^{<t>}, starting at index 1.

For the output y, we will use multi-hot encoding, where each word is labeled with a 1 or a 0 as follows:

y: 1 1 0 1 1 0 0 0 0. (7.0.2)

The tth element of the output will similarly be referred to as y^{<t>}. The length of the input sequence will be denoted by T_x and the length of the output sequence by T_y. In this case, both T_x and T_y are equal to 9, although that is not always the case. If we want to refer to the tth word of the ith training example, we will write x^{(i)<t>}. Since the lengths of the input and output sequences can vary, we will use T_x^{(i)} and T_y^{(i)} to denote the length of the ith input and output, respectively.

To represent words, we will need a vocabulary list of all the possible words that are likely to occur. For modern applications, this could be around 50,000 words in total. We will keep the vocabulary list sorted in ascending order. Then, to store x^{<t>}, we can use one-hot encoding, where we place a 1 in the index of the vocabulary list that stores the word we are looking at, and a 0 everywhere else. We will talk about unknown words, or words that are not in our vocabulary list, soon.

7.1 Recurrent neural networks

The problem with using a standard neural network is that the input and output lengths can differ across samples of our data. If we tried to set a maximum input length, padding with 0 for each element that is not needed, then we would not only greatly diminish the number of problems we can tackle, but we would also waste a great deal of memory in the network. A plain neural network also does not take advantage of shared features in sequential data, similar to how convolutional neural networks took advantage of using the same filters over different regions of an image.

With recurrent neural networks, we build a model that takes as input the current time step t and all of the previous time steps in order to compute the output y^{<t>}. At each time step, the model is parameterized by the shared weights W_{ax}, W_{aa}, and W_{ya}, as shown in Figure 7.1.1. A big problem with this type of representation is that we are only looking at the previous time steps. Suppose we have the following sentences as input:

He said, “Teddy Roosevelt was a great President.”
He said, “Teddy bears are on sale!” (7.1.1)

In both of these cases, the first three words are identical; however, only in the first case does the word “Teddy” represent a name. We will delve into this topic soon when we look at bidirectional RNNs (BRNNs); to understand the basics of RNNs, however, we will stick with our unidirectional RNN.

For now, we will set a^{<0>} = 0, which is typical in many cases. To compute the first activation and output in the recurrent neural network pictured in Figure 7.1.1, we set

a^{<1>} = g(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a) (7.1.2)
ŷ^{<1>} = g(W_{ya} a^{<1>} + b_y). (7.1.3)

The activation functions in Equation 7.1.2 and Equation 7.1.3 do not have to be the same. For the hidden activation, tanh and ReLU are commonly used, while sigmoid or softmax is typically used for the output activation.
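A NumPy sketch of one forward step (Equations 7.1.2 and 7.1.3), with tanh for the hidden activation and softmax for the output; all dimensions are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """a<t> = tanh(Waa a<t-1> + Wax x<t> + ba); y<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    return a_t, softmax(Wya @ a_t + by)

n_a, n_x, n_y = 5, 3, 2                        # hypothetical dimensions
params = (np.random.randn(n_a, n_a), np.random.randn(n_a, n_x),
          np.random.randn(n_y, n_a), np.zeros(n_a), np.zeros(n_y))
a = np.zeros(n_a)                              # a<0> = 0
for x_t in np.random.randn(4, n_x):            # a length-4 input sequence
    a, y_hat = rnn_step(x_t, a, *params)
```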
To generalize the previous equations, we have

a^{<t>} = g(W_{aa} a^{<t−1>} + W_{ax} x^{<t>} + b_a) (7.1.4)
ŷ^{<t>} = g(W_{ya} a^{<t>} + b_y). (7.1.5)

We will further simplify Equation 7.1.4 to

a^{<t>} = g(W_a [a^{<t−1>}, x^{<t>}] + b_a), (7.1.6)

where W_a is formed by stacking

W_a = [W_{aa} | W_{ax}] (7.1.7)

into a single matrix. We will also use the notation [a^{<t−1>}, x^{<t>}] to mean stacking a^{<t−1>} on top of x^{<t>} into a single column vector. (7.1.8) We will likewise simplify Equation 7.1.5 to

ŷ^{<t>} = g(W_y a^{<t>} + b_y). (7.1.9)

7.1.1 Backpropagation through time

We will define our loss at time step t as our familiar cross-entropy, or logistic, loss

L^{<t>}(ŷ^{<t>}, y^{<t>}) = −y^{<t>} log ŷ^{<t>} − (1 − y^{<t>}) log(1 − ŷ^{<t>}). (7.1.10)

For the entire sequence, we compute the loss as

L(ŷ, y) = Σ_{t=1}^{T_y} L^{<t>}(ŷ^{<t>}, y^{<t>}). (7.1.11)

Now, we can define the computation graph pictured in Figure 7.1.2. When backpropagating through this network, we say that we are backpropagating through time, because we move backwards through the time series.

Figure 7.1.1: The structure of a basic recurrent neural network, which predicts ŷ^{<t>} based on the current time step and all of the previous time steps. The parameters W_{aa}, W_{ax}, b_a, W_{ya}, and b_y are shared across time steps.

Figure 7.1.2: Computation graph for our recurrent neural network.

7.1.2 Variational RNNs

RNNs can be broken into different categories, as shown in Figure 7.1.3. What we have currently been working with is a many-to-many model. If we were doing sentiment classification, where we input a review sentence and output a number of stars, we would use a many-to-one model. A one-to-many model might be music generation, where we input some metadata (genre, style, etc.) and output a sequence of music. Many-to-many models also include cases where the number of inputs and the number of outputs differ, such as machine translation.

7.1.3 Language modeling

Suppose we would like to build a speech recognition system that inputs an audio clip of speech and outputs the spoken words. Looking only at the pronunciation of each word is not enough to determine the correct word that was spoken. For instance, “pair” and “pear”; “two”, “to”, and “too”; and “their”, “they're”, and “there” are all
pronounced identically, but are different words altogether. A language model may look at a set of sentences that sound identical and choose the sentence that is most likely to have occurred. For example, suppose we have the two identically spoken sentences

The apple and pair salad.
The apple and pear salad. (7.1.12)

Our language model may output something of the form

P(The apple and pair salad.) = 3.2 × 10^{−13}
P(The apple and pear salad.) = 5.7 × 10^{−10}, (7.1.13)

and thus the second sentence would be selected because it is more probable.

Figure 7.1.3: Depending on the combination of fixed or variable length input and output, we can form different types of RNNs.

To create a language model with an RNN, we need a training set consisting of a large corpus of English text. The word corpus comes from the field of natural language processing and refers to a body of text. For each sentence we are given as input, we will tokenize the words using one-hot encoding. It is also common to add an end-of-sentence token. If we are given a sentence like

The Egyptian Mau is a breed of cat. (7.1.14)

and do not have the word “Mau” in our vocabulary list, then we encode it with a 1 in the unknown-word entry of our vocabulary list and a 0 everywhere else.

Suppose we have a vocabulary list with 10k entries (excluding punctuation and including a spot for unknown words) and are working with the sentence

x: Well, hello there! (7.1.15)

With an RNN model that is trying to predict the next word, the inputs will be the ordered list of words that came before the current time step t. For the first time step, the input will be the 0 vector, and the first prediction ŷ^{<1>} will be a 10k-way softmax classification, where each word in the vocabulary list is given a probability of occurring; that is,

ŷ^{<1>} = P(<each word>) = [P(<word one>), …, P(<word 10k>)]^T. (7.1.16)

In this case, y^{<1>} is the one-hot encoding of the word “well”. For the second time step, our input will be y^{<1>} = “well”. The prediction ŷ^{<2>} will then be a conditional probability of the second word, given the first; that is,

ŷ^{<2>} = P(<each word> | y^{<1>} = “well”). (7.1.17)

For the third time step, the input will be y^{<1>} = “well”, y^{<2>} = “hello”. Following the same pattern, the prediction will again take the form of a 10k-way softmax classifier, where

ŷ^{<3>} = P(<each word> | y^{<1>} = “well”, y^{<2>} = “hello”). (7.1.18)

On the backward pass, we will look to maximize the predictions

ŷ^{<1>} = well, ŷ^{<2>} = hello, ŷ^{<3>} = there. (7.1.19)

To compute the probability of an entire sentence, we can calculate

P(y^{<1>}) · P(y^{<2>} | y^{<1>}) · P(y^{<3>} | y^{<1>}, y^{<2>}) ··· P(y^{<T>} | y^{<1>}, …, y^{<T−1>}). (7.1.20)

7.1.4 Sampling novel sequences

To sample a sequence, we will use a similar RNN model to the one just described.
However, we will now change the input at each time step to be the model's own previous predictions, as shown in Figure 7.1.4. We will also make each prediction a random draw from the output probability distribution, in order to avoid an infinite loop of the same few words. In NumPy, we can use np.random.choice to draw a random value from the predicted distribution. To end the sampling, we can wait until an end-of-sentence token is produced, iterate for a fixed number of time steps, or run the program for a specific amount of time.

Figure 7.1.4: RNN model for sequence generation. Here, the input at time step t is the output from the previous time steps.

Up until this point, we have focused on using a word-level vocabulary list; however, it is also common to use a character-level vocabulary list, such as all of the ASCII characters. An advantage of character-level models is that we never need an unknown-word token. The drawbacks of the character-level approach are that it is worse at capturing long-range dependencies, such as opening and closing quotes, and that it is more computationally expensive.

7.1.5 Vanishing gradients with RNNs

When working with RNNs, it is not unrealistic to have hundreds or thousands of time steps in a given model. Recall the problem with deep neural networks that had many layers, where we kept multiplying by smaller and smaller numbers as we backpropagated further. We can think of the number of layers in a plain neural network as analogous to the number of time steps in an RNN. With our current RNN approach, it is therefore much harder to represent long-term dependencies in our data, such as opening and closing quotes.

7.1.6 GRU: Gated recurrent unit

So far we have been using RNNs whose hidden unit takes the form

a^{<t>} = tanh(W_a [a^{<t−1>}, x^{<t>}] + b_a). (7.1.21)

We can picture this as a black-box calculation at each time step t, as shown in Figure 7.1.5.

Figure 7.1.5: Our current approach to RNN calculations, where the gray box represents a hidden black-box calculation.

However, this model is not adept at capturing long-term dependencies in the data. Suppose we are working with the sentence

x: The cat, which already ate, was full. (7.1.22)

If we changed the word cat to cats, then we would also need to update was to were. With a GRU, a^{<t>} will be treated like a memory cell. We will set

ã^{<t>} = tanh(W_a [a^{<t−1>}, x^{<t>}] + b_a) (7.1.23)

as a potential candidate for a^{<t>}. Next, we will create an update gate

Γ_u = σ(W_u [a^{<t−1>}, x^{<t>}] + b_u), (7.1.24)

where the majority of values in Γ_u will be close to 0 or 1, due to the saturating behavior of the sigmoid function. The update gate ultimately determines whether we update a^{<t>}. For example, suppose the model learns to denote singular subjects with a^{<t>} = 1. Then, in our working sentence, we may have something along the lines of:

| t | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| x^{<t>} | The | cat | which | already | ate | was | full |
| a^{<t>} | − | 1 | 1 | 1 | 1 | 1 | − |

Notice that the update gate keeps denoting that our subject is singular at later time steps t. If at some future time t we run into a plural phrase, it is the job of Γ_u to update a^{<t>}. Specifically, our update will be

a^{<t>} = Γ_u ∗ ã^{<t>} + (1 − Γ_u) ∗ a^{<t−1>}, (7.1.25)

where ∗ is an element-wise multiplication between vectors. Notice that when Γ_u = 1, we update a^{<t>} to be ã^{<t>}, as at time step 2 in the example above. When Γ_u = 0, we keep a^{<t>} at the same value, as in time steps 3 through 6 above. An important note is that ã^{<t>}, a^{<t>}, and Γ_u are all vectors, which can learn many different types of long-term dependencies. For example, the model can learn how to open and close brackets and quotes, and to use singular or plural phrases where appropriate. The computation graph for GRUs is shown in Figure 7.1.6.

Figure 7.1.6: Simplified GRU computation graph at a time step t.

There is one other part of the GRU that we have left out so far: determining how relevant the last memory cell a^{<t−1>} is for the calculation of the next candidate memory cell ã^{<t>}. Here, we add a relevance gate Γ_r, giving the full GRU:

ã^{<t>} = tanh(W_a [Γ_r ∗ a^{<t−1>}, x^{<t>}] + b_a) (7.1.26)
Γ_u = σ(W_u [a^{<t−1>}, x^{<t>}] + b_u) (7.1.27)
Γ_r = σ(W_r [a^{<t−1>}, x^{<t>}] + b_r) (7.1.28)
a^{<t>} = Γ_u ∗ ã^{<t>} + (1 − Γ_u) ∗ a^{<t−1>}. (7.1.29)
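A NumPy sketch of the full GRU step (Equations 7.1.26 through 7.1.29), where the stacking [·, ·] from Equation 7.1.8 becomes np.concatenate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x_t, a_prev, Wa, Wu, Wr, ba, bu, br):
    """One GRU step: relevance gate, update gate, candidate, then blend."""
    stacked = np.concatenate([a_prev, x_t])
    gamma_r = sigmoid(Wr @ stacked + br)                       # Equation 7.1.28
    gamma_u = sigmoid(Wu @ stacked + bu)                       # Equation 7.1.27
    a_tilde = np.tanh(Wa @ np.concatenate([gamma_r * a_prev, x_t]) + ba)
    return gamma_u * a_tilde + (1 - gamma_u) * a_prev          # Equation 7.1.29

n_a, n_x = 4, 3
Wa, Wu, Wr = (np.random.randn(n_a, n_a + n_x) for _ in range(3))
ba = bu = br = np.zeros(n_a)
a = gru_step(np.random.randn(n_x), np.zeros(n_a), Wa, Wu, Wr, ba, bu, br)
```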
#### 7.1.5 Vanishing gradients with RNNs

When we are working with RNNs, it is not unrealistic to have hundreds or thousands of time steps in a given model. Recall the problem with deep neural networks that have many layers, where we continue to multiply by smaller and smaller numbers as we backpropagate further. The number of layers in a feedforward network plays a role similar to the number of time steps in an RNN. With our current RNN approach, it is therefore much harder to represent long-term dependencies in our data, such as matching opening and closing quotes.

#### 7.1.6 GRU: Gated recurrent unit

So far, we have been using RNNs whose hidden unit has the form

$$a^{<t>} = \tanh\left(W_a \left[a^{<t-1>}, x^{<t>}\right] + b_a\right). \qquad (7.1.21)$$

Figure 7.1.5: Our current approach to RNN calculations, where the gray box represents a hidden black-box calculation.

We can picture this as a black-box calculation at each time step t, as shown in Figure 7.1.5. However, this model is not adept at capturing long-term dependencies in the data. Suppose we are working with the following sentence:

$$x: \text{The cat, which already ate, was full.} \qquad (7.1.22)$$

If we slightly changed this sentence so that cat became cats, then we would also need to update was to were. With a GRU, a<t> is treated like a memory cell. We set

$$\tilde{a}^{<t>} = \tanh\left(W_a \left[a^{<t-1>}, x^{<t>}\right] + b_a\right) \qquad (7.1.23)$$

as a potential candidate for a<t>. Next, we create an update gate

$$\Gamma_u = \sigma\left(W_u \left[a^{<t-1>}, x^{<t>}\right] + b_u\right), \qquad (7.1.24)$$

where the majority of values in Γu will be close to 0 or 1, because the sigmoid function saturates for inputs of large magnitude. The update gate ultimately determines whether we update a<t>. For example, suppose the model learns that singular subjects are denoted with a<t> = 1. Then, in our working sentence we may have something along the lines of:

| t | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| x | The | cat | which | already | ate | was | full |
| a<t> | − | 1 | 1 | 1 | 1 | 1 | − |

Notice that the memory cell continues to denote that our subject is singular at later time steps t. If at some future time t we run into a plural phrase, it is the job of Γu to update a<t>. Specifically, our update is

$$a^{<t>} = \Gamma_u * \tilde{a}^{<t>} + (1 - \Gamma_u) * a^{<t-1>}, \qquad (7.1.25)$$

where ∗ is an element-wise multiplication between the vectors. Notice here that when Γu = 1, we update a<t> to be ã<t>, as at time step 2 in the example above. When Γu = 0, we keep a<t> at its previous value, as at time steps 3 through 6. An important note is that ã<t>, a<t>, and Γu are all vectors, so they can learn many different types of long-term dependencies at once. For example, we can learn how to open and close brackets and quotes, and to use singular or plural phrases where appropriate. The computation graph for GRUs is shown in Figure 7.1.6.

Figure 7.1.6: Simplified GRU computational graph at a time step t.

There is one other part to the GRU that we have left out so far, and that is to determine how relevant the last memory cell a<t−1> is for the calculation of the next, potential, memory cell ã<t>. Here, we add a relevance gate Γr, giving the full GRU:

$$\tilde{a}^{<t>} = \tanh\left(W_a \left[\Gamma_r * a^{<t-1>}, x^{<t>}\right] + b_a\right) \qquad (7.1.26)$$
$$\Gamma_u = \sigma\left(W_u \left[a^{<t-1>}, x^{<t>}\right] + b_u\right) \qquad (7.1.27)$$
$$\Gamma_r = \sigma\left(W_r \left[a^{<t-1>}, x^{<t>}\right] + b_r\right) \qquad (7.1.28)$$
$$a^{<t>} = \Gamma_u * \tilde{a}^{<t>} + (1 - \Gamma_u) * a^{<t-1>}. \qquad (7.1.29)$$
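Putting (7.1.26)–(7.1.29) together, one full GRU step is only a few lines of NumPy. This is a minimal sketch under assumed shapes (the weight layout and the `sigmoid` helper are our own conventions); [·, ·] is vector concatenation, as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, a_prev, Wa, ba, Wu, bu, Wr, br):
    """One GRU time step, following (7.1.26)-(7.1.29).
    x      : input x<t>, shape (n_x,)
    a_prev : previous memory cell a<t-1>, shape (n_a,)
    W*     : weights of shape (n_a, n_a + n_x); b* of shape (n_a,)
    """
    gamma_u = sigmoid(Wu @ np.concatenate([a_prev, x]) + bu)  # update gate
    gamma_r = sigmoid(Wr @ np.concatenate([a_prev, x]) + br)  # relevance gate
    # Candidate memory cell, with the relevance gate scaling a<t-1>:
    a_tilde = np.tanh(Wa @ np.concatenate([gamma_r * a_prev, x]) + ba)
    # Element-wise interpolation between the candidate and the old cell:
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev
```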
#### 7.1.7 LSTM: Long short-term memory

In addition to GRUs, LSTMs are also commonly used to store long-term dependencies in sequential data. Figure 7.1.7 shows an example of an LSTM model. We can think of the LSTM as an extension of the GRU with two hidden units, h1 and h2. We use three separate gates, a forget gate Γf, an update gate Γu, and an output gate Γo, each defined as

$$\Gamma_u = \sigma\left(W_u \cdot [x_t, h^2_{t-1}]\right) \qquad (7.1.30)$$
$$\Gamma_f = \sigma\left(W_f \cdot [x_t, h^2_{t-1}]\right) \qquad (7.1.31)$$
$$\Gamma_o = \sigma\left(W_o \cdot [x_t, h^2_{t-1}]\right). \qquad (7.1.32)$$

With LSTMs, we remove the relevance gate Γr because it often does not improve the performance of a model in practice. For our first hidden unit h1 (the internal memory cell), we set its replacement candidate based on the other hidden unit h2 at time t − 1 and the new input x_t, which we define as

$$\tilde{h}^1_t = \tanh\left(W_c \cdot [x_t, h^2_{t-1}]\right). \qquad (7.1.33)$$

Figure 7.1.7: LSTM model. (a) A single LSTM cell; (b) stacking multiple LSTM cells in a row.

Now, instead of using a single update gate that determines both how much of the previous cell we want to forget and how much of the replacement candidate we want to remember, we use two separate gates to define these quantities. In particular, we define our first hidden unit to be

$$h^1_t = \Gamma_u \otimes \tilde{h}^1_t + \Gamma_f \otimes h^1_{t-1}. \qquad (7.1.34)$$

Our second hidden unit depends on h1 and is passed into the softmax function to produce the output y_t, which we define as

$$h^2_t = \Gamma_o \otimes \tanh\left(h^1_t\right). \qquad (7.1.35)$$

Here, the output gate determines the relevance of each component of the memory cell for the output. The blue path shown in Figure 7.1.7b connects the first hidden unit across multiple LSTM cells. This path gives an intuitive feel for why LSTMs capture long-range dependencies: multiple gates control what we store in the internal memory, making it harder to forget everything completely.
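In the same spirit as the GRU sketch, one LSTM step under the two-hidden-unit notation of (7.1.30)–(7.1.35) might look as follows; the weight shapes are again our own assumptions, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h1_prev, h2_prev, Wc, Wu, Wf, Wo):
    """One LSTM time step in the notation of (7.1.30)-(7.1.35).
    h1 is the internal memory cell; h2 is the output hidden unit.
    Each W* has shape (n_h, n_x + n_h)."""
    concat = np.concatenate([x, h2_prev])
    gamma_u = sigmoid(Wu @ concat)                 # update gate (7.1.30)
    gamma_f = sigmoid(Wf @ concat)                 # forget gate (7.1.31)
    gamma_o = sigmoid(Wo @ concat)                 # output gate (7.1.32)
    h1_tilde = np.tanh(Wc @ concat)                # candidate memory (7.1.33)
    h1 = gamma_u * h1_tilde + gamma_f * h1_prev    # (7.1.34)
    h2 = gamma_o * np.tanh(h1)                     # (7.1.35)
    return h1, h2
```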
#### 7.1.8 BRNNs: Bidirectional RNNs

Recall the two sentences we used earlier:

$$\begin{aligned} &\text{He said, “Teddy bears are on sale!”} \\ &\text{He said, “Teddy Roosevelt was a great President!”} \end{aligned} \qquad (7.1.36)$$

where it is not enough to predict whether the word Teddy is a name based only on the information before the word occurs. In this case, we can use a bidirectional RNN (BRNN) to make predictions with the words both before and after Teddy. Figure 7.1.8 shows the diagram for a BRNN with a fixed input length of 4. The BRNN is broken up into two separate RNNs. We have a forward RNN, with activations of the form $\overrightarrow{a}^{<t>}$, that takes into account all of the previous time step inputs. We also have another RNN, with activations of the form $\overleftarrow{a}^{<t>}$, that looks at all the future time steps in reverse order. For example, if our sentence is

$$\text{He said, “Teddy Roosevelt!”}, \qquad (7.1.37)$$

then at t = 3 we can form $\overrightarrow{a}^{<3>}$ from {He, said, Teddy}, and we can form $\overleftarrow{a}^{<3>}$ from {Roosevelt}. We can then make an output prediction, such as the meaning of the word Teddy, in the form of

$$\hat{y}^{<3>} = g\left(W \cdot \left[\overrightarrow{a}^{<3>}, \overleftarrow{a}^{<3>}\right]\right). \qquad (7.1.38)$$

One of the downsides to using a BRNN is that we are required to supply the entire input sequence at the start.

Figure 7.1.8: BRNN with an input of length 4. (a) Forward and backward connections; (b) the forward RNN connections, with the sequence in order; (c) the backward RNN connections, with the sequence in reverse order.

#### 7.1.9 Deep RNNs

Figure 7.1.9 shows the two primary types of deep RNNs, where we stack multiple activations at each time step. The first type of network, in Figure 7.1.9a, has three stacked activations at each time step that are each connected to the next time step. Here, each layer has its own weights, which are shared across time steps. To calculate the hidden activation in layer ℓ at time step t, we use

$$a^{[\ell]<t>} = g\left(W^{[\ell]} \cdot \left[a^{[\ell]<t-1>}, a^{[\ell-1]<t>}\right]\right), \qquad (7.1.39)$$

which takes the activations to the left (same layer, previous time step) and below (previous layer, same time step) as input. The second type of network, in Figure 7.1.9b, has its first few layers connecting to future time steps, while the deeper layers are connected only within a single time step. The deeper layers are not connected across time steps because connecting them would be computationally expensive.

Figure 7.1.9: Deep RNNs. (a) Multiple stacked layers of activations, each connected to the previous and next time steps; (b) a deep RNN where the first few layers connect across time steps, while the deeper layers act within a single time step.

### 7.2 Word embedding

### 7.3 Word representation

Up until this point, we have represented words with a dictionary, where we specify a word using a one-hot encoding vector. One of the problems with this representation is that the words have no connection with each other. For example, suppose we input

$$\text{I want a glass of orange } \_\_\_ \qquad (7.3.1)$$

into our network, where the goal of the network is to predict the blank. With a well-trained network, the likely prediction is the word juice. But suppose we now have the input

$$\text{I want a glass of apple } \_\_\_ \qquad (7.3.2)$$

and our model does not know about apple juice. In this situation, if our model knows that there is some connection between orange and apple, then it would still likely predict juice.
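To see the problem concretely: any two distinct one-hot vectors have inner product zero, so no similarity between words can be read off this representation. A small illustration follows; the 10k vocabulary size matches the running example, while the word indices are made up.

```python
import numpy as np

VOCAB_SIZE = 10_000

def one_hot(index, size=VOCAB_SIZE):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Hypothetical vocabulary indices for two related words:
orange = one_hot(6257)
apple = one_hot(456)

# Distinct one-hot vectors are always orthogonal, so "orange" looks
# exactly as unrelated to "apple" as to any other word:
print(orange @ apple)         # 0.0
print(orange @ one_hot(999))  # 0.0
```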
One approach to learning these connections is to discover a feature representation for each word. Table 7.1 shows an example of a potential encoding that our model may find. The table lists the words king, queen, apple, and orange, along with their responses to a set of features. For example, for a royal feature, king and queen have a high response; alternatively, for a food feature, apple and orange have a high response. If we have 300 features, then we can represent each word with a vector of length 300. Now, due to the similarities between feature vectors, our network is more likely to generalize between similar words.

|       | King | Queen | Apple | Orange |
|-------|------|-------|-------|--------|
| Royal | 0.93 | 0.95  | −0.01 | 0.00   |
| Age   | 0.70 | 0.71  | 0.03  | −0.02  |
| Food  | 0.01 | −0.03 | 0.95  | 0.97   |

Table 7.1: Sample words in a dictionary (top) with their responses to each feature (left), each ranging over [−1, 1].

#### 7.3.1 Using word embeddings

To learn the features for a set of words, it is common to use transfer learning. There are many freely available word embeddings that were trained on text corpora of 1B to 100B words. Once we have word embeddings, we can transfer them to a smaller task and fine-tune the parameters if necessary.

#### 7.3.2 Properties of word embeddings

Once we have the word embeddings, one neat thing we can do is examine the difference between two words. Suppose, for example, that one of our features gives the value 1 for words describing orange-colored objects, −1 for words describing objects of some other color, and 0 for objects that are not defined by a color. Table 7.2 shows how we can spot the feature differences between the apple and orange vectors. In particular, if we subtract the two vectors, then the features that are largest in magnitude correspond to the feature differences between the words, while entries near 0 indicate features on which the words are the same.

|                | Apple | Orange | Apple − Orange (approximate) |
|----------------|-------|--------|------------------------------|
| Orange colored | −0.96 | 0.98   | −2                           |
| Royal          | −0.01 | 0.00   | 0                            |
| Age            | 0.03  | −0.02  | 0                            |
| Food           | 0.95  | 0.97   | 0                            |

Table 7.2: To compare two words, we take the difference between their feature vectors: the features with the largest difference in magnitude are where the words are most different, while differences near 0 indicate features on which the words are quite similar.
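The subtraction in Table 7.2 is a one-liner once words are stored as feature vectors. The sketch below hard-codes the four hypothetical features from Tables 7.1 and 7.2:

```python
import numpy as np

# Hypothetical 4-feature embeddings, with entries ordered as
# [orange colored, royal, age, food] (values from Tables 7.1 and 7.2):
embedding = {
    "apple":  np.array([-0.96, -0.01,  0.03, 0.95]),
    "orange": np.array([ 0.98,  0.00, -0.02, 0.97]),
}
features = ["orange colored", "royal", "age", "food"]

diff = embedding["apple"] - embedding["orange"]

# Large-magnitude entries mark where the words differ; entries near 0
# mark features on which they agree:
for name, d in zip(features, diff):
    print(f"{name:15s} {d:+.2f}")
# "orange colored" dominates the difference (about -1.94), while
# royal, age, and food all stay close to 0.
```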