Deep Learning

Stanford CS 230


Matt Deitke

Last updated: February 15, 2020


Acknowledgements

This set of notes follows the lectures from Stanford’s graduate-level course CS230: Deep

Learning. The course was taught by Andrew Ng and Kian Katanforoosh. The course

website is cs230.stanford.edu.



## Contents

• 1 Introduction
• 1.1 Deep learning
• 1.1.1 What is a neural network?
• 1.1.2 Supervised learning with neural networks
• 1.2 Logistic regression as a neural network
• 1.2.1 Notation
• 1.2.2 Binary classification
• 1.2.3 Logistic regression
• 1.2.4 Logistic regression cost function
• 1.2.6 Computation graphs
• 1.2.7 Derivatives with a computational graph
• 1.2.8 Logistic regression gradient descent
• 1.2.9 Gradient descent on m examples
• 1.3 Python and vectorization
• 1.3.1 Vectorization
• 1.3.2 Vectorizing logistic regression
• 1.3.3 Vectorizing logistic regression’s gradient computation
• 1.3.5 A note on NumPy vectors
• 2 Neural networks
• 2.1 Deep learning intuition
• 2.1.1 Day’n’night classification
• 2.1.2 Face verification
• 2.1.3 Face recognition
• 2.1.4 Art generation (neural style transfer)
• 2.1.5 Trigger word detection
• 2.2 Shallow neural networks
• 2.2.1 Neural networks overview
• 2.2.2 Neural network representations
• 2.2.3 Computing a neural network’s output
• 2.2.4 Vectorizing across multiple examples
• 2.2.5 Activation functions
• 2.2.6 Why use non-linear activation functions?
• 2.2.7 Derivatives of activation functions
• 2.2.8 Gradient descent for neural networks
• 2.2.9 Random initialization
• 2.3 Deep neural networks
• 2.3.1 Forward propagation in a deep network
• 2.3.2 Getting the matrix dimensions right
• 2.3.3 Why deep representations?
• 2.3.4 Forward and backward propagation
• 2.3.5 Parameters vs hyperparameters
• 2.3.6 Connection to the brain
• 3 Optimizing and structuring a neural network
• 3.1 Full-cycle deep learning projects
• 3.2 Setting up a machine learning application
• 3.2.1 Training, development, and testing sets
• 3.2.2 Bias and variance
• 3.2.3 Combatting bias and variance
• 3.3 Regularizing a neural network
• 3.3.1 Regularization
• 3.3.2 Why regularization reduces overfitting
• 3.3.3 Dropout regularization
• 3.3.4 Other regularization methods
• 3.4 Setting up an optimization problem
• 3.4.1 Normalizing inputs
• 3.4.2 Data vanishing and exploding gradients
• 3.4.3 Weight initialization for deep neural networks
• 3.5 Optimization algorithms
• 3.5.2 Exponentially weighted averages
• 3.5.3 Bias correction for exponentially weighted averages
• 3.5.4 Gradient descent with momentum
• 3.5.5 RMSprop
• 3.5.7 Learning rate decay
• 3.5.8 The problem with local optima
• 3.6 Hyperparameter tuning
• 3.6.1 Tuning process
• 3.6.2 Using an appropriate scale to pick hyperparameters
• 3.6.3 Hyperparameter tuning in practice: pandas vs caviar
• 3.7 Batch normalization
• 3.7.1 Normalizing activations in a network
• 3.7.2 Fitting batch norm in a neural network
• 3.7.3 Batch norm at test time
• 3.8 Multi-class classification
• 3.8.1 Softmax regression
• 3.8.2 Training a softmax classifier
• 4 Applied deep learning
• 4.1 Orthogonalization
• 4.2 Setting up goals
• 4.2.1 Single number evaluation metric
• 4.2.2 Train/dev/test distributions
• 4.3 Comparing to human-level performance
• 4.3.1 Avoidable bias
• 4.3.2 Understanding human-level performance
• 4.3.3 Surpassing human-level performance
• 4.4 Error analysis
• 4.4.1 Cleaning up incorrectly labeled data
• 4.5 Mismatched training and dev set
• 4.5.1 Bias and variance with mismatched data distributions
• 4.6 Learning from multiple tasks
• 4.6.1 Transfer learning
• 5 Convolutional neural networks
• 5.1 Edge detection
• 5.3 Strided convolutions
• 5.4 Cross-correlation vs convolution
• 5.5 Convolutions over volume
• 5.6 One layer convolution network
• 5.7 Pooling layers
• 5.8 Why we use convolutions
• 5.9 Classic networks
• 5.9.1 LeNet-5
• 5.9.2 AlexNet
• 5.9.3 VGG-16
• 5.9.4 ResNet
• 5.9.5 1 × 1 convolution
• 5.9.6 Inception network
• 5.10 Competitions and benchmarks
• 6 Detection algorithms
• 6.1 Object localization
• 6.2 Landmark detection
• 6.3 Object detection
• 6.4 Sliding windows with convolution
• 6.5 Bounding box predictions
• 6.6 Intersection over Union (IoU)
• 6.7 Non-max suppression
• 6.8 Anchor boxes
• 6.9 YOLO algorithm
• 7 Sequence models
• 7.1 Recurrent neural networks
• 7.1.1 Backpropagation through time
• 7.1.2 Variational RNNs
• 7.1.3 Language modeling
• 7.1.4 Sampling novel sequences
• 7.1.5 Vanishing gradients with RNNs
• 7.1.6 GRU: Gated recurrent unit
• 7.1.7 LSTM: Long short-term memory
• 7.1.8 BRNNs: Bidirectional RNNs
• 7.1.9 Deep RNNs
• 7.2 Word embedding
• 7.3 Word representation
• 7.3.1 Using word embeddings
• 7.3.2 Properties of word embeddings

Chapter 1

## Introduction

The gist of deep learning, and the algorithms behind it, have been around for decades. However, as we started to add more data to neural networks, they began to perform much better than traditional machine learning algorithms. With advances in GPU computing and the amount of data available, training larger neural networks is easier than ever before. With more data, larger neural networks have been shown to outperform all other machine learning algorithms.

[Figure: performance vs. amount of data for large, medium, and small neural networks and traditional ML.]

Figure 1.0.1: With more data, neural networks have performed better than traditional machine learning.

Artificial intelligence can be broken down into a few subfields, including deep learning (DL), machine learning (ML), probabilistic graphical models (PGM), planning agents, search algorithms, knowledge representation (KR), and game theory. The only subfields that have dramatically improved in performance are deep learning and machine learning. We can see the rise of artificial intelligence in many industries today; however, there is still a long way to go. A company that uses neural networks is not necessarily an artificial intelligence company. The best AI companies are good at strategic data acquisition, putting data together, spotting opportunities for automation, and writing job descriptions around technologies at the forefront of the field.

[Figure: performance over time for (a) DL and ML, (b) PGM, (c) planning agents, (d) search algorithms, (e) knowledge representation, and (f) game theory.]

Figure 1.0.2: The performance of deep learning and machine learning algorithms has been exploding in the last few years, in comparison to other branches of artificial intelligence.

### 1.1 Deep learning

#### 1.1.1 What is a neural network?

The aim of a neural network is to learn representations that best predict the output y, given a set of features as the input x.

[Figure: a single "neuron" maps size (x) to price (y).]

Figure 1.1.1: Simplest possible neural network, where we are given the size of a home and predict its price.

Figure 1.1.1 represents the simplest possible neural network. The neuron is the part of the neural network that tries to learn a function that maps x → y. Here, we only have one neuron, which is why it is the simplest case. We can make more complicated neural networks by stacking neurons. Consider the case where we have not only the size but also the number of bedrooms, zip code, and wealth. We can represent the extra inputs inside of a neural network as shown in Figure 1.1.2.

Neural networks work well because we only need to feed the algorithm supervised data, without specifying what all the intermediate values may be, as we did in Figure 1.1.2 with family size, walkability, and schooling. Instead of connecting together only a few inputs, we can connect all of the inputs together. Layers with all inputs connected to the output are known as fully-connected layers. The general structure of a neural network with fully-connected layers can be seen in Figure 1.1.3.

[Figure: inputs (size, bedrooms, zip code, wealth) feed intermediate nodes (family size, walkability, schooling), which predict the price (y).]

Figure 1.1.2: In a neural network with more neurons, the intermediate connections between inputs may represent their own values.

[Figure: input layer (x1 through x4), hidden layer, and output layer (y).]

Figure 1.1.3: Neural network with fully-connected layers


#### 1.1.2 Supervised learning with neural networks

Almost all of the hype around machine learning has been centered around supervised learning. A supervised learning algorithm takes in a set of features and outputs a number or a set of numbers. Some examples can be seen in Table 1.1.

| Input (x) | Output (y) | Application |
| --- | --- | --- |
| Home features | Price | Real estate |
| Ad, user info | Click on ad? (0/1) | Online advertising |
| Image | Object (1, ..., 1000) | Photo tagging |
| Audio | Text transcript | Speech recognition |
| English | Chinese | Machine translation |
| Image, radar info | Position of other cars | Autonomous driving |

Table 1.1: Examples of supervised learning


As we have already seen, we can change the connections between layers to form different types of neural networks. Indeed, changing the structure of layers has led to many of the breakthroughs in deep learning, and we have formulated tasks for which it is beneficial not to use a neural network with fully-connected layers. For example, while for real estate and online advertising we may use a neural network with fully-connected layers, photo tagging may use convolutional neural networks (CNNs), sequenced data may use recurrent neural networks (RNNs), and for something more complicated like autonomous driving, we may use some custom hybrid neural network. Figure 1.1.4 shows the basic structure of different neural networks, although we will go more in-depth into how each works in this text.

Figure 1.1.4: Different types of neural network architectures


The data used in supervised learning can be either structured or unstructured. Structured data may come in the form of database entries, while unstructured data may be audio, images, or text. Historically, unstructured data like image recognition has often been a harder problem for computers to solve, while humans have little difficulty with these types of problems. Neural networks have given computers a lot of help when interpreting unstructured data.

Figure 1.1.5: Structured vs unstructured data


### 1.2 Logistic regression as a neural network

#### 1.2.1 Notation

Each training set will be composed of m training examples. We may denote m_train to be the number of training examples in the training set, and m_test to be the number of examples in the testing set. The ith training instance will be defined as (x^(i), y^(i)), where x^(i) ∈ R^(n_x) and each y^(i) ∈ {0, 1}. To make notation more concise, we will use X to denote the matrix formed by

X = [x^(1)  x^(2)  ⋯  x^(m)], (1.2.1)

where X ∈ R^(n_x × m). Similarly, we define Y as

Y = [y^(1)  y^(2)  ⋯  y^(m)], (1.2.2)

where Y ∈ R^(1 × m).


#### 1.2.2 Binary classification

A binary classifier takes some input and classifies that input into one of two categories, typically yes (1) or no (0). For example, an image may be classified as a cat or non-cat. We will use y to denote the output as either a 0 or 1.

Figure 1.2.1: Colored images can be represented with the addition of red, green, and blue channels.

Each color can be represented using red, green, and blue. Each pixel in a colored image can be represented as a tuple storing the amount of red, green, and blue in that pixel; we say that the red, green, and blue channels correspond to the amount of that color in each pixel, as shown in Figure 1.2.1. For an image of size n × m, we will define our input as a single column vector x that stacks the red values on the blue values on the green values, so that x ∈ R^(n_x) with n_x = 3nm.

Suppose we have a matrix A ∈ R^(3×4), (1.3.14)

and we want to find the percentage each element contributes to the sum of its column. For

example, A_11 = 56 / (56 + 1 + 2). To calculate the sum of each column, we would type

s = A.sum(axis=0). (1.3.15)


Axis 0 refers to the vertical columns inside of a matrix, while axis 1 refers to the horizontal rows. Calling the method in Equation 1.3.15 will produce a vector of 4 elements; however, we would like to convert from R^4 → R^(1×4). To make this conversion, call

s = s.reshape(1, 4). (1.3.16)


Now we can perform A/s in order to get the percentage each element contributes to the sum of its column. Dividing the matrix A ∈ R^(3×4) by the row vector s ∈ R^(1×4) will divide each of the three elements in a column by the value in the corresponding column of s. Since their sizes are different, this is known as broadcasting. A few other examples of broadcasting include adding a vector and a real number,

(v_1, v_2, ..., v_n) + c = (v_1 + c, v_2 + c, ..., v_n + c), (1.3.17)

and adding a matrix in R^(m×n) with a row vector in R^(1×n). (1.3.18)

Notice that when applying an operation between a matrix in R^(m×n) and a vector in R^(1×n), the vector will expand to a matrix in R^(m×n), where each row has the same elements. Similarly, when combining a matrix in R^(m×n) with a column vector in R^(m×1), the vector will expand to be in R^(m×n), where each column is identical. For example, the column vector (1; 2; 3) expands to a matrix whose columns are all (1; 2; 3).
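As a rough sketch of these broadcasting rules in NumPy (the matrix values below are placeholders for illustration, not the ones used in the lecture):

```python
import numpy as np

# Hypothetical 3x4 matrix standing in for the matrix A discussed above.
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

s = A.sum(axis=0)            # column sums, shape (4,)
s = s.reshape(1, 4)          # make it an explicit 1x4 row vector
percentages = 100 * A / s    # broadcasting: (3,4) divided element-wise by (1,4)
print(percentages.sum(axis=0))  # each column now sums to 100

# Other broadcasting examples: vector plus scalar, matrix plus row vector.
v = np.array([1.0, 2.0, 3.0])
print(v + 100)                           # the scalar expands to [100, 100, 100]
M = np.zeros((3, 4))
print(M + np.array([[1., 2., 3., 4.]]))  # the row vector expands to every row
```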


#### 1.3.5 A note on NumPy vectors

Consider the example

a = np.random.randn(5). (1.3.20)


Calling a.shape will return the shape as (5,), which is considered a rank one array: it is neither a row vector nor a column vector. If we call a.T, nothing changes, and the array will still have the shape (5,). Instead of using a rank one array, it is recommended to use explicit column or row vectors (proper matrices) when dealing with neural networks. If we wanted a matrix of size R^(5×1) instead of the rank one array in Equation 1.3.20, we would call

a = np.random.randn(5, 1). (1.3.21)


One good tip when dealing with a matrix of an unknown size is to first assert that it is the size we want, by calling

assert(a.shape == (5, 1)). (1.3.22)
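A minimal sketch of the behavior described above (the values are random, so only the shapes matter):

```python
import numpy as np

a = np.random.randn(5)       # rank one array
print(a.shape)               # (5,)
print(a.T.shape)             # still (5,): transposing changes nothing

b = np.random.randn(5, 1)    # a proper 5x1 column vector
print(b.shape)               # (5, 1)
print(b.T.shape)             # (1, 5): transposing now gives a row vector

assert b.shape == (5, 1)     # guard against shape bugs

# A rank one array can always be promoted to a column vector explicitly.
a = a.reshape(5, 1)
assert a.shape == (5, 1)
```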


Chapter 2

## Neural networks

### 2.1 Deep learning intuition

A model for deep learning is based on an architecture and parameters. The architecture is the algorithmic design we choose, like logistic regression, linear regression, or shallow neural networks. The parameters are the weights and biases of the model, which takes in an instance as input and attempts to classify it correctly based on the parameters.

[Figure: a model maps an input to an output and is scored by a loss. Things that can change: the activation function, optimizer, hyperparameters, loss function, input, output, and architecture.]

Figure 2.1.1: Where the model fits in and what can be adjusted


Consider if we wanted to use a multi-class classifier that not only predicts whether an image is a cat or not a cat, but instead predicts whether the image is a cat, dog, or giraffe. Recall that when using a binary classifier, our weights were a single column vector. To handle three classes, we instead use a weight matrix with one column per class,

W = [ w_cat  w_dog  w_giraffe ], (2.1.2)

so our weights are in R^(n_x × 3).

To update our labels, we have a few options. First, we could let y ∈ {0, 1, 2}, where 0 corresponds to a cat image, 1 corresponds to a dog image, and 2 corresponds to a giraffe image. The second option is one-hot encoding, where we have an array with each index mapping to an animal. With one-hot encoding, only one item in each labeled array will have a value of 1 (the picture is of that animal), while the rest are 0 (the picture is not of that animal). For example, if we have an image of a cat, our label would be

y = (1, 0, 0)   (cat, dog, giraffe). (2.1.3)

While both of these options would work well in most cases, if we have a picture of both a cat and a dog in the same image, our classifier will not work. Instead, we will use multi-hot encoding, which works similarly to one-hot encoding, with the difference being that multiple values can take on the value 1. So now, if we have a picture of both a cat and a dog at the same time, we can encode our label as

y = (1, 1, 0)   (cat, dog, giraffe). (2.1.4)

The jth neuron in the ith layer will be labeled a_j^[i], as shown in Figure 2.1.2.

[Figure: inputs x_1^(i) through x_4^(i) feed a hidden layer a_1^[1] through a_5^[1], then a_1^[2], producing ŷ^(i).]

Figure 2.1.2: Notation for the course


As the layers increase, the activation neurons start to look at more complex items. For

example, if we are working with a face classifier, the first layer might only detect edges, the

second layer might put some of the edges together to find ears and eyes, and the third layer

might detect the entire face. The process of extracting data from the neurons is known as

encoding.

#### 2.1.1 Day’n’night classification

Our goal is to create an image classifier that predicts from a photo taken outside whether it was taken "during the day" (0) or "during the night" (1). From the cat example, we needed about 10,000 images to create a good classifier, so we estimate this problem is about as difficult and we will need about 10,000 images to classify everything correctly. We will split the data up into both a training and a testing set. In this case, about 80% of the data will be in the training set. In cases with more training data, we would give less of a percentage to the testing set and more of a percentage to the training set. We also need to make sure that the data is not split randomly, but rather split proportionally to the group the data is in (80% of day examples and 80% of night examples go in the training set).

For our input, we would like to work with images that have the lowest possible quality while still achieving good results. One clever way to choose this is to find the lowest quality at which humans can still perfectly identify the images and then use that quality. After comparing to human performance, we find that 64 × 64 × 3 pixels will work well as our image size.

Our output will be 0 for day and 1 for night. We will also use the sigmoid function as our last activation function because it maps a number in R to be between 0 and 1. A shallow neural network, one with only one hidden layer, should work pretty well for this example because it is a fairly straightforward task. For our loss function, we will use the logistic loss

L(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ). (2.1.5)
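As a small illustration, the sigmoid and the logistic loss of Equation 2.1.5 might be implemented along these lines (the eps clipping is an added numerical-stability detail, not something from the notes):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_hat, y, eps=1e-12):
    """Loss from Equation 2.1.5; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# A prediction close to the label gives a small loss, a confident wrong one a large loss.
print(logistic_loss(0.9, 1))   # ~0.105
print(logistic_loss(0.1, 1))   # ~2.303
```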


#### 2.1.2 Face verification

Our goal is to find a way for a school to use face verification for validating student IDs in facilities such as dining halls, gyms, and pools. The data we will need is a picture of every student labeled with their name. The input will be the data from the cameras as well as the identity the person claims to be. In general, faces have more detail than the night sky, so we will need the input to be a larger size. Using the human test we used last time, we find a solid resolution to be 412 × 412 × 3 pixels. We will make the output 1 if it is the correct person at the camera and 0 if it is not.

To compare the image we have of a person in our database with the image we have from a camera, we want to find a function to call on both of the images in order to determine if the person in the images is the same. We could directly compare the differences in pixels; however, that presents severe problems if the background lighting is different, the person grows facial hair, or the ID photo is outdated. Instead, we will use a deep network for our function, as illustrated in Figure 2.1.3.

[Figure: two images pass through the same deep network to produce encodings, and the distance between the encodings is compared; a small distance means the same person.]

Figure 2.1.3: The architecture we will use to classify images will be deep networks.


We will set the distance threshold to be some number, and any distance less than that number we will classify with y = 1. Because the school will not have tons of pictures of its students, we will use a public face dataset, where we want to find images of the same people to ensure that their encodings are similar, and images of different people to ensure that their encodings are different. To train our network, we will feed it triplets. Each instance will have an anchor (actual person), a positive example (the same person, minimize encoding distance), and a negative example (a different person, maximize encoding distance).

[Figure: an ordered triplet (anchor, positive, negative); the anchor-positive encoding distance is minimized and the anchor-negative distance is maximized.]

Figure 2.1.4: Ordered triplet input example


For our loss function, we can use

L = ||Enc(A) − Enc(P)|| − ||Enc(A) − Enc(N)|| + α, (2.1.6)

where ||·|| represents the L2 norm and Enc represents the encoding. The loss function presented makes sense because we want ||Enc(A) − Enc(P)|| to be minimized and ||Enc(A) − Enc(N)|| to be maximized. To minimize a quantity we want maximized, we take its negative, which gives us the term −||Enc(A) − Enc(N)||. The α term, called the margin, is used as a kickstart to our encoding function, to avoid the case where every weight in the deep network is zero and we end up with a perfect loss. It is common to keep loss functions positive, so we generally train on the maximum of the loss function and 0.
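A minimal sketch of this triplet loss (Equation 2.1.6), clipped at 0 as described above; the margin value and the toy encodings are hypothetical:

```python
import numpy as np

def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
    """Loss from Equation 2.1.6, clipped at 0.

    enc_a, enc_p, enc_n are encodings of the anchor, positive, and negative images.
    alpha (the margin) is a hypothetical value; the notes do not fix a number.
    """
    pos_dist = np.linalg.norm(enc_a - enc_p)   # should end up small
    neg_dist = np.linalg.norm(enc_a - enc_n)   # should end up large
    return max(pos_dist - neg_dist + alpha, 0.0)

# Toy encodings: the positive is close to the anchor, the negative is far away.
anchor   = np.array([0.10, 0.90, 0.30])
positive = np.array([0.12, 0.88, 0.31])
negative = np.array([0.90, 0.10, 0.70])
print(triplet_loss(anchor, positive, negative))  # 0.0: the negative is already far enough away
```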

[Figure: the anchor, positive, and negative images each pass through the network to produce Enc(A), Enc(P), and Enc(N).]

Figure 2.1.5: Updated model for face verification


#### 2.1.3 Face recognition

In our school, we want to use face identification to recognize students in facilities. Now, instead of just verifying who the person is given their ID, we need to identify the person out of many people. We will reuse a lot of the details from the face verification example, but instead of inputting 3 images and outputting 3 encodings, we will input one image and output the encoding of that image. Then, we will compare the encoding against our database, which contains encodings for every student, and use a k-nearest neighbors algorithm to make comparisons. To go over the entire database, the runtime complexity will be O(n).

For another example, suppose we have thousands of pictures on our phone of 20 different people and we want to group pictures of the same people into folders. Using the encoding of each image, we could run a k-means clustering algorithm to find groups within the vector space that correspond to the same people.

#### 2.1.4 Art generation (neural style transfer)

Given a picture, our goal is to make it beautiful. Suppose we have any data that we want. Our input will contain both content and the style we consider beautiful. Then, once we input a new content image, we would like to generate a styled image based on that content.

[Figure: (a) content image, (b) style image.]

Figure 2.1.6: Input for art generation


The architecture we will use is first based on a model that understands images very well. The existing ImageNet models, for instance, are well suited to an image understanding task. After the image is forward propagated through a few layers of the network, we will get information about the content of the image by looking at the neurons. We will call this information Content_C. Then, giving the style image to the deep network, we will use the Gram matrix to extract Style_S from the style image. More about the Gram matrix will be discussed later in the course.

We can define the loss function as

L = ||Content_C − Content_G|| + ||Style_S − Style_G||, (2.1.7)

where G denotes the generated image.

#### 2.1.5 Trigger word detection

When given a 10-second audio clip of speech, we want to detect when the word "active" has been said. To create a model, we will need a lot of 10-second audio clips with as much variety in the voices as possible. We can break the input down into three segments: when "active" is being said, when a word that is not "active" is being said, and when nothing is being said. The sample rate of the input would be chosen similarly to how we found the resolution of images, by determining the minimum viable amount that humans have no trouble with.

[Figure: a content image is passed through a pretrained deep network, the loss is computed, and the pixels are updated using gradients over many iterations.]

Figure 2.1.7: Art generation architecture


Figure 2.1.8: Input for the trigger word detection model. Green represents the time frame in which "active" is being said, red represents the time frame in which non-"active" words are being said, and black represents the time frame in which nothing is being said.

The output will be a set of 0s and then a 1 after the word "active" is said. This output is generally easier to debug and works better for continuous models. The last activation we will use is a sigmoid, and the architecture we will use should be a recurrent neural network. We can generate data for our model by collecting positive words, negative words, and background noise, and then adding different combinations of the data together to form 10-second clips.

### 2.2 Shallow neural networks

[Figure: inputs x_1^(i), x_2^(i), x_3^(i) feed a hidden layer a_1^[1], a_2^[1], a_3^[1], then a_1^[2], producing ŷ^(i).]

Figure 2.2.1: Shallow neural network


#### 2.2.1 Neural networks overview

In each layer of a neural network, we will calculate both the linear activation and a mapping of that linear activation to be between 0 and 1. To calculate the linear activation in the ith layer, we use

z^[i] = W^[i] x + b^[i]. (2.2.1)

[Figure: a neuron computes z = w^T x + b from its inputs and then applies σ(z) to produce a.]

Figure 2.2.2: Each neuron in an activation layer represents two computations: the linear weights computation plus the bias, and then squeezing that number between 0 and 1.

To calculate the non-linear activation in the ith layer, we use the sigmoid function

a^[i] = σ(z^[i]). (2.2.2)


Only once we get to the final layer would we calculate the loss.
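A minimal sketch of one layer's computation (Equations 2.2.1 and 2.2.2); the layer sizes here are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Compute z = W a_prev + b and a = sigma(z) for one layer."""
    z = W @ a_prev + b
    a = sigmoid(z)
    return z, a

# Hypothetical sizes: 3 inputs feeding a hidden layer of 4 neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))      # input column vector
W1 = rng.standard_normal((4, 3))     # weights: (neurons in layer) x (inputs)
b1 = np.zeros((4, 1))                # bias: one entry per neuron
z1, a1 = layer_forward(x, W1, b1)
print(a1.shape)                      # (4, 1), every entry in (0, 1)
```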

#### 2.2.2 Neural network representations

The hidden layer refers to values that are not observed in the training set. To denote the input layer x, we may also use the notation a^[0]. When we count neural network layers, we do not include the input layer in that count; however, we do include the output layer. We will also refer to the nodes in the hidden layers and output layer as neurons.

#### 2.2.3 Computing a neural network's output

In the shallow neural network pictured in Figure 2.2.5, for the hidden layer we must calculate

z_1^[1] = w_1^[1]T x + b_1^[1],   a_1^[1] = σ(z_1^[1]),
z_2^[1] = w_2^[1]T x + b_2^[1],   a_2^[1] = σ(z_2^[1]),
z_3^[1] = w_3^[1]T x + b_3^[1],   a_3^[1] = σ(z_3^[1]). (2.2.3)

Implementing the assignments in Equation 2.2.3 would require an inefficient for-loop. In-

stead, we will vectorize the process as follows

z^[1] = (w_1^[1]T; w_2^[1]T; w_3^[1]T) x + (b_1^[1]; b_2^[1]; b_3^[1]) = (w_1^[1]T x + b_1^[1]; w_2^[1]T x + b_2^[1]; w_3^[1]T x + b_3^[1]), (2.2.4)

where the rows w_j^[1]T are stacked into a single matrix.

The dimensions of the weight matrix will be the number of neurons in the hidden layer by the number of inputs. The bias vector will be a column vector with the same number of rows as the weight matrix. We will call the weight matrix W^[1] and the bias vector b^[1]. For our activation, it follows that

a^[1] = σ(z^[1]).

Chapter 6

## Detection algorithms

### 6.1 Object localization

For object localization, each label takes the form

y = (P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)^T, (6.1.1)

where P_c is the probability that we have an object of class 1, 2, or 3; b_x, b_y, b_h, b_w specify the bounding box; and c_1 (car?), c_2 (pedestrian?), c_3 (motorcycle?) indicate which class is present. A few labeled examples are shown in Figure 6.1.3.

For example, for a background image the label is

y = (0, None, None, None, None, None, None, None)^T,   (b) Background label

where the entries after P_c = 0 do not matter.

Figure 6.1.3: Our labels for different images


We will also define our loss function to be a squared-error loss,

L(ŷ, y) = Σ_{i=1}^{8} (ŷ_i − y_i)^2   if y_1 = 1,
L(ŷ, y) = (ŷ_1 − y_1)^2               if y_1 = 0, (6.1.2)

where the 8 comes from the size of our label.
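A minimal sketch of this loss (Equation 6.1.2); the label values below are hypothetical:

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss from Equation 6.1.2 for one label of length 8.

    If an object is present (y[0] == 1) all 8 entries are compared;
    otherwise only the object-presence entry P_c matters.
    """
    if y[0] == 1:
        return np.sum((y_hat - y) ** 2)
    return (y_hat[0] - y[0]) ** 2

# Hypothetical labels: (P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)
y_car   = np.array([1, 0.5, 0.6, 0.2, 0.3, 1, 0, 0])
y_hat   = np.array([0.9, 0.48, 0.62, 0.25, 0.28, 0.8, 0.1, 0.05])
y_empty = np.zeros(8)                         # background image
print(localization_loss(y_hat, y_car))        # all 8 entries contribute
print(localization_loss(y_hat, y_empty))      # only P_c contributes
```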

### 6.2 Landmark detection

Landmark detection works to find specific points in an image. Recently, landmark detection has been used to determine human emotions, determine the orientation of a moving body, and apply Snapchat filters. In order to create a Snapchat face filter, like the one shown in Figure 6.2.1c, we can train a neural network to determine specific landmark locations inside of an image. To determine landmark locations on a face, we first need to classify a face and then label each landmark's location in our training data, as shown in Figure 6.2.2, where each label with n landmarks would be in the form

y = (face?, ℓ_1x, ℓ_1y, ℓ_2x, ℓ_2y, ..., ℓ_nx, ℓ_ny)^T.

### 6.8 Anchor boxes

With two anchor boxes, each grid cell's label stacks one set of entries per anchor box,

y = (P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3 | P_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)^T, (6.8.1)

where the first eight entries correspond to anchor box 1 and the last eight correspond to anchor box 2.

Then, for each grid cell, we will use the IoU to determine which anchor box is closest to each bounding box. In Figure 6.8.1, the pedestrian would be assigned to anchor box 1 and the car would be assigned to anchor box 2. We then fill in the output as the labels from each bounding box. With our 3 × 3 grid, the output size will be 3 × 3 × 16, or 3 × 3 × (2 × 8), since there are two anchor boxes and each anchor box has 8 entries. When using two anchor boxes, our prediction will give us 2 bounding boxes in each grid cell.

We could always define more anchor boxes in practice, although with a larger grid size it is less likely for collisions between anchor boxes to occur. If we have more objects sharing the same grid cell than we have anchor boxes, or we have two bounding boxes with the same dimensions, our algorithm will not perform well.
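Since the anchor-box assignment above relies on IoU, here is a minimal sketch of that computation; the corner-coordinate convention and the box values are assumptions for illustration, not taken from the notes:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Assign a ground-truth box to whichever anchor shape it overlaps most.
anchors = [(0.0, 0.0, 0.2, 0.6),    # tall, thin anchor (pedestrian-like)
           (0.0, 0.0, 0.6, 0.2)]    # wide, flat anchor (car-like)
ground_truth = (0.0, 0.0, 0.55, 0.25)
best = int(np.argmax([iou(ground_truth, a) for a in anchors]))
print("assigned to anchor box", best + 1)   # prints 2 for this wide box
```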

### 6.9 YOLO algorithm

The YOLO (You Only Look Once) algorithm was introduced in 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. The algorithm combines bounding box predictions with anchor boxes and non-max suppression. By only having to move through the neural network in one pass, YOLO beat the running time of all previous state-of-the-art object detection algorithms.

Chapter 7

## Sequence models

Sequence models allow us to work with variable-length inputs or variable-length outputs. This feature greatly expands the number of problems we are able to tackle. For example, sequence models are used for speech recognition, where given a variable-length audio clip we transcribe the spoken words; music generation, where we can input a set of data and output a sequence of music; sentiment classification, where we could take in a written review and output a star rating; DNA sequence analysis, where we take a DNA sequence and label the part of the sequence corresponding to, say, a protein; machine (or language) translation, where we are given a sentence in one language and have to convert it into another; video activity recognition, where we are given a set of frames and want to output the activity; and named entity recognition, where we are given a piece of text and want to find the names within the text.

As an example, suppose we would like to build a sequence model for named entity recognition with the input

x: Harry Potter and Hermione Granger invented a new spell. (7.0.1)


We will denote the tth word of the input as x^<t>, starting at index 1. For the output y, we will use multi-hot encoding, where each word is labeled with a 1 or 0 as follows:

y: 1 1 0 1 1 0 0 0 0. (7.0.2)

The tth element of the output will similarly be referred to as y^<t>. The length of the input sequence will be denoted by T_x and the length of the output sequence by T_y. In this case, both T_x and T_y are equal to 9, although that is not always the case. If we want to refer to the tth word in the ith training example, we will write x^(i)<t>. Since the input and output lengths can vary across examples, we will use T_x^(i) and T_y^(i) to denote the lengths of the ith input and output, respectively.

To represent words, we will need to use a vocabulary list of all the possible words that are likely to occur. For modern applications, this could be around 50,000 words in total. We will keep the vocabulary list sorted in ascending order. Then, to store x^<t>, we can use one-hot encoding, where we have a 1 in the index of the vocabulary list that stores the word we are looking at, and a 0 everywhere else. We will talk about unknown words, or words that are not in our vocabulary list, soon.

### 7.1 Recurrent neural networks

The problem with using a standard neural network is that the input and output lengths can differ across samples of our data. If we tried to fix a maximum input length, with 0 for each element that is not needed, then we would not only greatly diminish the number of problems we can work on, but we would also be using a lot of unnecessary memory in the network. A plain neural network also does not take advantage of shared features in sequential data, similar to how convolutional neural networks take advantage of using the same filters over different regions of an image.

With recurrent neural networks, we build a model that takes as input the current time step t and all of the previous time steps in order to compute the output y^<t>. At each time step, the model is parameterized by shared weights W_ax, W_aa, and W_ya, as shown in Figure 7.1.1. A big problem with this type of representation is that we are only looking at the previous time steps. Suppose we have the following sentences as input:

He said, "Teddy Roosevelt was a great President."
He said, "Teddy bears are on sale!" (7.1.1)

In both of these cases, the first three words are the same; however, only in the first case does the word "Teddy" represent a name. We will delve into this topic soon when we look at bidirectional RNNs (BRNNs); however, to understand the basics of RNNs we will stick with our unidirectional RNN.

For now, we will set a^<0> = 0, which is typical in many cases. Now, to represent the first activation and output in the recurrent neural network pictured in Figure 7.1.1, we set

a^<1> = g(W_aa a^<0> + W_ax x^<1> + b_a) (7.1.2)
ŷ^<1> = g(W_ya a^<1> + b_y). (7.1.3)

The activation functions in Equation 7.1.2 and Equation 7.1.3 do not have to be the same. For the hidden activation, tanh and ReLU are commonly used, while sigmoid or softmax are typically used for the output activation. To generalize the previous equations, we have

a^<t> = g(W_aa a^<t−1> + W_ax x^<t> + b_a) (7.1.4)
ŷ^<t> = g(W_ya a^<t> + b_y). (7.1.5)
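A minimal NumPy sketch of one forward step of Equations 7.1.4 and 7.1.5, assuming tanh for the hidden activation and a softmax output (both named in the text as common choices); the layer sizes are hypothetical:

```python
import numpy as np

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    """One time step: a^<t> = tanh(Waa a^<t-1> + Wax x^<t> + ba),
    y_hat^<t> = softmax(Wya a^<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    logits = Wya @ a_t + by
    y_hat = np.exp(logits - logits.max())
    y_hat = y_hat / y_hat.sum()
    return a_t, y_hat

# Hypothetical sizes: 10-dimensional one-hot inputs, 16 hidden units.
rng = np.random.default_rng(0)
n_x, n_a, n_y = 10, 16, 10
Waa = rng.standard_normal((n_a, n_a)) * 0.01
Wax = rng.standard_normal((n_a, n_x)) * 0.01
Wya = rng.standard_normal((n_y, n_a)) * 0.01
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                     # a^<0> = 0
for t in range(3):                         # run a few time steps
    x_t = np.zeros((n_x, 1)); x_t[t] = 1   # dummy one-hot input
    a, y_hat = rnn_step(a, x_t, Waa, Wax, Wya, ba, by)
print(y_hat.sum())                         # softmax output sums to 1
```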
We will further simplify Equation 7.1.4 to be

a^<t> = g(W_a [a^<t−1>, x^<t>] + b_a), (7.1.6)

where W_a is formed by stacking

W_a = [W_aa | W_ax] (7.1.7)

into a single matrix. We will also use the notation

[a^<t−1>, x^<t>] = (a^<t−1>; x^<t>), (7.1.8)

meaning the two vectors are stacked into a single column. We will also simplify Equation 7.1.5 to be

ŷ^<t> = g(W_y a^<t> + b_y). (7.1.9)

#### 7.1.1 Backpropagation through time

We will define our loss at time step t as our familiar cross-entropy, or logistic, loss

L^<t>(ŷ^<t>, y^<t>) = −y^<t> log ŷ^<t> − (1 − y^<t>) log(1 − ŷ^<t>). (7.1.10)

For the entire sequence, we will compute the loss as

L(ŷ, y) = Σ_{t=1}^{T_y} L^<t>(ŷ^<t>, y^<t>). (7.1.11)

Now, we can define the computation graph pictured in Figure 7.1.2. When backpropagating through this network, we say that we are backpropagating through time, because we are going backwards through the time series.

Figure 7.1.1: The structure of a basic recurrent neural network predicts ŷ^<t> based on all of the previous time steps, along with the current time step. The parameters of the network are pictured in red.

Figure 7.1.2: Computation graph for our recurrent neural network.

#### 7.1.2 Variational RNNs

RNNs can be broken into different categories, as shown in Figure 7.1.3. What we have currently been working with is a many-to-many model. If we were doing sentiment classification, where we input a review sentence and output the number of stars, we would use a many-to-one model. A one-to-many model may be used for music generation, where we could input some metadata (e.g. genre, style) and output a sequence of music. Many-to-many models may also cover cases where the number of inputs and the number of outputs differ, such as machine translation.

#### 7.1.3 Language modeling

Suppose we would like to build a speech recognition system that inputs an audio clip of speech and outputs the spoken words. Looking only at the pronunciation of each word is not enough to determine the correct word that was spoken. For instance, "pair" and "pear"; "two", "to", and "too"; and "their", "they're", and "there" are each pronounced identically, but they are different words altogether.

Figure 7.1.3: Depending on the combinations of fixed or variable length input and output, we can form different types of RNNs.

A language model may look at a set of sentences that sound identical and choose the sentence that is most likely to have occurred. For example, suppose we have the two equivalently spoken sentences

The apple and pair salad.
The apple and pear salad. (7.1.12)

Our language model may output something in the form of

P(The apple and pair salad.) = 3.2 × 10^−13
P(The apple and pear salad.) = 5.7 × 10^−10, (7.1.13)
thus, the second sentence would be selected because it is more probable.

To create a language model with an RNN, we need a training set that consists of a large corpus of English text. The word corpus comes from the field of natural language processing and refers to a set of text. For each sentence we are given as input, we will tokenize the output using one-hot encoding. It is also common to add an end-of-sentence token. If we are given a sentence like

The Egyptian Mau is a breed of cat. (7.1.14)

and do not have the word "Mau" in our vocabulary list, then we will put a 1 in the unknown-word entry of the encoding, while making everything else a 0.

Suppose we have a vocabulary list with 10k entries (excluding punctuation and including spots for unknown words) and are working with the following sentence:

x: Well, hello there! (7.1.15)

With an RNN model that is trying to predict the next word, the inputs will be the ordered list of words that came before the current time step t. For the first time step, the input will be the 0 vector, and the first prediction ŷ^<1> will be a 10k-way softmax classification, where each word in the vocabulary list is given a probability of occurring; that is,

ŷ^<1> = P(<each word>) = (P(<word one>), ..., P(<word 10k>))^T. (7.1.16)

In this case, we have y^<1> as a one-hot encoding of the word "well". For the second time step, our input will be y^<1> = "well". The prediction ŷ^<2> will be a conditional probability of the second word, given the first word; that is,

ŷ^<2> = P(<each word> | y^<1> = "well"). (7.1.17)

For the third time step, the input will be y^<1> = "well", y^<2> = "hello". Following the same pattern, the prediction will be in the form of a 10k-way softmax classifier, where

ŷ^<3> = P(<each word> | y^<1> = "well", y^<2> = "hello"). (7.1.18)

On the backward pass, we will look to maximize the predictions that

ŷ^<1> = well,   ŷ^<2> = hello,   ŷ^<3> = there. (7.1.19)

To predict the probability of an entire sentence, we could calculate

P(y^<1>) · P(y^<2> | y^<1>) · P(y^<3> | y^<1>, y^<2>) ··· P(y^<T> | y^<1>, ..., y^<T−1>). (7.1.20)

#### 7.1.4 Sampling novel sequences

To sample a sequence, we will use a similar RNN model to the one just described. However, we will now change the input at each time step to be all of the previous predictions from the model, as shown in Figure 7.1.4. We will also have the predictions be random draws from the output probability distribution in order to avoid an infinite loop of the same few words. In NumPy, we can use np.random.choice to draw a random word from the output prediction distribution. To end the program, we can wait until the sentence is over (the end-of-sentence token is produced). We could also iterate for a certain number of time steps or run the program for a specific amount of time.

Figure 7.1.4: RNN model for sequence generation. Here, the input to the time step t is the output from all of the previous time steps.
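A minimal sketch of the sampling step described above using np.random.choice; the vocabulary size and probabilities are hypothetical:

```python
import numpy as np

def sample_word(y_hat, rng):
    """Pick the next word index at random from the softmax distribution y_hat,
    instead of always taking the most likely word."""
    probs = y_hat.ravel()
    return rng.choice(len(probs), p=probs)

# Toy distribution over a 5-word vocabulary (hypothetical probabilities).
rng = np.random.default_rng(0)
y_hat = np.array([0.05, 0.60, 0.10, 0.20, 0.05])
samples = [sample_word(y_hat, rng) for _ in range(10)]
print(samples)   # mostly word 1, with occasional other words
```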
Up until this point, we have been focused on using a word-level vocabulary list; however, it is also common to use a character-level vocabulary list, such as all of the ASCII characters. An advantage of character-level models is that we never have to use an unknown token. One of the drawbacks of the character-level approach is that it is worse at capturing long-range dependencies, such as opening and closing quotes, along with being more computationally expensive.

#### 7.1.5 Vanishing gradients with RNNs

When we are working with RNNs, it is not unrealistic to be working with hundreds or thousands of time steps in a given model. However, recall the problem with deep neural networks that have a lot of layers, where we continue to multiply by smaller and smaller numbers as we backpropagate further. We can think of the number of layers in a plain neural network as analogous to the number of time steps in an RNN. With our current RNN approach, it will be much harder to represent long-term dependencies in our data, such as opening and closing quotes.

#### 7.1.6 GRU: Gated recurrent unit

Currently, we have been using RNNs that have a hidden unit of the following form:

a^<t> = tanh(W_a [a^<t−1>, x^<t>] + b_a). (7.1.21)

Figure 7.1.5: Our current approach to RNN calculations, where the gray box represents a hidden black-box calculation.

We can picture this as a black-box calculation at each time step t, as shown in Figure 7.1.5. However, our model is not adept at capturing long-term dependencies in the data. Suppose we are working with the following sentence:

x: The cat, which already ate, was full. (7.1.22)

If we changed the word cat to cats, then we would also need to update was to were. With a GRU, a^<t> will be treated like a memory cell. We will set

ã^<t> = tanh(W_a [a^<t−1>, x^<t>] + b_a) (7.1.23)

as a potential candidate for a^<t>. Next, we will create an update gate

Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u), (7.1.24)

where the majority of values in Γ_u will be near 0 or 1, based on the functionality of the sigmoid function. The update gate ultimately determines whether we will update a^<t>. For example, suppose the model learns that singular subjects are denoted with a^<t> = 1. Then, in our working sentence, we may have something along the lines of:

| t | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| x | The | cat | which | already | ate | was | full |
| a^<t> | − | 1 | 1 | 1 | 1 | 1 | − |

Notice that the update gate continues to denote that our subject is singular at later time steps t. At some future time t, if we run into a plural phrase, then it is the job of Γ_u to update a^<t>. Specifically, our update will be

a^<t> = Γ_u ∗ ã^<t> + (1 − Γ_u) ∗ a^<t−1>, (7.1.25)

where ∗ is an element-wise multiplication between vectors. Notice here that when Γ_u = 1, we update a^<t> to be ã^<t>, such as at time step 2 in the example above. When Γ_u = 0, we keep a^<t> the same, such as at time steps 3 through 6 above. An important note is that ã^<t>, a^<t>, and Γ_u are all vectors that can learn many different types of long-term dependencies. For example, we can learn how to open and close brackets or quotes, and use singular or plural phrases where appropriate. The computation graph for GRUs is shown in Figure 7.1.6.
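A minimal sketch of the simplified GRU step in Equations 7.1.23 through 7.1.25 (without the relevance gate discussed next); the sizes are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(a_prev, x_t, Wa, ba, Wu, bu):
    """Simplified GRU update: candidate memory, update gate, element-wise mix."""
    stacked = np.concatenate([a_prev, x_t])            # [a^<t-1>, x^<t>]
    a_tilde = np.tanh(Wa @ stacked + ba)               # candidate memory cell
    gamma_u = sigmoid(Wu @ stacked + bu)               # update gate, near 0 or 1
    return gamma_u * a_tilde + (1 - gamma_u) * a_prev  # keep or overwrite memory

# Hypothetical sizes: 8 hidden units, 4-dimensional input.
rng = np.random.default_rng(0)
n_a, n_x = 8, 4
Wa = rng.standard_normal((n_a, n_a + n_x))
Wu = rng.standard_normal((n_a, n_a + n_x))
ba, bu = np.zeros((n_a, 1)), np.zeros((n_a, 1))
a = np.zeros((n_a, 1))
x_t = rng.standard_normal((n_x, 1))
a = gru_step_simplified(a, x_t, Wa, ba, Wu, bu)
print(a.shape)   # (8, 1)
```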
There is one other part to the GRU that we have left out so far: determining how relevant the last memory cell a^<t−1> is for the calculation of the next candidate memory cell ã^<t>. Here, we add a relevance gate Γ_r as follows:

ã^<t> = tanh(W_a [Γ_r ∗ a^<t−1>, x^<t>] + b_a) (7.1.26)
Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u) (7.1.27)
Γ_r = σ(W_r [a^<t−1>, x^<t>] + b_r) (7.1.28)
a^<t> = Γ_u ∗ ã^<t> + (1 − Γ_u) ∗ a^<t−1>. (7.1.29)

Figure 7.1.6: Simplified GRU computational graph at a time step t.

#### 7.1.7 LSTM: Long short-term memory

In addition to GRUs, LSTMs are also commonly used to store long-term dependencies in sequential data. Figure 7.1.7 shows an example of an LSTM model. We can think of LSTMs as an extension of GRUs with 2 hidden units, h^1 and h^2. We use 3 separate gates, a forget gate Γ_f, an update gate Γ_u, and an output gate Γ_o, each defined as

Γ_u = σ(W_u · [x_t, h_{t−1}]) (7.1.30)
Γ_f = σ(W_f · [x_t, h_{t−1}]) (7.1.31)
Γ_o = σ(W_o · [x_t, h_{t−1}]). (7.1.32)

With LSTMs, we remove the relevance gate Γ_r because it often does not improve the performance of the model in practice. For our 1st hidden unit h^1, we will set its replacement candidate based on the other hidden unit's previous value h^2_{t−1} and the new input x_t, which we define as

h̃^1_t = σ(W_c · [x_t, h^2_{t−1}]). (7.1.33)

Now, instead of using a single update gate that determines both how much of the previous value we want to forget and how much of the replacement candidate we want to remember, we use 2 separate gates to define these quantities. In particular, we define our 1st hidden unit to be

h^1_t = Γ_u ⊗ h̃^1_t + Γ_f ⊗ h^1_{t−1}. (7.1.34)

Our 2nd hidden unit depends on h^1_t and is passed into the softmax function to produce the output y_t; we define it as

h^2_t = Γ_o ⊗ tanh(h^1_t). (7.1.35)

Here, the output gate determines the relevance of each hidden unit for the output. The blue path shown in Figure 7.1.7b connects the 1st hidden unit between multiple LSTM gates. This path gives an intuitive feel for why LSTMs capture long-range dependencies, which is primarily due to having multiple gates control what we store in the internal memory, making it harder to forget everything completely.

Figure 7.1.7: LSTM model. (a) A single LSTM gate. (b) Stacking multiple LSTM gates in a row.

#### 7.1.8 BRNNs: Bidirectional RNNs

Recall the 2 sentences we used earlier:

He said, "Teddy bears are on sale!"
He said, "Teddy Roosevelt was a great President!" (7.1.36)

It is not enough to predict whether the word Teddy is a name based only on the information before the word occurs. In this case, we can use a bidirectional RNN (BRNN) to make predictions using the words both before and after Teddy. Figure 7.1.8 shows the diagram for a BRNN with a fixed input length of 4. Our BRNN is broken up into 2 separate forward RNNs: one runs over the sequence in order, the other over the sequence in reverse order. Here, we have a forward RNN with activations of the form −→a^<t> that takes into account all of the previous time step inputs.
We also have another RNN with activations of the form ←−a^<t> that looks at all the future time steps in reverse order. For example, if our sentence is

He said, "Teddy Roosevelt!", (7.1.37)

then at t = 3 we can form −→a^<3> from {He, said, Teddy} and we can form ←−a^<3> from {Roosevelt}. We can then make an output prediction, such as the meaning of the word Teddy, in the form

ŷ^<t> = g(W · [−→a^<3>, ←−a^<3>]). (7.1.38)

One of the downsides of using a BRNN is that we are required to supply the entire input at the start.

Figure 7.1.8: BRNN with an input of length 4. (a) Forward and backward connections. (b) Forward RNN connection with the sequence in order. (c) Forward RNN connection, where the sequence is in reverse order.

#### 7.1.9 Deep RNNs

Figure 7.1.9 shows the 2 primary types of deep RNNs, where we stack multiple activations at each time step. The 1st type of network, in Figure 7.1.9a, shows 3 stacked activations at each time step that are each connected to the next time step. Here, each layer has its own weights, which are shared across time steps. For example, we can calculate each hidden activation in layer ℓ at the tth time step as

a^[ℓ]<t> = g(W^[ℓ] · [a^[ℓ]<t−1>, a^[ℓ−1]<t>]), (7.1.39)

which takes the activations to the left and below as the input. The 2nd type of network, in Figure 7.1.9b, shows the first few layers connecting to future time steps, while the deeper layers are only connected within a single time step. Here, the deeper layers are not connected to future time steps because connecting them would be computationally expensive.

Figure 7.1.9: Deep RNNs. (a) Stacking together multiple layers of activations that are each connected to the next and previous time steps. (b) Deep RNN where the first few layers have activations that connect to future time steps, while the deeper layers only carry weight within a single time step.

### 7.2 Word embedding

### 7.3 Word representation

Up until this point, we have represented words with a dictionary, where we specify a specific word using a one-hot encoding vector. One of the problems with this representation is that
the words do not have a connection with each other. For example, suppose we give the input

I want a glass of orange ___ (7.3.1)

to our network, where it is the goal of the network to predict the blank. With a well-trained network, the likely prediction is the word juice. But suppose we now have the input

I want a glass of apple ___ (7.3.2)

and our model does not know about apple juice. In this situation, if our model knows that there is some connection between orange and apple, then it would still likely predict juice.

One approach to learning connections is to discover feature representations for each word. Table 7.1 shows an example of a potential encoding that our model may find. The table shows the words king, queen, apple, and orange, along with their responses to a set of features. For example, if we have a royal feature, then king and queen might have a high response; alternatively, if we have a food feature, then apple and orange might have a high response. Here, if we have 300 features, then we can represent each word with a vector of length 300. Now, due to the similarities between features, our network is more likely to pick up on similar words.

| | King | Queen | Apple | Orange |
| --- | --- | --- | --- | --- |
| Royal | 0.93 | 0.95 | -0.01 | 0.00 |
| Age | 0.7 | 0.71 | 0.03 | -0.02 |
| Food | 0.01 | -0.03 | 0.95 | 0.97 |

Table 7.1: Sample words in a dictionary (top) with their responses to each feature (left), each ranging over [−1, 1].

#### 7.3.1 Using word embeddings

To learn the features for a set of words, it is common to use transfer learning. There are currently many freely available text corpora with word embeddings trained on 1B to 100B words. Once we have word embeddings, we can transfer them to a smaller task and fine-tune the parameters if necessary.

#### 7.3.2 Properties of word embeddings

Once we have the word embeddings, one neat thing we can compute is the difference between 2 words. Suppose, for example, that one of our features gives the value 1 for words describing objects with an orange color, −1 for words describing a colored object that is not orange, and 0 for objects that are not defined by a color. Table 7.2 shows how we can spot the feature differences between the apple and orange vectors.
In particular, if we subtract the 2 vectors, then the features that are largest in magnitude correspond to the feature differences between the words, while differences near 0 correspond to features that are the same.

| | Apple | Orange | Apple − Orange (approximate) |
| --- | --- | --- | --- |
| Orange colored | −0.96 | 0.98 | −2 |
| Royal | -0.01 | 0.00 | 0 |
| Age | 0.03 | -0.02 | 0 |
| Food | 0.95 | 0.97 | 0 |

Table 7.2: To compare words, we can take the difference between their feature vectors; the features with the largest difference in magnitude are the most different, while feature differences near 0 indicate features that are quite similar.
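A minimal sketch of this comparison, reusing the feature values from Tables 7.1 and 7.2; the cosine-similarity measure is a common add-on for judging word similarity, not something defined in the notes:

```python
import numpy as np

# 4-feature embeddings in the spirit of Tables 7.1 and 7.2:
# features are (orange colored, royal, age, food).
embeddings = {
    "apple":  np.array([-0.96, -0.01, 0.03, 0.95]),
    "orange": np.array([ 0.98,  0.00, -0.02, 0.97]),
    "king":   np.array([ 0.00,  0.93, 0.70, 0.01]),
    "queen":  np.array([ 0.00,  0.95, 0.71, -0.03]),
}

diff = embeddings["apple"] - embeddings["orange"]
print(diff.round(2))   # large first entry: the color feature differs; the rest are near 0

def cosine_similarity(u, v):
    """Common way to measure how similar two word vectors are."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # close to 1
print(cosine_similarity(embeddings["king"], embeddings["orange"]))  # much smaller
```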