Recurrent Neural Networks
- Vanilla neural networks (i.e., feed-forward neural networks): an input of fixed size passes through some hidden units and then to an output. We call this a one-to-one network.
- Recurrent Neural Network (RNN) models:
  - One-to-many
    - Example: Image captioning (image ==> sequence of words)
  - Many-to-one
    - Example: Sentiment classification (sequence of words ==> sentiment)
  - Many-to-many
    - Example: Machine translation (sequence of words in one language ==> sequence of words in another language)
    - Example: Video classification at the frame level
- RNNs can also work on non-sequence data (one-to-one problems):
  - Digit classification by taking a series of "glimpses" at the image ("Multiple Object Recognition with Visual Attention", ICLR 2015).
  - Generating images one piece at a time (e.g., generating a captcha).
- So what is a recurrent neural network?
  - An RNN has a recurrent core cell that takes an input x; the cell has an internal state that is updated each time it reads an input.
- The RNN block should return a vector.
- We can process a sequence of vectors x by applying a recurrence formula at every time step:
  - h[t] = fw(h[t-1], x[t])   # where fw is some function with parameters W
  - The same function and the same set of parameters are used at every time step.
- (Vanilla) Recurrent Neural Network:
  - h[t] = tanh(W[hh]*h[t-1] + W[xh]*x[t])   # h[t] is saved and fed into the next step
  - y[t] = W[hy]*h[t]
  - This is the simplest example of an RNN.
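  - A minimal NumPy sketch (not the lecture's code; sizes and names are illustrative) of one vanilla RNN step implementing the two formulas above:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why):
    """One vanilla RNN step: update the hidden state, then read out the output."""
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)   # h[t] = tanh(W[hh]*h[t-1] + W[xh]*x[t])
    y_t = Why @ h_t                           # y[t] = W[hy]*h[t]
    return h_t, y_t

# Illustrative sizes: input dim 10, hidden dim 4, output dim 10.
rng = np.random.default_rng(0)
Wxh = 0.01 * rng.standard_normal((4, 10))
Whh = 0.01 * rng.standard_normal((4, 4))
Why = 0.01 * rng.standard_normal((10, 4))

h = np.zeros(4)            # h0 is initialized to zero
x = np.eye(10)[3]          # a one-hot input vector
h, y = rnn_step(x, h, Wxh, Whh, Why)
```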
- An RNN works on a sequence of related data.
- RNN computational graph:
  - h0 is initialized to zero.
  - The gradient of W is the sum of all the per-step W gradients, since the same W is reused at every time step (see the sketch after this list).
  - A many-to-many graph:
    - The total loss is the sum of the losses at every time step, and the shared output weights are likewise updated by summing their gradients across all steps.
  - A many-to-one graph:
  - A one-to-many graph:
  - A sequence-to-sequence graph:
    - Encoder and decoder philosophy.
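  - A sketch (illustrative names, not the lecture's code; biases are added, as is common in practice) of unrolling a vanilla RNN over a whole sequence: the same weights are reused at every step, the total loss is the sum of the per-step losses, and each shared weight's gradient is accumulated (summed) over all time steps. The final hidden state is also returned so a later sketch can carry it forward:

```python
import numpy as np

def unrolled_rnn_loss_and_grads(inputs, targets, h0, Wxh, Whh, Why, bh, by):
    """Unroll a vanilla RNN over a whole sequence (many-to-many classification).
    Returns the summed loss, the gradients, and the final hidden state."""
    hs, ps = {-1: h0}, {}
    loss = 0.0
    # Forward through the entire sequence, reusing the same weights at every step.
    for t, x in enumerate(inputs):
        hs[t] = np.tanh(Wxh @ x + Whh @ hs[t - 1] + bh)
        logits = Why @ hs[t] + by
        p = np.exp(logits - logits.max())
        ps[t] = p / p.sum()                         # softmax over the vocabulary
        loss += -np.log(ps[t][targets[t]])          # total loss = sum of per-step losses
    # Backward through the entire sequence; gradients of the shared weights are summed.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                         # gradient of softmax cross-entropy
        dWhy += np.outer(dy, hs[t]); dby += dy      # accumulate (sum) over time steps
        dh = Why.T @ dy + dh_next
        draw = (1.0 - hs[t] ** 2) * dh              # backprop through tanh
        dWxh += np.outer(draw, inputs[t]); dWhh += np.outer(draw, hs[t - 1]); dbh += draw
        dh_next = Whh.T @ draw
    return loss, (dWxh, dWhh, dWhy, dbh, dby), hs[len(inputs) - 1]
```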
- Examples:
  - Suppose we are building words from characters, and we want a model to predict the next character of a sequence. Say the characters are only [h, e, l, o] and the training word is "hello".
    - Training:
      - In the lecture figure, only the third prediction is correct; the loss needs to be optimized.
      - We can train the network by feeding in the whole word(s).
    - Testing time:
      - At test time we work character by character: the output character becomes the next input, together with the saved hidden activations (see the sampling sketch below).
    - This link contains all the code, but it uses truncated backpropagation through time, as we will discuss.
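    - A test-time sampling sketch for the [h, e, l, o] example (illustrative, not the linked code): each sampled character is fed back as the next input, and the hidden state is carried along:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def sample_chars(seed_char, h, n_chars, Wxh, Whh, Why, bh, by):
    """Generate n_chars characters one at a time, feeding each output back as input."""
    rng = np.random.default_rng(0)
    x = np.eye(len(vocab))[char_to_ix[seed_char]]    # one-hot encoding of the seed
    out = []
    for _ in range(n_chars):
        h = np.tanh(Wxh @ x + Whh @ h + bh)          # carry the hidden state forward
        logits = Why @ h + by
        p = np.exp(logits - logits.max()); p /= p.sum()
        ix = rng.choice(len(vocab), p=p)             # sample the next character
        out.append(vocab[ix])
        x = np.eye(len(vocab))[ix]                   # the sampled output is the next input
    return ''.join(out)
```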
- Backpropagation through time: go forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
  - But if we use the whole sequence, training is very slow, takes a lot of memory, and may never converge!
- So in practice people use "truncated backpropagation through time": run forward and backward through chunks of the sequence instead of the whole sequence.
  - Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
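  - A sketch of truncated BPTT, reusing the hypothetical `unrolled_rnn_loss_and_grads` from the earlier sketch: gradients only flow within each chunk, while the hidden state keeps flowing across chunk boundaries:

```python
import numpy as np

def truncated_bptt_pass(inputs, targets, params, hidden_size, chunk_len=25, lr=1e-1):
    """One pass over a long sequence with truncated BPTT: forward/backward over
    fixed-size chunks, updating after each chunk but carrying the hidden state on."""
    Wxh, Whh, Why, bh, by = params
    h = np.zeros(hidden_size)                        # hidden state carried across chunks
    total_loss = 0.0
    for start in range(0, len(inputs), chunk_len):
        chunk_x = inputs[start:start + chunk_len]
        chunk_y = targets[start:start + chunk_len]
        # Gradients only flow within this chunk (much cheaper than full-sequence BPTT)...
        loss, grads, h = unrolled_rnn_loss_and_grads(chunk_x, chunk_y, h,
                                                     Wxh, Whh, Why, bh, by)
        total_loss += loss
        for p, g in zip(params, grads):
            p -= lr * g                              # plain SGD step, for illustration only
        # ...while h keeps carrying information forward across chunk boundaries.
    return total_loss
```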
- Example: image captioning:
  - They use a special end token (e.g., <END>) to finish running; sampling stops once it is produced.
  - The biggest dataset for image captioning is Microsoft COCO.
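  - A sketch of the test-time captioning loop (illustrative names such as `Wih` and `end_id`; the image feature is injected as an extra additive term in the recurrence, one common variant): generation stops at the end token or a maximum length:

```python
import numpy as np

def caption_image(img_feat, word_embed, Wxh, Whh, Wih, Why, bh, by,
                  start_id, end_id, max_len=20):
    """Generate a caption word by word, stopping at the special end token."""
    rng = np.random.default_rng(0)
    h = np.zeros(Whh.shape[0])
    word, caption = start_id, []
    for _ in range(max_len):
        x = word_embed[word]                                  # embedding of the previous word
        h = np.tanh(Wxh @ x + Whh @ h + Wih @ img_feat + bh)  # image feature enters the recurrence
        logits = Why @ h + by
        p = np.exp(logits - logits.max()); p /= p.sum()
        word = rng.choice(len(p), p=p)                        # sample the next word
        if word == end_id:                                    # the end token terminates the caption
            break
        caption.append(word)
    return caption                                            # list of word indices
```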
- Image captioning with attention: while the RNN generates the caption, it looks at a specific part of the image rather than the whole image.
  - The same attention technique is also used for the "Visual Question Answering" problem.
- Multilayer RNNs stack several recurrent layers, so that the hidden states of one layer are fed as inputs to the next. LSTM cells are commonly stacked this way.
- The backward flow of gradients in an RNN can explode or vanish. Exploding gradients are controlled with gradient clipping; vanishing gradients are addressed with additive interactions (LSTM).
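  - A minimal sketch of gradient clipping by global norm, the usual control for exploding gradients (the threshold of 5.0 is just illustrative):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients whenever their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-6)) for g in grads]
    return grads
```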
- LSTM stands for Long Short-Term Memory. It was designed to help with the vanishing-gradient problem in RNNs.
  - It consists of four gates:
    - f: forget gate, whether to erase the cell
    - i: input gate, whether to write to the cell
    - g: gate gate (?), how much to write to the cell
    - o: output gate, how much of the cell to reveal
  - The LSTM gradients flow easily through the additive cell-state path, much like the identity shortcut in ResNet.
  - The LSTM keeps both long-term and short-term memory as it trains, so it can remember information not just from the previous step but from many steps back (a step-by-step sketch follows this list).
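  - A NumPy sketch of one LSTM step with the four gates above (weight layout and names are illustrative); note the additive cell-state update, which is what eases gradient flow:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H); its row blocks produce i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # all four gates from one matrix multiply
    i = sigmoid(z[0:H])                         # input gate:  whether to write to the cell
    f = sigmoid(z[H:2 * H])                     # forget gate: whether to erase the cell
    o = sigmoid(z[2 * H:3 * H])                 # output gate: how much of the cell to reveal
    g = np.tanh(z[3 * H:4 * H])                 # gate gate:   how much (what) to write
    c_t = f * c_prev + i * g                    # additive cell update -> better gradient flow
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```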
- Highway networks are something between ResNet and LSTM and are still an active area of research.
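  - A sketch of a single highway layer (illustrative names): a transform gate T(x) interpolates between a nonlinear transform H(x) and the untouched input, echoing both the LSTM's gates and ResNet's shortcut:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt):
    """y = T(x) * H(x) + (1 - T(x)) * x, where T is a learned transform gate."""
    Hx = np.tanh(Wh @ x + bh)      # candidate transform H(x)
    T = sigmoid(Wt @ x + bt)       # transform gate T(x) in (0, 1)
    return T * Hx + (1 - T) * x    # where T is small, the input is carried through untouched
```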
- Better/simpler architectures are a hot topic of current research.
- Better understanding (both theoretical and empirical) is needed.
- RNNs are mostly used for problems involving sequences of related inputs, such as NLP and speech recognition.
Citation
If you found our work useful, please cite it as:
@article{Chadha2020RecurrentNeuralNetworks,
title = {Recurrent Neural Networks},
author = {Chadha, Aman},
journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
year = {2020},
note = {\url{https://aman.ai}}
}