Aman's AI Journal • Natural Language Processing • Machine Translation

Overview
Neural Machine Translation
Seq2Seq
Training NMT
Multi layer RNNs
Beam search decoding
Evaluation for Machine Translation
Low Resource Machine Translation
Citation

Overview

Machine Translation: the task of translating a sentence \(x\) from one language (the source language) into another (the target language).
Before neural machine translation, early systems from around the time of the Cold War approached translation much like code breaking.
- 1990s-2010s: Statistical Machine Translation
- Learn a probabilistic model from data
- Requires a large amount of parallel data: text translated by humans between different languages

Neural Machine Translation

This is a way to do machine translation with a single end-to-end neural network.
How does it work?
- Feed in the source sentence
- Encode the source sentence
- Output the translation
- Train the network on a large amount of parallel translations

Seq2Seq

Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016.
A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words.
The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks.
You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. Illustrations of this idea often use a vector of size 4, but in real-world applications the context vector would have a size like 256, 512, or 1024.
A Seq2Seq model therefore requires two neural networks (RNNs): an encoder and a decoder.
Beyond translation, Seq2Seq is used for:
- Summarization
- Dialogue
- Parsing
- Code generation
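
The encoder and its context vector can be illustrated with a minimal sketch in plain Python. It assumes a vanilla tanh RNN with randomly initialized weights and made-up word embeddings; the names `rnn_step`, `encode`, `embed_size`, and `hidden_size` are illustrative and not taken from any library. The point is only that the context is the encoder's last hidden state, so its size equals the number of hidden units.

```python
import math
import random

def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla RNN step: h_new = tanh(W_xh x + W_hh h + b)."""
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][j] * h[j] for j in range(len(h)))
            + b[i]
        )
        for i in range(len(h))
    ]

def encode(embedded_words, W_xh, W_hh, b):
    """Run the encoder over the whole input; the last hidden state is the context."""
    h = [0.0] * len(b)
    for x in embedded_words:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

rng = random.Random(0)
embed_size, hidden_size = 3, 4  # hidden_size sets the context vector size

def rand_matrix(rows, cols):
    """Random weights for illustration; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_xh = rand_matrix(hidden_size, embed_size)
W_hh = rand_matrix(hidden_size, hidden_size)
b = [0.0] * hidden_size

# Three made-up word embeddings standing in for a three-word source sentence.
source = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(3)]
context = encode(source, W_xh, W_hh, b)
print(len(context))  # 4
```

In a full Seq2Seq model the decoder RNN would start from `context` as its initial hidden state and emit one output word per step.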

Training NMT

Get a large parallel corpus.
Source sentences are encoded in batches, and the encoder's final hidden state is fed to the target (decoder) LSTM.
The decoder RNN produces the target sentence one word at a time.
At each position, compare the prediction with the correct word and take a loss: the negative log probability of the correct word.
The loss gives us the information to backpropagate through the entire network.
Seq2Seq is optimized as a single system, so you can update all parameters of the encoder and decoder together.
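
The per-word loss described in this section can be sketched as follows. This is a minimal illustration, assuming the decoder emits one vector of unnormalized scores (logits) per target position; the helper names `softmax` and `sentence_loss` and the toy numbers are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_loss(step_logits, target_ids):
    """Average negative log probability of the reference words."""
    losses = []
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical decoder scores over a 3-word vocabulary for a 2-word target.
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
target_ids = [0, 2]
loss = sentence_loss(step_logits, target_ids)
print(loss)
```

The loss shrinks as the decoder puts more probability on the reference words, and its gradient is what flows back through both decoder and encoder.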

Multi layer RNNs

By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state.
Stacking several RNN layers allows the network to compute more complex representations.
Lower RNNs should compute lower-level features and higher RNNs should compute higher-level features.
Lower-level features: more basic things about words, such as their part of speech or whether they name a person or a company.
Higher-level features: the overall structure of the sentence, positive or negative connotation, semantic meaning.
The decoder's input begins with a <START> token, and generation stops when it produces an <END> token.
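
Stacking can be sketched in plain Python as below. This assumes vanilla tanh RNN layers with random weights, and the names `rnn_layer`, `stacked_rnn`, and `make_layer` are illustrative only. The key idea is that the hidden states of one layer become the input sequence of the layer above.

```python
import math
import random

def rnn_layer(inputs, W_xh, W_hh, b):
    """Run one RNN layer over a sequence and return the hidden state at every step."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * v for w, v in zip(W_xh[i], x))
                + sum(w * v for w, v in zip(W_hh[i], h))
                + b[i]
            )
            for i in range(len(b))
        ]
        states.append(h)
    return states

def stacked_rnn(inputs, layers):
    """Feed each layer's hidden states to the next layer as its input sequence."""
    sequence = inputs
    for W_xh, W_hh, b in layers:
        sequence = rnn_layer(sequence, W_xh, W_hh, b)
    return sequence  # hidden states of the top layer

def make_layer(rng, in_size, hidden_size):
    """Randomly initialized weights; a trained model would learn these."""
    W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(in_size)] for _ in range(hidden_size)]
    W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    return W_xh, W_hh, [0.0] * hidden_size

rng = random.Random(0)
embed_size, hidden_size, num_layers = 3, 4, 2
# The first layer reads word embeddings; every later layer reads the layer below.
layers = [make_layer(rng, embed_size if i == 0 else hidden_size, hidden_size)
          for i in range(num_layers)]
sentence = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(5)]
top_states = stacked_rnn(sentence, layers)
print(len(top_states), len(top_states[0]))  # 5 4
```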

Beam search decoding

On each step of the decoder, keep track of the k most probable partial translations (each of which is called a hypothesis).
k is the beam size (in practice around 5 to 10).
Beam search is used in more than just NMT.
Each hypothesis has a score, which is the log probability of what we’ve seen so far.
Beam search is not guaranteed to find the optimal solution.
Longer hypotheses have lower scores, because every extra word adds another negative log probability:
- Need to normalize scores by length
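
A minimal sketch of beam search with length normalization follows. The model here is a hypothetical lookup table (`TABLE`) keyed by the previous word, standing in for a real decoder; `beam_search` and `toy_model` are illustrative names, not a library API.

```python
import math

def beam_search(next_log_probs, k=2, max_len=5, end="<END>"):
    """Return hypotheses ranked by length-normalized log probability."""
    beams = [(["<START>"], 0.0)]  # (tokens, summed log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the k most probable partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached <END> in time.
    pool = finished or beams
    # Normalize by the number of generated tokens (excluding <START>).
    return sorted(pool, key=lambda c: c[1] / (len(c[0]) - 1), reverse=True)

# Hypothetical next-word probabilities, conditioned only on the previous word.
TABLE = {
    "<START>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"<END>": 1.0},
    "dog": {"<END>": 1.0},
}

def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}

best_tokens, best_score = beam_search(toy_model, k=2)[0]
greedy_tokens, greedy_score = beam_search(toy_model, k=1)[0]
print(best_tokens)    # ['<START>', 'a', 'cat', '<END>']
print(greedy_tokens)  # ['<START>', 'the', 'cat', '<END>']
```

With k=1 (greedy decoding) the search commits to "the" and ends with probability 0.3, while k=2 keeps "a" alive and finds the more probable "a cat" (0.36). A larger beam still gives no guarantee of finding the best translation overall.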

Evaluation for Machine Translation

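Get a human translator to judge how good a translation is.
Scoring translations automatically: BLEU (Bilingual Evaluation Understudy)
- You compare the machine-written translation to one or several human-written translations and compute a similarity score based on n-gram precision, with a penalty for translations that are too short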
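
The similarity score can be illustrated with a simplified, sentence-level version of BLEU. This is a sketch under stated assumptions: it uses whitespace tokens and n-grams up to `max_n=2`, whereas BLEU as usually reported uses n-grams up to 4 and is computed over a whole corpus. `simple_bleu` and `ngrams` are hypothetical helper names.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, references, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference whose length is closest to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
perfect = simple_bleu("the cat is on the mat".split(), [reference])
repeated = simple_bleu("the the the the the the the".split(), [reference], max_n=1)
print(perfect)   # 1.0
print(repeated)  # 2/7: "the" is credited at most twice, as in the reference
```

Clipping is what stops a candidate from scoring well by repeating a common word, and the brevity penalty stops very short candidates from scoring well on precision alone.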

Low Resource Machine Translation

Start from a parallel data set.
Minimize the cross-entropy loss.
Equivalently, maximize the log probability of the reference human translation given the source sentence
- Via stochastic gradient descent
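
Written out, this objective is commonly expressed as follows. The notation is assumed here rather than taken from the notes: \(D\) is the parallel data set, \(y\) is the reference translation of source sentence \(x\), \(y_t\) is its \(t\)-th word, and \(\theta\) are the model parameters.

\[
\mathcal{L}(\theta) = -\sum_{(x, y) \in D} \sum_{t=1}^{|y|} \log P_\theta(y_t \mid y_{<t}, x)
\]

Minimizing \(\mathcal{L}(\theta)\) is the same as maximizing the log probability of every reference translation given its source sentence.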
This is supervised learning because a parallel data set is available.
Algorithms for when parallel data is scarce:
- Phrase-based and neural unsupervised machine translation
- Back translation (a form of data augmentation)
The amount of noise is a hyperparameter.
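
Back translation as data augmentation can be sketched as below, assuming English is the source language and French the target. The reverse model is a hypothetical word-for-word dictionary (`FR_TO_EN`) standing in for a trained target-to-source translation system, and `back_translate` is an illustrative name.

```python
def back_translate(target_sentences, target_to_source):
    """Pair each real target sentence with a synthetic source from a reverse model."""
    return [(target_to_source(t), t) for t in target_sentences]

# Hypothetical French-to-English "model"; a real setup would use a trained reverse system.
FR_TO_EN = {"le": "the", "chat": "cat", "dort": "sleeps", "chien": "dog"}

def toy_reverse_model(sentence):
    return " ".join(FR_TO_EN.get(word, word) for word in sentence.split())

parallel = [("the cat sleeps", "le chat dort")]  # small real (source, target) set
monolingual_target = ["le chien dort"]           # target-language text with no translation

synthetic = back_translate(monolingual_target, toy_reverse_model)
augmented = parallel + synthetic
print(augmented)
```

The forward (English-to-French) model is then trained on `augmented`. The target side of every synthetic pair is genuine human text, so the decoder still learns from real target-language sentences even though the source side is machine-generated.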

Citation

If you found our work useful, please cite it as:

@article{Chadha2021Distilled,
  title   = {Machine Translation},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{https://aman.ai}}
}