Natural Language Processing • Machine Translation
- Overview
- Neural Machine Translation
- Seq2Seq
- Training NMT
- Multi-layer RNNs
- Beam search decoding
- Evaluation for Machine Translation
- Low Resource Machine Translation
- Citation
Overview
- Machine Translation: the task of translating a sentence \(x\) from one language (the source language) into a sentence \(y\) in another language (the target language).
- Before neural machine translation existed, the earliest systems (around the time of the Cold War) grew out of code-breaking work.
- 1990s-2010s: Statistical Machine Translation
- Learn a probabilistic model from data (see the formulation sketched below)
- Requires large amounts of parallel data: human-translated sentence pairs between different languages
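As a rough sketch of the SMT objective (the standard noisy-channel formulation; the notation here is an assumption, not quoted from the notes): given a source sentence \(x\), find the target sentence \(y\) that maximizes \(P(y \mid x)\), which Bayes' rule splits into two models learned from data:

\[
\hat{y} = \operatorname*{argmax}_{y} P(y \mid x) = \operatorname*{argmax}_{y} \, P(x \mid y)\, P(y)
\]

where \(P(x \mid y)\) is the translation model (learned from parallel data) and \(P(y)\) is the language model (learned from monolingual target-language data).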
Neural Machine Translation
- This is a way to do Machine Translation with a single end-to-end neural network.
- How does it work? (See the formulation sketched below.)
- Feed in the source sentence and encode it
- Decode/output the translation word by word
- Train by feeding in a large amount of parallel translations
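Concretely (a standard formulation, assumed here rather than quoted from the notes), the NMT system is a conditional language model: it directly models the probability of the translation \(y\), one word at a time, conditioned on the source sentence \(x\):

\[
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)
\]

The encoder produces a representation of \(x\), and the decoder predicts each next word \(y_t\) given that representation and the words generated so far.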
Seq2Seq
- Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016.
- A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
- In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words
- The encoder processes each item in the input sequence, it compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
- The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks
- You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. Illustrations often show a small context vector (e.g., of size 4), but in real-world applications the context vector would be of a size like 256, 512, or 1024.
- Requires two neural networks (an encoder RNN and a decoder RNN); a minimal sketch appears at the end of this section
- Seq2Seq uses:
- Summarization
- Dialogue
- Parsing
- Code generation
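A minimal PyTorch sketch of the encoder/decoder setup described above (the vocabulary size, hidden size, and class names are illustrative assumptions, not the course's exact model; attention is omitted):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len) of word indices
        _, state = self.rnn(self.embed(src))   # keep only the final (h, c) state
        return state                           # this is the fixed-size "context"

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, state):             # tgt: (batch, tgt_len), state: encoder context
        output, state = self.rnn(self.embed(tgt), state)
        return self.out(output), state         # logits over the target vocabulary

# Usage: encode the source sentence, then decode conditioned on the final encoder state.
encoder, decoder = Encoder(8000, 256), Decoder(8000, 256)
src = torch.randint(0, 8000, (4, 10))          # a toy batch of source sentences
tgt = torch.randint(0, 8000, (4, 12))          # the corresponding target sentences
logits, _ = decoder(tgt, encoder(src))         # logits: (4, 12, 8000)
```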
Training NMT
- Get a large parallel corpus
- Source sentences are encoded in batches; the encoder's final hidden state is fed to the target (decoder) LSTM as its initial state
- Compare the decoder's output to the reference translation word by word; each step incurs a loss (the negative log probability of the correct word)
- The loss gives us the signal to backpropagate through the entire network
- Seq2Seq is optimized as a single system, so all parameters of the encoder and decoder are updated jointly
- Target sentences are produced by the decoder RNN (see the training-step sketch below)
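A sketch of one training step matching this description (teacher forcing, word-level negative log likelihood / cross-entropy, and a single joint update of encoder and decoder; the sizes and random data are placeholders):

```python
import torch
import torch.nn as nn

# Illustrative sizes; in practice src/tgt come from a parallel corpus.
V, H = 8000, 256
embed = nn.Embedding(V, H)
enc = nn.LSTM(H, H, batch_first=True)          # encoder RNN
dec = nn.LSTM(H, H, batch_first=True)          # decoder ("target") RNN
proj = nn.Linear(H, V)                         # hidden state -> vocabulary logits
params = (list(embed.parameters()) + list(enc.parameters())
          + list(dec.parameters()) + list(proj.parameters()))
opt = torch.optim.SGD(params, lr=0.1)

src = torch.randint(0, V, (32, 10))            # batch of source sentences
tgt = torch.randint(0, V, (32, 12))            # target sentences, incl. <START>/<END>

# Encode the source; its final state initializes the decoder LSTM.
_, state = enc(embed(src))
# Teacher forcing: feed the gold target words, predict the next word at each step.
hidden, _ = dec(embed(tgt[:, :-1]), state)
logits = proj(hidden)

# Word-by-word negative log probability of the reference translation.
loss = nn.functional.cross_entropy(logits.reshape(-1, V), tgt[:, 1:].reshape(-1))
loss.backward()                                # gradients flow through decoder AND encoder
opt.step()                                     # one end-to-end update of all parameters
```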
Multi-layer RNNs
- By design, a RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state
- Allows network to compute more complex representations
- Lower RNN layers should compute lower-level features and higher RNN layers should compute higher-level features (see the stacked-LSTM sketch after this list)
- Lower-level features: more basic properties of words, like part of speech, or whether a word is a name or a company
- Higher-level features: the overall structure of the sentence, positive or negative connotation, semantic meaning
- Has a <START> and an <END> token
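In PyTorch, stacking RNN layers like this is just the `num_layers` argument (a small illustration with made-up sizes, not the course's exact configuration):

```python
import torch
import torch.nn as nn

# A 4-layer stacked LSTM: layer 1 reads the word embeddings, and each higher
# layer takes the hidden states of the layer below as its input sequence.
embed = nn.Embedding(8000, 256)
stacked = nn.LSTM(input_size=256, hidden_size=256, num_layers=4, batch_first=True)

src = torch.randint(0, 8000, (4, 10))  # batch of 4 sentences, 10 tokens each
outputs, (h, c) = stacked(embed(src))
print(outputs.shape)  # (4, 10, 256): the TOP layer's hidden state at every time step
print(h.shape)        # (4, 4, 256): (num_layers, batch, hidden), final state of each layer
```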
Beam search decoding
- On each step of the decoder, keep track of the \(k\) most probable partial translations (each is called a hypothesis)
- \(k\) is the beam size (usually 5-10)
- Used in more than just NMT
- Each hypothesis has a score, which is the sum of the log probabilities of the words generated so far
- Not guaranteed to find optimal solution
- Longer hypotheses have lower scores (each additional word adds a negative log probability):
- Need to normalize the score by length when comparing hypotheses (see the sketch below)
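A toy, self-contained sketch of beam search with length normalization (the `next_log_probs` callable and the bigram table are stand-ins for a real decoder, purely for illustration; real implementations also keep decoding until enough finished hypotheses are collected):

```python
def beam_search(next_log_probs, start_token, end_token, k=5, max_len=20):
    """Keep the k highest-scoring partial translations (hypotheses) at each step.

    A hypothesis's score is the sum of log probabilities of its words so far;
    at the end, scores are normalized by length so longer hypotheses are not
    unfairly penalized.
    """
    beams = [([start_token], 0.0)]                 # (tokens, summed log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, logp in next_log_probs(tokens):   # decoder's next-word log probs
                candidates.append((tokens + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:            # keep only the top k
            (finished if tokens[-1] == end_token else beams).append((tokens, score))
        if not beams:
            break
    finished += beams                                   # include unfinished hypotheses
    # Length normalization: divide the score by the number of generated words.
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))

# Toy "decoder": a fixed table of next-word log probabilities (purely illustrative).
table = {"<s>": {"the": -0.4, "a": -1.1}, "the": {"cat": -0.7, "</s>": -1.5},
         "a": {"dog": -0.9, "</s>": -1.2}, "cat": {"</s>": -0.2}, "dog": {"</s>": -0.3}}
tokens, score = beam_search(lambda t: table.get(t[-1], {"</s>": 0.0}).items(),
                            "<s>", "</s>", k=2)
print(tokens, score)
```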
Evaluation for Machine Translation
- Human evaluation: get a translator to judge how good a translation is
- Automatic scoring: BLEU
- You compare the machine-written translation to one or several human-written translations and compute a similarity score (example below)
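For instance, with NLTK's sentence-level BLEU (just an illustration; sentence-level BLEU is noisy, and corpus-level BLEU via tools like sacrebleu is what is usually reported):

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["the", "cat", "sat", "on", "the", "mat"],
              ["there", "is", "a", "cat", "on", "the", "mat"]]   # human translations
hypothesis = ["the", "cat", "is", "on", "the", "mat"]            # machine translation

# BLEU measures n-gram overlap (here up to 4-grams) between the hypothesis and one
# or more references; smoothing avoids zero scores on short sentences.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```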
Low Resource Machine Translation
- A parallel dataset is available
- Minimize the cross-entropy loss
- Equivalently, maximize the log probability of the reference human translation given the source sentence
- Optimize via stochastic gradient descent
- This is supervised learning, since a parallel dataset is available
- Algorithms:
- Phrase-based and Neural Unsupervised Machine Translation
- Back-translation (a data augmentation technique; see the sketch below)
- Noise is treated as a hyperparameter
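A rough sketch of the back-translation idea (the toy dictionary "translator" is a placeholder for a real target-to-source NMT model; only the pipeline shape is the point):

```python
# Toy stand-in for a trained target->source (English->French) model; a real system
# would be an NMT model, but a word lookup keeps this sketch runnable.
toy_en_to_fr = {"the": "le", "dog": "chien", "sleeps": "dort", "cat": "chat"}

def translate_tgt_to_src(sentence: str) -> str:
    return " ".join(toy_en_to_fr.get(w, w) for w in sentence.split())

def back_translate(monolingual_tgt: list[str]) -> list[tuple[str, str]]:
    """Turn monolingual target-language text into synthetic (source, target) pairs."""
    pairs = []
    for tgt_sentence in monolingual_tgt:
        synthetic_src = translate_tgt_to_src(tgt_sentence)    # may be noisy
        pairs.append((synthetic_src, tgt_sentence))           # target side stays clean
    return pairs

# The synthetic pairs are mixed with the small real parallel corpus, and the
# source->target model is then (re)trained on the combined data.
print(back_translate(["the dog sleeps", "the cat sleeps"]))
```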
Citation
If you found our work useful, please cite it as:
@article{Chadha2021Distilled,
title = {Machine Translation},
author = {Jain, Vinija and Chadha, Aman},
journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
year = {2021},
note = {\url{https://aman.ai}}
}