• Machine Translation: task of translating a sentence \(x\) from one language to another.
  • Before we have neural machine translation, around the time of the Cold War, we had code breaking.
    • 1920-2010: Statistical Machine Translation
    • Learn a probabilistic model from data
    • Large amount of parallel data human translated between difference languages

Neural Machine Translation

  • This is a way to do Machine Translation with single end to end neural network.
  • How does it work?
    • Feed source sentence
    • Output translation
    • Feed a lot of parallel translation
    • Encode source sentence


  • Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016.
  • A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an images…etc) and outputs another sequence of items
  • In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words
  • The encoder processes each item in the input sequence, it compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
  • The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks
  • You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real world applications the context vector would be of a size like 256, 512, or 1024.
  • requires 2 Neural networks (RNN)
  • Seq2Seq uses:
    • Summarization
    • Dialogue
    • Parsing
    • Code generation

Training NMT

  • Get a large parallel corpus
  • Source sentence:batches will be encoded, feed final hidden state to target LSTM
  • Compare word by word if sentence was correct otherwise take a loss (negative log probability)
  • Loss gives us information to back prop through entire network
  • Seq2Seq is optimized as a single system so you can update all parameters of decoder and encoder model
  • Target sentences from decoder RNN

Multi layer RNNs

  • By design, a RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state
  • Allows network to compute more complex representations
  • Lower RNNs should computer lower level features and higher RNNs should compute higher level features
  • Lower features: more basic things about words like what part of speech, are these words a name or a company
  • Higher features: overall structure of sentence, positive or negative connotation, semantic meaning
  • Has a <START> and <END> token

Beam search decoding

  • On each step of decoder, keep track of k most probable partial translations(which is called a hypothesis)
  • K is the beam size(5-10)
  • Used in more than just NMT
  • Hypothesis has a score which is the log probability of what we’ve seen so far
  • Not guaranteed to find optimal solution
  • Longer hypotheses have lower scores:
    • Need to use normalization by length

Evaluation for Machine Translation

  • Get a translator to judge how good of a translation is
  • Scoring translation: BLEU
    • You compare machine written translation to one or several human written translation and compute a similarity score

Low Resource Machine Translation

  • Parallel data set
  • Minimize cross entropy loss
  • Maximize log probability of the reference human translation given source sentence
    • Via stochastic gradient descent
  • Supervised learning because parallel dataset available
  • Algorithms:
    • Phrase-based and Neural Unsup Machine Translation
    • Back Translation (data augmentation?)
  • Hyperparameter = noise


If you found our work useful, please cite it as:

  title   = {Machine Translation},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{https://aman.ai}}