## Overview

• Machine Translation: the task of translating a sentence $$x$$ from one language (the source language) into a sentence $$y$$ in another language (the target language).
• Before neural machine translation, the earliest machine translation work (in the 1950s, around the start of the Cold War) grew out of code breaking.
• 1990s-2010s: Statistical Machine Translation (SMT)
• Learn a probabilistic model from data (see the decomposition sketched after this list)
• Requires large amounts of parallel data: human-translated sentence pairs between the two languages
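
As a sketch of the idea, classic SMT decomposes the problem with Bayes' rule so that the translation model can be learned from the parallel data and the language model from monolingual target-language text:

$$\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} \underbrace{P(x \mid y)}_{\text{translation model}} \; \underbrace{P(y)}_{\text{language model}}$$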

## Neural Machine Translation

• Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network.
• How does it work?
• Feed in the source sentence
• Encode the source sentence into a vector
• Decode/output the translation word by word (see the factorization sketched after this list)
• Train on a lot of parallel translations
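
The single end-to-end network directly models the conditional probability of the target sentence $$y$$ given the source sentence $$x$$, factored one target word at a time:

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$$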

## Seq2Seq

• Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016.
• A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items
• In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words
• The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
• The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks
• You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. Illustrative examples often use a vector of size 4, but in real-world applications the context vector would be of a size like 256, 512, or 1024.
• Requires two neural networks (RNNs): an encoder and a decoder (see the sketch after this list)
• Seq2Seq has uses beyond MT:
• Summarization
• Dialogue
• Parsing
• Code generation
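
A minimal sketch of this encoder-decoder setup in PyTorch; the class name, layer choices (LSTMs), and sizes here are illustrative assumptions, not the exact model from the lecture.

```python
# A sketch of an encoder-decoder (seq2seq) model; names and sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, hidden_size: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden_size)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden_size)
        # Two RNNs: an encoder and a decoder.
        self.encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        # Encode: process every source token, keep only the final hidden state,
        # i.e. the "context" vector that summarizes the whole source sentence.
        _, context = self.encoder(self.src_embed(src_ids))
        # Decode: produce the output sequence, conditioned on the context.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), context)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) word scores

# Example: run batches of token-index tensors (random ids here).
model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
scores = model(torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 9)))
print(scores.shape)  # torch.Size([2, 9, 8000])
```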

## Training NMT

• Get a large parallel corpus
• Source sentences are encoded in batches; the encoder's final hidden state is fed to the target (decoder) LSTM as its initial state
• Compare the output word by word against the reference translation and take a loss at each step (the negative log probability of the correct word)
• The loss gives us the signal to backpropagate through the entire network
• Seq2Seq is optimized as a single system, so you can update all parameters of both the decoder and the encoder
• Target sentences are produced by the decoder RNN (see the training-step sketch after this list)
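
A minimal training-step sketch, assuming the Seq2Seq model sketched earlier, teacher forcing, and padding index 0; all names are illustrative.

```python
# A sketch of one NMT training step with teacher forcing; assumes the Seq2Seq
# model sketched above and batches of (src_ids, tgt_ids) token-index tensors.
import torch.nn as nn

def train_step(model, optimizer, src_ids, tgt_ids, pad_id: int = 0) -> float:
    # Cross-entropy = negative log probability of the correct reference word.
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    optimizer.zero_grad()
    # Teacher forcing: the decoder sees the reference words shifted right and
    # must predict the next reference word at every position.
    scores = model(src_ids, tgt_ids[:, :-1])                # (batch, T-1, vocab)
    loss = criterion(scores.reshape(-1, scores.size(-1)),   # compare word by word
                     tgt_ids[:, 1:].reshape(-1))
    loss.backward()   # backprop through the entire network: decoder AND encoder
    optimizer.step()  # single system: one update touches all parameters
    return loss.item()
```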

## Multi layer RNNs

• By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state
• Stacking multiple RNN layers allows the network to compute more complex representations
• Lower RNN layers should compute lower-level features and higher RNN layers should compute higher-level features (see the sketch after this list)
• Lower-level features: more basic properties of words, e.g., part of speech, or whether a word is a name or a company
• Higher-level features: the overall structure of the sentence, positive or negative connotation, semantic meaning
• The decoder uses `<START>` and `<END>` tokens to mark the beginning and end of the output sentence
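
A minimal sketch of a stacked (multi-layer) RNN in PyTorch; the layer count and sizes are illustrative assumptions.

```python
# A sketch of a multi-layer RNN: num_layers stacks LSTM layers so that lower
# layers can capture lower-level features and higher layers more abstract ones.
import torch
import torch.nn as nn

embed_dim, hidden_size, num_layers = 256, 512, 4
stacked_encoder = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                          batch_first=True)

x = torch.randn(8, 20, embed_dim)        # (batch, sentence length, embedding dim)
outputs, (h_n, c_n) = stacked_encoder(x)
print(outputs.shape)  # per-step hidden states of the TOP layer: (8, 20, 512)
print(h_n.shape)      # final hidden state of EACH of the 4 layers: (4, 8, 512)
```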

## Beam search decoding

• On each step of the decoder, keep track of the k most probable partial translations (each called a hypothesis)
• k is the beam size (typically 5-10)
• Used in more than just NMT
• Each hypothesis has a score: the sum of the log probabilities of the words it has generated so far
• Not guaranteed to find optimal solution
• Longer hypotheses have lower scores, since each added word contributes a negative log probability
• Need to normalize scores by length when comparing hypotheses (see the sketch after this list)
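
A minimal sketch of beam search decoding; `next_log_probs` is a hypothetical stand-in for the trained decoder (not an actual library call), and the toy usage at the end is purely illustrative.

```python
# A sketch of beam search: at each step keep the k highest-scoring hypotheses,
# where a hypothesis's score is the sum of log probabilities of its words.
import math
from typing import Callable, List, Tuple

START, END = "<START>", "<END>"

def beam_search(next_log_probs: Callable[[List[str]], List[Tuple[str, float]]],
                k: int = 5, max_len: int = 20) -> List[str]:
    # Each hypothesis: (words so far, score = sum of log probabilities so far).
    beams: List[Tuple[List[str], float]] = [([START], 0.0)]
    finished: List[Tuple[List[str], float]] = []

    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, logp in next_log_probs(words):
                candidates.append((words + [word], score + logp))
        # Keep only the k most probable partial translations (hypotheses).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:k]:
            (finished if words[-1] == END else beams).append((words, score))
        if not beams:
            break

    # Longer hypotheses accumulate more negative log probabilities, so compare
    # hypotheses by their length-normalized score.
    finished = finished or beams
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]

# Toy usage with a fixed two-option "decoder" (purely illustrative).
def toy_decoder(prefix: List[str]) -> List[Tuple[str, float]]:
    return [("ok", math.log(0.6)), (END, math.log(0.4))]

print(beam_search(toy_decoder, k=3, max_len=5))
```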

## Evaluation for Machine Translation

• Human evaluation: get a translator to judge how good a translation is
• Automatic scoring: BLEU
• Compare the machine-written translation to one or several human-written translations and compute a similarity score based on n-gram overlap (see the sketch after this list)
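
A minimal sketch of computing a BLEU-style similarity score with NLTK's `sentence_bleu` (assumes the `nltk` package is installed; the example sentences and whitespace tokenization are illustrative).

```python
# A sketch of BLEU scoring: compare a machine translation against one or
# several human reference translations via n-gram overlap (up to 4-grams).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]  # one or several human-written translations
hypothesis = "the cat sat on the mat".split()  # machine-written translation

score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```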

## Low Resource Machine Translation

• Standard setting: a parallel dataset is available
• Minimize the cross-entropy loss
• Equivalently, maximize the log probability of the reference human translation given the source sentence
• This is supervised learning, because a parallel dataset is available
• Algorithms for the low-resource setting:
• Phrase-Based and Neural Unsupervised Machine Translation
• Back-translation: a data augmentation technique (see the sketch after this list)
• The amount of noise injected (e.g., for denoising objectives) is a hyperparameter
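
A minimal sketch of back-translation as data augmentation; `translate_target_to_source` stands in for a hypothetical reverse-direction translation model, not a specific library call.

```python
# A sketch of back-translation: turn target-language monolingual text into
# synthetic (source, target) pairs using a reverse-direction MT model.
from typing import Callable, List, Tuple

def back_translate(target_monolingual: List[str],
                   translate_target_to_source: Callable[[str], str]
                   ) -> List[Tuple[str, str]]:
    synthetic_pairs = []
    for target_sentence in target_monolingual:
        # The synthetic source side may be noisy, but the target side is real
        # human-written text, so the pairs still help the forward model.
        synthetic_source = translate_target_to_source(target_sentence)
        synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs

# Usage idea: mix these synthetic pairs with whatever real parallel data exists,
# then train the forward (source -> target) model on the combined corpus.
```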

## Citation

If you found our work useful, please cite it as:

@article{Chadha2021Distilled,
title   = {Machine Translation},
author  = {Jain, Vinija and Chadha, Aman},
journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
year    = {2021},
note    = {\url{https://aman.ai}}
}