Language Modeling + RNN

  • Language modeling is the task of predicting what word comes next.
  • Formally, it defines a probability distribution over the next word given the preceding context (see the equation after this list).
  • A language model is a system that does this.
  • Equivalently, it assigns a probability to a piece of text.
    • Gboard's next-word suggestions are powered by a language model.
    • Search query completion is powered by a language model.
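
A minimal formal statement, using the x^{(t)} token notation common in CS224n:

```latex
% Next-word prediction: given tokens x^{(1)}, \ldots, x^{(t)},
% a language model outputs
P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)

% Equivalently, it assigns a probability to an entire text of
% length T via the chain rule:
P\left(x^{(1)}, \ldots, x^{(T)}\right)
  = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)
```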

N-gram language models

  • An n-gram is a chunk of n consecutive words (e.g., 4-gram, 5-gram).
  • Idea: count how often each word sequence occurs in a large corpus and use those counts to estimate conditional probabilities.
  • Make a Markov assumption: to predict word t+1, throw away all earlier words and condition only on the preceding n-1 words.
  • A 4-gram language model:
    • Only uses the last 3 words as context.
  • N-grams have a sparsity problem: if a word sequence never appears within your window in the training data, it gets probability 0 (see the sketch after this list).
  • N-grams also lack context, so their output is incoherent. We need to consider more than 3-4 words at a time if we want to model language well.
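
As a concrete illustration, here is a minimal count-based trigram model in Python; the function names and toy corpus are my own, not from the lecture. Note how an unseen context immediately yields probability 0, which is the sparsity problem:

```python
from collections import Counter, defaultdict

# Toy count-based trigram language model:
#   P(w | u, v) ~= count(u, v, w) / count(u, v)
def train_trigram_lm(tokens):
    counts = defaultdict(Counter)
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        counts[(u, v)][w] += 1
    return counts

def next_word_prob(counts, u, v, w):
    context = counts[(u, v)]
    total = sum(context.values())
    if total == 0:
        return 0.0  # sparsity: this context never appeared in training data
    return context[w] / total

tokens = "the cat sat on the mat and the cat slept".split()
lm = train_trigram_lm(tokens)
print(next_word_prob(lm, "the", "cat", "sat"))   # 0.5 (seen context)
print(next_word_prob(lm, "a", "dog", "ran"))     # 0.0 (unseen -> sparsity)
```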

Neural Language Model

  • Input: a sequence of words.
  • Output: a probability distribution over the next word.
  • First attempt: a window-based neural model.
  • A fixed-window neural language model discards words outside the window, concatenates the embeddings of the window words, and passes them through a single hidden layer to a softmax over the vocabulary (see the sketch after this list).
  • These still perform poorly, but they are the precursor to RNN language models.
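
A minimal PyTorch sketch of such a fixed-window model (layer sizes and names are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

# Sketch of a fixed-window neural LM: concatenated window embeddings,
# a single hidden layer, and a softmax over the vocabulary.
class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10_000, window=4, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # single hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):                # (batch, window)
        e = self.embed(window_ids)                # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)                # concatenate window embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                        # logits over the next word

model = FixedWindowLM()
logits = model(torch.randint(0, 10_000, (2, 4)))  # batch of 2 windows
probs = logits.softmax(dim=-1)                    # distribution over next word
print(probs.shape)                                # torch.Size([2, 10000])
```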

Evaluating language models

  • Perplexity is the standard evaluation metric for language models; lower is better (see the equation after this list).
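
The standard definition, with $J(\theta)$ denoting the average cross-entropy loss over the corpus:

```latex
% Perplexity over a corpus of T tokens: the inverse probability of the
% corpus, normalized by length. It equals the exponential of the
% average cross-entropy loss.
\text{perplexity}
  = \prod_{t=1}^{T}
    \left( \frac{1}{P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T}
  = \exp\big(J(\theta)\big)
```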

Contextual Embeddings

  • Word embeddings are what made neural networks work well in NLP.
    • But standard word embeddings are context-free.
    • "Bank" gets the same embedding whether it refers to a river bank or a financial institution.
  • Solution: train contextual representations on a text corpus.


ELMo

  • Embeddings from Language Models (ELMo; Peters et al., 2018) is one method that provides deep contextualized word embeddings.
  • Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding, using a bidirectional LSTM trained as a language model.
  • It trains a left-to-right LM and a right-to-left LM and concatenates their representations (see the sketch after this list).
  • It is trained bidirectionally, but only weakly: the two directions are separate LMs joined by concatenation rather than jointly conditioning on both sides.
  • ELMo produces a word embedding for each context in which the word is used, thus allowing different representations for different senses of the same word.
  • The resulting representations are still somewhat fixed: they are typically used as features for a downstream task.
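
A rough PyTorch sketch of the bidirectional idea (real ELMo uses character CNNs, multiple biLSTM layers, and a learned weighting of layers; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Run an LSTM left-to-right and another right-to-left over the sentence,
# then concatenate the hidden states per token.
vocab_size, embed_dim, hidden_dim = 10_000, 64, 128
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 6))  # one sentence, 6 tokens
states, _ = bilstm(embed(token_ids))              # (1, 6, 2 * hidden_dim)
# states[:, t] concatenates the forward and backward hidden states for
# token t, so the same word gets different vectors in different sentences.
```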


GPT-3

  • GPT-3 is a language model: given a context, it predicts a probability distribution over the next word.
  • Can use for DB.
  • What is new about GPT-3 is flexible "in-context" learning (see the example after this list):
    • It demonstrates some level of fast adaptation to completely new tasks via in-context learning.
    • The language-model training (outer loop) is learning how to learn from the context (inner loop).
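
A hypothetical few-shot prompt illustrating in-context learning (the examples are made up; no real API call is shown):

```python
# The task is "demonstrated" in the context, and the model adapts to it
# without any gradient updates.
prompt = """\
English: cheese   -> French: fromage
English: bread    -> French: pain
English: apple    -> French:"""
# Feeding this prompt to GPT-3 and sampling the continuation would
# ideally yield "pomme": the model infers the task from context alone.
```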


If you found our work useful, please cite it as:

@article{jain2021languagemodels,
  title   = {Language Models},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{}}
}