## Language Modeling + RNN

• Language modeling is the task of predicting what word comes next.
• Formally: compute a probability distribution over the next word given the preceding context (see the definition below).
• A language model is any system that performs this task.
• Equivalently, it assigns a probability to a piece of text.
• Gboard's next-word suggestions are powered by a language model.
• Search query completion is powered by a language model.
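
Concretely, given a sequence of words $x^{(1)}, \ldots, x^{(t)}$, a language model computes:

$$P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)$$

and, by the chain rule, the probability it assigns to a whole piece of text is the product of these conditional probabilities:

$$P\left(x^{(1)}, \ldots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)$$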

## N-gram language models

• An n-gram is a chunk of n consecutive words (e.g., 4-gram, 5-gram).
• Make a Markov assumption: to predict word t+1, throw away all earlier words and condition only on the preceding n-1 words.
• A 4-gram language model therefore conditions only on the last 3 words.
• Probabilities are estimated by counting how often word sequences occur in a corpus (see the sketch after this list).
• Problem: if an n-gram never occurs in the corpus, it gets probability 0. This is the sparsity problem.
• N-grams also lack longer-range context, so generated text is incoherent; we need to consider more than 3-4 words at a time if we want to model language well.
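
A minimal count-based sketch of the idea for a trigram (3-gram) model; the toy corpus and function names here are illustrative:

```python
from collections import defaultdict, Counter

# Count-based trigram estimate: P(w | w1, w2) ~ count(w1 w2 w) / count(w1 w2)
def train_trigram(tokens):
    counts = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        context = (tokens[i], tokens[i + 1])
        counts[context][tokens[i + 2]] += 1
    return counts

def next_word_probs(counts, w1, w2):
    ctx = counts[(w1, w2)]
    total = sum(ctx.values())
    if total == 0:
        # Sparsity problem: unseen context -> every next word gets probability 0.
        return {}
    return {w: c / total for w, c in ctx.items()}

tokens = "the students opened their books so the students opened their minds".split()
model = train_trigram(tokens)
print(next_word_probs(model, "students", "opened"))  # {'their': 1.0}
```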

## Neural Language Model

• Input: a sequence of words.
• Output: a probability distribution over the next word.
• First idea: a window-based neural model.
• A fixed-window neural language model discards words outside the window (sketch below).
• Uses a single hidden layer over the concatenated window embeddings.
• Still quite limited (the window can never be big enough, and no weights are shared across positions), but an important precursor to RNNs.
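
A minimal PyTorch sketch of a fixed-window neural LM; the window size, dimensions, and class name are illustrative assumptions, not the exact architecture from the lecture:

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, window=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):      # (batch, window)
        e = self.embed(window_ids)      # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)      # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))  # single hidden layer
        return self.out(h)              # logits over the next word

model = FixedWindowLM(vocab_size=10_000)
logits = model(torch.tensor([[4, 15, 7]]))  # condition on the last 3 words only
probs = torch.softmax(logits, dim=-1)       # distribution over the next word
```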

## Evaluating language models

• Perplexity is the standard evaluation metric for language models; lower is better. It equals the exponential of the average cross-entropy loss (see below).
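
Using the standard definition over a corpus of $T$ words, perplexity is the inverse probability of the corpus, normalized by its length, which works out to the exponential of the average cross-entropy loss $J(\theta)$:

$$\text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{\text{LM}}\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T} = \exp\big(J(\theta)\big)$$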

## Contextual Embeddings

• Word embeddings are how neural networks consume text in NLP.
• Classic word embeddings (e.g., word2vec, GloVe) are context-free.
• "Bank" gets the same embedding whether it refers to a river bank or a financial institution.
• Solution: learn contextual representations by training on a text corpus, so each occurrence of a word gets its own vector.

## ELMo

• Embeddings from Language Models (ELMo) (Peters et al., 2018) provides deep contextualized word embeddings.
• Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding, using bi-directional LSTMs trained with a language modeling objective.
• Only weakly bidirectional: a left-to-right LM and a right-to-left LM are trained separately and their representations are concatenated (see the sketch below).
• ELMo produces a word embedding for each context where the word is used, thus allowing different representations for varying senses of the same word.
• The pretrained model is still essentially fixed: it is typically used as a frozen feature extractor for downstream tasks rather than fine-tuned end to end.
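
A hedged sketch of the core idea (forward and backward LSTMs over the same sentence, concatenated per token); real ELMo uses multi-layer LSTMs, character-CNN inputs, and learned layer weighting, and all sizes here are illustrative:

```python
import torch
import torch.nn as nn

class BiLMEmbedder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq)
        e = self.embed(token_ids)
        h_fwd, _ = self.fwd(e)                    # left-to-right LM states
        h_bwd, _ = self.bwd(e.flip(1))            # right-to-left LM states
        h_bwd = h_bwd.flip(1)                     # realign with token order
        return torch.cat([h_fwd, h_bwd], dim=-1)  # contextual embedding per token

emb = BiLMEmbedder(vocab_size=1000)
vecs = emb(torch.randint(0, 1000, (1, 5)))  # a different vector per position,
print(vecs.shape)                           # even for repeated words: [1, 5, 128]
```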

## GPT-3

• Can be applied to database tasks (e.g., turning natural language into SQL queries).
• At its core, GPT-3 is a language model.
• Given a context, it predicts a probability distribution over the next word.
• What's new about GPT-3 is flexible "in-context" learning (sketch after this list).
• It demonstrates some level of fast adaptation to completely new tasks via in-context learning.
• The language model training (outer loop) is learning how to learn from the context (inner loop).
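
A sketch of the few-shot in-context learning format; the task, demonstrations, and helper name are made up for illustration. No weights change at inference time — the "learning" happens entirely inside the prompt:

```python
# Build a few-shot prompt: demonstrations followed by the query to complete.
def few_shot_prompt(examples, query):
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],  # English -> French demos
    query="bread",
)
print(prompt)  # fed to the LM, whose next-word predictions should continue "pain"
```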

## Citation

If you found our work useful, please cite it as:

@article{Chadha2021Distilled,
title   = {Language Models},
author  = {Jain, Vinija and Chadha, Aman},
journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
year    = {2021},
note    = {\url{https://aman.ai}}
}