Language Modeling + RNN

  • Language modeling is the task of predicting what word comes next.
  • Formally, it defines a probability distribution over the next word given the preceding context (see the equation after this list).
  • A language model is a system that does this.
  • Equivalently, it assigns a probability to a piece of text.
    • Gboard's next-word suggestions are powered by a language model.
    • Search query completion is powered by a language model.
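
A minimal formal statement, using the x^{(t)} token notation common in CS224n:

```latex
% Next-word prediction: given tokens x^{(1)}, \ldots, x^{(t)},
% a language model outputs
P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)

% Equivalently, it assigns a probability to an entire text of
% length T via the chain rule:
P\left(x^{(1)}, \ldots, x^{(T)}\right)
  = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)
```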

N-gram language models

  • An n-gram is a chunk of n consecutive words (e.g., 4-gram, 5-gram).
  • Idea: count how often each word sequence occurs in a large corpus and use those counts to estimate conditional probabilities.
  • Make a Markov assumption: to predict word t+1, throw away all earlier words and condition only on the preceding n-1 words.
  • A 4-gram language model:
    • Only uses the last 3 words as context.
  • N-grams have a sparsity problem: if a word sequence never appears within your window in the training data, it gets probability 0 (see the sketch after this list).
  • N-grams also lack context, so their output is incoherent. We need to consider more than 3-4 words at a time if we want to model language well.
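
As a concrete illustration, here is a minimal count-based trigram model in Python; the function names and toy corpus are my own, not from the lecture. Note how an unseen context immediately yields probability 0, which is the sparsity problem:

```python
from collections import Counter, defaultdict

# Toy count-based trigram language model:
#   P(w | u, v) ~= count(u, v, w) / count(u, v)
def train_trigram_lm(tokens):
    counts = defaultdict(Counter)
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        counts[(u, v)][w] += 1
    return counts

def next_word_prob(counts, u, v, w):
    context = counts[(u, v)]
    total = sum(context.values())
    if total == 0:
        return 0.0  # sparsity: this context never appeared in training data
    return context[w] / total

tokens = "the cat sat on the mat and the cat slept".split()
lm = train_trigram_lm(tokens)
print(next_word_prob(lm, "the", "cat", "sat"))   # 0.5 (seen context)
print(next_word_prob(lm, "a", "dog", "ran"))     # 0.0 (unseen -> sparsity)
```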

Neural Language Model

  • Input: a sequence of words.
  • Output: a probability distribution over the next word.
  • First attempt: a window-based neural model.
  • A fixed-window neural language model discards words outside the window, concatenates the embeddings of the window words, and passes them through a single hidden layer to a softmax over the vocabulary (see the sketch after this list).
  • These still perform poorly, but they are the precursor to RNN language models.
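
A minimal PyTorch sketch of such a fixed-window model (layer sizes and names are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

# Sketch of a fixed-window neural LM: concatenated window embeddings,
# a single hidden layer, and a softmax over the vocabulary.
class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size=10_000, window=4, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # single hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):                # (batch, window)
        e = self.embed(window_ids)                # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)                # concatenate window embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                        # logits over the next word

model = FixedWindowLM()
logits = model(torch.randint(0, 10_000, (2, 4)))  # batch of 2 windows
probs = logits.softmax(dim=-1)                    # distribution over next word
print(probs.shape)                                # torch.Size([2, 10000])
```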

Evaluating language models

  • Perplexity is the standard evaluation metric for language models; lower is better (see the equation after this list).
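
The standard definition, with $J(\theta)$ denoting the average cross-entropy loss over the corpus:

```latex
% Perplexity over a corpus of T tokens: the inverse probability of the
% corpus, normalized by length. It equals the exponential of the
% average cross-entropy loss.
\text{perplexity}
  = \prod_{t=1}^{T}
    \left( \frac{1}{P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T}
  = \exp\big(J(\theta)\big)
```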

Contextual Embeddings

  • Word embeddings are what made neural networks work well in NLP.
    • But standard word embeddings are context-free.
    • "Bank" gets the same embedding whether it refers to a river bank or a financial institution.
  • Solution: train contextual representations on a text corpus.


ELMo

  • Embeddings from Language Models (ELMo; Peters et al., 2018) is one method that provides deep contextualized word embeddings.
  • Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding, using a bidirectional LSTM trained as a language model.
  • It trains a left-to-right LM and a right-to-left LM and concatenates their representations (see the sketch after this list).
  • It is trained bidirectionally, but only weakly: the two directions are separate LMs joined by concatenation rather than jointly conditioning on both sides.
  • ELMo produces a word embedding for each context in which the word is used, thus allowing different representations for different senses of the same word.
  • The resulting representations are still somewhat fixed: they are typically used as features for a downstream task.
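
A rough PyTorch sketch of the bidirectional idea (real ELMo uses character CNNs, multiple biLSTM layers, and a learned weighting of layers; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Run an LSTM left-to-right and another right-to-left over the sentence,
# then concatenate the hidden states per token.
vocab_size, embed_dim, hidden_dim = 10_000, 64, 128
embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 6))  # one sentence, 6 tokens
states, _ = bilstm(embed(token_ids))              # (1, 6, 2 * hidden_dim)
# states[:, t] concatenates the forward and backward hidden states for
# token t, so the same word gets different vectors in different sentences.
```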


GPT-3

  • GPT-3 is a language model: given a context, it predicts a probability distribution over the next word.
  • Can use for DB.
  • What is new about GPT-3 is flexible "in-context" learning (see the example after this list):
    • It demonstrates some level of fast adaptation to completely new tasks via in-context learning.
    • The language-model training (outer loop) is learning how to learn from the context (inner loop).
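
A hypothetical few-shot prompt illustrating in-context learning (the examples are made up; no real API call is shown):

```python
# The task is "demonstrated" in the context, and the model adapts to it
# without any gradient updates.
prompt = """\
English: cheese   -> French: fromage
English: bread    -> French: pain
English: apple    -> French:"""
# Feeding this prompt to GPT-3 and sampling the continuation would
# ideally yield "pomme": the model infers the task from context alone.
```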


If you found our work useful, please cite it as:

@article{jain2021languagemodels,
  title   = {Language Models},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{}}
}