Language Modeling and Recurrent Neural Networks (RNN)

  • Language modeling involves predicting the next word in a sequence based on the context of the preceding words.
  • A language model, such as the one behind Gboard's next-word suggestions or a search-query completion system, assigns probabilities to word sequences, which is equivalent to defining a probability distribution over the next word given the preceding context (formalized below).
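Concretely, in the notation used in the CS224n lectures, a language model assigns a probability to a whole sequence by factoring it with the chain rule, so modeling the next-word distribution is all that is needed:

```latex
P\big(x^{(1)}, \dots, x^{(T)}\big) \;=\; \prod_{t=1}^{T} P\big(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\big)
```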

N-gram language models

  • An n-gram is a chunk of n consecutive words (e.g., a 4-gram or 5-gram). An n-gram language model makes a Markov assumption: to predict the next word, only the preceding n-1 words matter.
  • For instance, a 4-gram model conditions only on the last three words. If a particular n-gram never appears in the training corpus, the model assigns it a probability of zero; this is the sparsity problem, usually mitigated with smoothing or backoff.
  • Furthermore, n-gram predictions can lack context and coherence, because conditioning on only the last 3-4 words discards everything said earlier in the text; a toy count-based sketch of this appears below.
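To make the zero-probability behavior concrete, here is a minimal count-based trigram sketch in plain Python (a hypothetical toy example, not code from the course): any continuation that never follows a given two-word context in the training text gets probability zero unless smoothing or backoff is added.

```python
# Toy count-based trigram model: P(w3 | w1, w2) estimated from raw counts.
from collections import Counter, defaultdict

def train_trigram(tokens):
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def next_word_prob(counts, w1, w2, w3):
    context = counts.get((w1, w2))
    if not context:
        return 0.0                      # unseen context: the sparsity problem
    return context[w3] / sum(context.values())

tokens = "the students opened their books and the students opened their minds".split()
model = train_trigram(tokens)
print(next_word_prob(model, "students", "opened", "their"))   # 1.0
print(next_word_prob(model, "opened", "their", "laptops"))    # 0.0, never observed
```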

Neural Language Model

  • Unlike n-gram models, a neural language model takes a sequence of words as input and outputs a probability distribution over the next word.
  • Early neural language models were window-based: they embed a fixed number of preceding words and discard anything farther back. These fixed-window models were precursors to the more flexible RNN-based models; a minimal sketch of one appears below.
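A minimal sketch of such a fixed-window model, assuming PyTorch and illustrative hyperparameters (the layer sizes and window length here are assumptions for illustration, not the lecture's exact architecture): the last few words are embedded, concatenated, and mapped to a softmax distribution over the vocabulary.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, window=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                  # (batch, window) word ids
        e = self.embed(context_ids)                  # (batch, window, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))    # concatenate the window
        return self.out(h)                           # logits over the next word

model = FixedWindowLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (2, 3)))     # two 3-word contexts
probs = torch.softmax(logits, dim=-1)                # next-word distributions
```

Because the input size is fixed, words outside the window cannot influence the prediction, which is precisely the limitation RNNs remove.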

Evaluating language models

  • Perplexity is the standard metric for evaluating language models: the inverse probability of the corpus, normalized by its length, so lower scores indicate better performance.
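For reference, this works out to the exponential of the average cross-entropy loss J(θ):

```latex
\text{perplexity} \;=\; \prod_{t=1}^{T} \left( \frac{1}{P_{\mathrm{LM}}\big(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\big)} \right)^{1/T} \;=\; \exp\big(J(\theta)\big)
```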

Contextual Embeddings

  • While traditional word embeddings in NLP are context-free, meaning a word like “bank” would have the same representation regardless of context (e.g., river bank vs. financial institution), contextual embeddings provide a solution.
  • These generate different representations based on the surrounding text.
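As a quick, hypothetical illustration of this behavior (using an off-the-shelf BERT checkpoint via the Hugging Face transformers library purely because it is easy to load; the specific model choice is an assumption, not something from these notes), the vector for "bank" differs between a river context and a financial one:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state[0]       # (seq_len, dim)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]                                  # contextual vector for "bank"

v1 = bank_vector("she sat by the river bank")
v2 = bank_vector("he deposited the check at the bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```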

ELMo: Embeddings from Language Models

  • ELMo (Peters et al., 2018) introduced contextualized word embeddings.
  • Instead of using a fixed embedding for each word, ELMo takes the entire sentence into account before assigning each word an embedding.
  • This is achieved with a bidirectional LSTM trained on a language modeling objective; the resulting layer representations are then combined with task-specific weights (see the formula below). ELMo can therefore assign different representations to the same word depending on its context.
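The combination used in Peters et al. (2018): for token k, ELMo is a task-specific weighted sum of the biLM's layer representations, where the weights s are softmax-normalized and γ is a learned scalar:

```latex
\mathrm{ELMo}_k^{\,task} \;=\; \gamma^{\,task} \sum_{j=0}^{L} s_j^{\,task}\, \mathbf{h}_{k,j}^{\,LM}
```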

GPT-3: Generative Pre-trained Transformer 3

  • GPT-3 is a powerful language model that predicts the probability of the next word given a context.
  • Its novelty lies in its capacity for flexible "in-context" learning.
  • Given only a few demonstrations in its prompt, it adapts rapidly to completely new tasks without any parameter updates, essentially learning how to learn from the context (an example prompt is sketched below).
  • GPT-3 can also be used as a knowledge base, given its vast training data and ability to generate contextually relevant responses.
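A sketch of what an in-context (few-shot) prompt looks like, modeled on the translation example in the GPT-3 paper (Brown et al., 2020); the task is specified entirely through demonstrations in the context, with no weight updates:

```python
# Few-shot prompt: the model is simply asked to continue this text.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
# Feeding `prompt` to the language model and sampling the most likely
# continuation yields the answer, e.g. " fromage".
```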


If you found our work useful, please cite it as:

@article{JainChadha2021LanguageModels,
  title   = {Language Models},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{}}
}