• Add Knowledge to Language Models:
  • Standard language models: trained to predict the next word in a sequence of text, and can compute the probability of a sequence
  • Masked language models (BERT): instead predict a masked token in a sequence of text using bidirectional context
  • Language models are not always able to predict facts correctly:
    • Unseen facts: some facts may not have occurred in the training corpus at all
    • An LM cannot correctly fill in facts about the world that it has never seen
    • Rare facts: the LM hasn’t seen enough examples during training to memorize the fact
    • Model sensitivity: LM may have seen the fact during training, but is sensitive to the phrasing of the prompt
  • The inability to reliably recall knowledge is a key challenge facing LMs today
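
To make the "probability of a sequence" point concrete, here is a minimal sketch, assuming a toy bigram model. The probability table and the `unseen` fallback value are made up for illustration; a real LM conditions on the full prefix rather than only the previous word.

```python
import math

# Hypothetical bigram probabilities P(next | previous); the numbers are made up.
BIGRAM = {
    ("<s>", "paris"): 0.5,
    ("paris", "is"): 0.8,
    ("is", "in"): 0.5,
    ("in", "france"): 0.25,
}

def sequence_log_prob(tokens, table=BIGRAM, unseen=1e-6):
    """Chain rule: log P(w_1..w_n) = sum over t of log P(w_t | w_{t-1})."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        # Pairs missing from the table get a tiny fallback probability.
        total += math.log(table.get((prev, tok), unseen))
        prev = tok
    return total

def sequence_prob(tokens):
    return math.exp(sequence_log_prob(tokens))

seen = sequence_prob(["paris", "is", "in", "france"])
unseen_fact = sequence_prob(["paris", "is", "in", "spain"])
```

The second call illustrates the "unseen facts" failure mode: a continuation that never appeared in the (toy) training table receives almost no probability mass, whether or not it is true.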

Knowledge Graphs

  • The hope is to replace structured queries (e.g., SQL) over a knowledge base with natural language question answering as the way to retrieve knowledge
  • Advantages of language models over traditional Knowledge Bases (SQL)
    • LMs are pretrained over large amounts of unstructured and unlabeled text
    • Can support more flexible natural language queries
  • Cons:
    • Hard to interpret (why did the LM return that answer?)
      • Knowledge is encoded in the parameters of the model, so it is hard to understand
    • Hard to trust (the LM may produce realistic but incorrect answers)
    • Hard to modify (not easy to remove or update knowledge in the LM)
  • Techniques researchers are using to add knowledge to LMs:
    • Add pretrained entity embeddings
      • Facts about the world are usually stated in terms of entities, so the LM needs good entity representations

Entity Linking

  • Link mentions in text to entities in a knowledge base
  • Tells us which entity embeddings are relevant to the text
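  • Entity embeddings are like word embeddings, but for the entities in a knowledge base
  • Methods for training entity embeddings:
    • Knowledge graph embedding methods: TransE
    • Word-entity co-occurrence methods: Wikipedia2Vec
    • Transformer encodings of entity descriptions: BLINK (from Facebook)
  • How do we incorporate pretrained entity embeddings when they come from a different embedding space than the LM?
    • Learn a fusion layer to combine context and entity information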
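
A fusion layer of this kind can be sketched as follows. This is a minimal illustration, assuming the common form h = F(W_t w + W_e e + b) with tanh as the nonlinearity F; the weights, dimensions, and helper names (`matvec`, `fuse`) are hypothetical and would be learned or defined differently in a real model.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def fuse(word_vec, entity_vec, W_t, W_e, b):
    """h = tanh(W_t w + W_e e + b): project both inputs into a shared space, then combine."""
    proj_w = matvec(W_t, word_vec)    # word embedding -> shared space
    proj_e = matvec(W_e, entity_vec)  # entity embedding -> shared space
    return [math.tanh(pw + pe + bi) for pw, pe, bi in zip(proj_w, proj_e, b)]

# Toy shapes: 3-d word embedding, 2-d entity embedding, 2-d fused output.
W_t = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0]]
W_e = [[0.5, -0.5], [0.2, 0.4]]
b = [0.0, 0.1]
h = fuse([1.0, 2.0, 3.0], [0.5, -1.0], W_t, W_e, b)
```

The two projection matrices are what let embeddings from different spaces (and even different dimensionalities) be added together.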

ERNIE: Enhanced Language Representation with Informative Entities

  • Pretrained entity embeddings
  • Fusion layer
  • Text encoder: multi-layer bidirectional Transformer encoder (BERT) over the words in a sentence
  • Knowledge encoder: stacked blocks composed of:
    • Two multi-headed attentions (MHAs), one over entity embeddings and one over token embeddings
    • A fusion layer to combine the outputs of the MHAs
    • The output of the fusion layer is a new set of word and entity embeddings
  • Pretrained with 3 tasks:
    • Masked language model and next sentence prediction (BERT tasks)
    • Knowledge pretraining task: randomly mask token-entity alignments and predict the corresponding entity for a token from the entities in the sequence
  • The point of the fusion layer is to capture the correlation between word embeddings and entity embeddings so that the model can return the correct answer
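
The entity prediction step of the knowledge pretraining task can be sketched as a softmax over the entities that appear in the sequence. This is a simplified illustration, assuming the token vector has already been projected into the entity embedding space; the embeddings and the function name `predict_entity` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_entity(token_vec, candidates):
    """Score each in-sequence entity by dot product with the token vector."""
    names = list(candidates)
    scores = [sum(t * e for t, e in zip(token_vec, candidates[n])) for n in names]
    probs = softmax(scores)
    best = names[max(range(len(names)), key=probs.__getitem__)]
    return best, dict(zip(names, probs))

# Toy entity embeddings for the entities mentioned in one sentence (made up).
candidates = {"Bob_Dylan": [1.0, 0.0], "Blowin_in_the_Wind": [0.0, 1.0]}
best, dist = predict_entity([2.0, 0.5], candidates)
```

Restricting the softmax to entities in the sequence keeps the prediction cheap compared with scoring every entity in the knowledge base.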


KnowBERT

  • Key idea: pretrain an integrated entity linker as an extension to BERT
  • Learning entity linking may better encode knowledge
  • Like ERNIE, uses a fusion layer to combine entity and context information and adds a knowledge pretraining task


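KGLM

  • Key idea: condition an LSTM language model on a knowledge graph (KG)
  • A standard LM predicts the next word by computing $P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})$
  • KGLM instead predicts the next word using entity information, by computing $P(x^{(t+1)}, \mathcal{E}^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}, \mathcal{E}^{(t)}, \ldots, \mathcal{E}^{(1)})$, where $\mathcal{E}^{(t)}$ is the set of KG entities mentioned at timestep $t$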
  • Builds a “local” knowledge graph as you iterate over the sequence
    • Local KG: subset of the full KG with only entities relevant to the sequence
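  • When should the LM use the local KG rather than the standard vocabulary to predict the next word? The LSTM hidden state is used to classify the next word as one of three types:
    • Related entity (in the local KG): find the top-scoring parent and relation in the local KG using the LSTM hidden state and pretrained entity and relation embeddings
    • New entity (not in the local KG): find the top-scoring entity in the full KG using the LSTM hidden state and pretrained entity embeddings
    • Not an entity: predict the next word from the standard vocabulary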
  • KGLM outperforms GPT-2 on a fact completion task
  • Nearest Neighbor Language Models (kNN-LM)
    • Key idea: learning similarities between text sequences is easier than predicting the next word
    • Store all representations of text sequences in a nearest neighbor datastore
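
The kNN-LM lookup can be sketched as follows: retrieve the nearest stored contexts, turn their distances into a distribution over next words, and interpolate with the LM's own distribution. This is a minimal sketch with a brute-force search; the datastore contents, vectors, and the values of `k` and `lam` are made up for illustration.

```python
import math
from collections import defaultdict

def knn_distribution(query, datastore, k=2):
    """Softmax over negative squared distances of the k nearest stored contexts."""
    scored = sorted(
        (sum((q - c) ** 2 for q, c in zip(query, key)), word)
        for key, word in datastore
    )[:k]
    weights = [math.exp(-d) for d, _ in scored]
    z = sum(weights)
    p = defaultdict(float)
    for w, (_, word) in zip(weights, scored):
        p[word] += w / z  # neighbors with the same next word pool their mass
    return dict(p)

def interpolate(p_lm, p_knn, lam=0.25):
    """P(y | x) = lam * P_kNN(y | x) + (1 - lam) * P_LM(y | x)."""
    vocab = set(p_lm) | set(p_knn)
    return {y: lam * p_knn.get(y, 0.0) + (1 - lam) * p_lm.get(y, 0.0) for y in vocab}

# Datastore entries: (context representation, word that followed that context).
datastore = [((0.0, 0.0), "London"), ((0.1, 0.0), "London"), ((5.0, 5.0), "Paris")]
p_knn = knn_distribution((0.0, 0.1), datastore, k=2)
p_final = interpolate({"London": 0.4, "Paris": 0.6}, p_knn, lam=0.5)
```

Here the LM alone prefers "Paris", but the retrieved neighbors shift the final distribution toward "London". A practical system would replace the brute-force sort with an approximate nearest neighbor index.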

Evaluating knowledge in LMs

  • Language Model Analysis: the LAMA probe
    • How much relational (commonsense and factual) knowledge is already in off-the-shelf language models?
    • Evaluated without any additional training or fine-tuning
  • Limitations of the LAMA probe:
    • Hard to understand why models perform well when they do
    • BERT-large may memorize co-occurrence patterns rather than “understanding” the cloze statements
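
A cloze-style probe of this kind can be scored with precision at k. The sketch below assumes a hypothetical `toy_predict` function that stands in for a pretrained masked LM and returns ranked fill-ins for `[MASK]`; the canned predictions are made up and are not real model output.

```python
def precision_at_k(cloze_examples, predict, k=1):
    """Fraction of cloze statements whose gold answer is in the top-k predictions."""
    hits = 0
    for statement, gold in cloze_examples:
        if gold in predict(statement)[:k]:
            hits += 1
    return hits / len(cloze_examples)

# Hypothetical stand-in for a masked LM: canned, ranked fill-ins for [MASK].
CANNED = {
    "Dante was born in [MASK].": ["Florence", "Rome", "Italy"],
    "iPod Touch is produced by [MASK].": ["Sony", "Apple", "Samsung"],
}

def toy_predict(statement):
    return CANNED.get(statement, [])

examples = [
    ("Dante was born in [MASK].", "Florence"),
    ("iPod Touch is produced by [MASK].", "Apple"),
]
p1 = precision_at_k(examples, toy_predict, k=1)
p2 = precision_at_k(examples, toy_predict, k=2)
```

Because the probe only checks the ranked outputs, it cannot tell whether a hit comes from stored knowledge or from surface co-occurrence, which is the limitation noted above.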


If you found our work useful, please cite it as:

  title   = {Knowledge Graphs},
  author  = {Jain, Vinija and Chadha, Aman},
  journal = {Distilled Notes for Stanford CS224n: Natural Language Processing with Deep Learning},
  year    = {2021},
  note    = {\url{https://aman.ai}}