Background: Pre-Training

  • One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training).
  • The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

Enter BERT

  • In 2018, Google open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. As the name suggests, it generates representations using an encoder from Vaswani et al.’s Transformer architecture. However, there are notable differences between BERT and the original Transformer, especially in how the models are trained.
  • With BERT, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models.
  • In the paper, Devlin et al. (2018) demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).

Word Embeddings

  • Word embeddings (or word vectors) are a way to represent words or sentences as vectors of numbers that can be fed into downstream models for various tasks such as search, recommendation, and translation. The idea can be generalized to entities beyond words, e.g., topics, products, and even people (as is commonly done in applications such as recommender systems and face/speaker recognition).
  • Word2Vec has been one of the most commonly used models that popularized the use of embeddings and made them accessible through easy-to-use pre-trained embeddings (in the case of Word2Vec, embeddings pre-trained on the Google News corpus are available). Two important takeaways from Word2Vec were (see the short sketch after this list for a concrete illustration):
    • Embeddings of semantically similar words are close in cosine similarity.
    • Word embeddings support intuitive arithmetic properties. (An important consequence of this statement is that phrase embeddings can be obtained as the sum of word embeddings.)
  • However, since Word2Vec there have been numerous techniques that attempted to create better and better deep learning models for producing embeddings (as we’ll see in the later sections).
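  • To make the two takeaways above concrete, here is a minimal sketch using the gensim library and the publicly released Google News vectors; both are assumptions for illustration (neither is prescribed by this primer), and the file path is a placeholder:

    from gensim.models import KeyedVectors

    # Load the pre-trained Google News Word2Vec vectors (path is illustrative; download separately).
    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # Semantically similar words are close in cosine similarity.
    print(wv.similarity("car", "automobile"))

    # Intuitive arithmetic: king - man + woman lands near queen.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))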

Contextual vs. Non-contextual Word Embeddings

  • It is often stated that Word2Vec and GloVe yield non-contextual embeddings while ELMo and BERT embeddings are contextual. At a high level, all word embeddings are fundamentally non-contextual but can be made contextual by incorporating hidden layers:
    1. The word2vec model is trained to learn embeddings that predict either the probability of a surrounding word occurring given a center word (SkipGram) or vice versa (CBoW). The surrounding words are also called context words because they appear in the context of the center word.
    2. The GloVe model is trained such that the dot product of a pair of word embeddings reflects the words’ co-occurrence probability, which is defined as the fraction of times that a given word \(y\) occurs within some context window of word \(x\).
    3. If embeddings are trained from scratch in an encoder / decoder framework involving RNNs (or their variants), then, at the input layer, the embedding that you look up for a given word reflects nothing about the context of the word in that particular sentence. The same goes for Transformer-based architectures.
  • Word2Vec and GloVe embeddings can be plugged into any type of neural language model, and contextual embeddings can be derived from them by incorporating hidden layers. These layers extract the meaning of a given word, accounting for the words it is surrounded by (i.e., the context) in that particular sentence. Similarly, while hidden layers of an LSTM encoder or Transformer do extract information about surrounding words to represent a given word, the embeddings at the input layer do not.
  • Key takeaways
    • Word embeddings you retrieve from a lookup table are always non-contextual since you cannot pre-generate contextualized embeddings. (ELMo is slightly different in that it uses a character-based network to compute the initial word embedding, but context is still only incorporated by the subsequent layers.)
    • However, when people say contextual word embeddings, they don’t mean the vectors from the look-up table, they mean the hidden states of the pre-trained model.
    • Here are the key differences between BERT and Word2Vec embeddings and when to use each (a short sketch contrasting the two follows this list):
      • Central idea:
        • BERT: offers different embeddings based on context, i.e., "contextual embeddings" (say, the game "Go" vs. the verb "go"). The same word occurring in different contexts can thus yield different embeddings.
        • Word2Vec: yields the same embedding for a given word even if it occurs in different contexts.
      • Context/dependency coverage:
        • BERT: captures longer-range context/dependencies owing to the Transformer encoder architecture under the hood, which is designed to capture more context.
        • Word2Vec: Skip-gram tries to predict contextual (surrounding) words based on the center word, and CBOW does the reverse, i.e., predicts the center word from its surrounding words; both are limited to a static window size (a hyperparameter) and thus cannot capture longer-range context relative to BERT.
      • Position sensitivity:
        • BERT: takes word position into account since it uses positional embeddings at the input.
        • Word2Vec: does not take word position into account and is thus word-position agnostic (hence "bag of words", which indicates insensitivity to word position).
      • Generating embeddings:
        • BERT: use a pre-trained model to generate embeddings for the desired words (rather than using pre-trained embeddings as in Word2Vec), since they are tailor-fit to the context.
        • Word2Vec: off-the-shelf pre-trained word embeddings are available since the embeddings are context-agnostic.
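  • To make the contrast above concrete, the following sketch extracts the vector for "bank" in two different sentences using BERT; a static Word2Vec-style lookup would return the identical vector in both cases. It assumes PyTorch, the Hugging Face transformers package, and the bert-base-uncased checkpoint, none of which are prescribed by this primer:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    def bank_vector(sentence):
        # Tokenize, run the encoder, and grab the hidden state at the position of "bank".
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index("bank")]

    v_river = bank_vector("He sat on the bank of the river.")
    v_money = bank_vector("She deposited cash at the bank.")

    # Unlike a static lookup, the two "bank" vectors differ because their contexts differ.
    print(torch.cosine_similarity(v_river, v_money, dim=0).item())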

ELMo: Contextualized Embeddings

  • ELMo came up with the concept of contextualized embeddings by grouping together the hidden states of the LSTM-based model (and the initial non-contextualized embedding) in a certain way (concatenation followed by weighted summation).

BERT: An Overview

  • At the input, BERT (and many other Transformer models) consumes at most 512 tokens, truncating anything beyond this length. Since it generates one output per input token, it can output up to 512 tokens.
  • BERT actually uses WordPieces as tokens rather than whole input words, so some words are broken down into smaller sub-word chunks (see the tokenization sketch after this list).
  • BERT is trained using two objectives: (i) Masked Language Modeling (MLM), and (ii) Next Sentence Prediction (NSP).
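  • As a quick illustration of the WordPiece behavior mentioned above, the sketch below (assuming the Hugging Face transformers package and the bert-base-uncased vocabulary, which this primer does not prescribe) shows a word being split into sub-word chunks:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Less frequent words are split into WordPieces; "##" marks a continuation piece.
    # For example, "embeddings" typically splits into ['em', '##bed', '##ding', '##s'].
    print(tokenizer.tokenize("BERT handles embeddings gracefully"))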

Masked Language Modeling (MLM)

  • While the OpenAI transformer (a decoder-only model) gave us a fine-tunable pre-trained model based on the Transformer, something went missing in this transition from LSTMs (ELMo) to Transformers (OpenAI Transformer). ELMo’s language model was bi-directional, but the OpenAI transformer only trains a forward (left-to-right) language model. Could we build a Transformer-based model whose language model looks both forward and backward (in the technical jargon, “is conditioned on both left and right context”)?
    • However, here’s the issue with bidirectional conditioning when pre-training a language model. The community usually pre-trains a language model on a related task that helps the model develop a contextual understanding of words; more often than not, such tasks involve predicting the next word or words in close vicinity of each other. Such training methods can’t be extended to bidirectional models because they would allow each word to indirectly “see itself”: when you approach the same sentence again from the opposite direction, you already know what to expect, which is a case of data leakage. In other words, bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context. The training objective of Masked Language Modeling, which seeks to predict the masked tokens, solves this problem.
    • While the masked language modelling objective allows us to obtain a bidirectional pre-trained model, note that a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, BERT does not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the \(i^{th}\) token is chosen, BERT replaces the \(i^{th}\) token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time, and (3) the unchanged \(i^{th}\) token 10% of the time.
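  • The 80/10/10 replacement scheme above can be sketched in a few lines of plain Python; this is a simplified illustration rather than BERT’s actual data pipeline, and MASK_ID and VOCAB_SIZE are assumed placeholder values:

    import random

    MASK_ID = 103       # assumed id of the [MASK] token
    VOCAB_SIZE = 30000  # assumed WordPiece vocabulary size

    def mask_tokens(token_ids, mask_prob=0.15):
        """Return (input_ids, labels); label -100 marks positions that are not predicted."""
        inputs, labels = list(token_ids), [-100] * len(token_ids)
        for i, tok in enumerate(token_ids):
            if random.random() < mask_prob:  # choose ~15% of positions for prediction
                labels[i] = tok              # the model must recover the original token here
                r = random.random()
                if r < 0.8:                  # 80% of the time: replace with [MASK]
                    inputs[i] = MASK_ID
                elif r < 0.9:                # 10% of the time: replace with a random token
                    inputs[i] = random.randrange(VOCAB_SIZE)
                # remaining 10% of the time: keep the original token unchanged
        return inputs, labels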

Next Sentence Prediction (NSP)

  • To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (\(A\) and \(B\)), is \(B\) likely to be the sentence that follows \(A\), or not?
  • More on the two pre-training objectives in the sections on Masked Language Model (MLM) and Next Sentence Prediction (NSP) below.

BERT Flavors

  • BERT comes in two flavors:
    • BERT Base: 12 layers (transformer blocks), 12 attention heads, a hidden size of 768 (i.e., each token is represented by a 768-dimensional vector; per attention head, the \(q\), \(k\), and \(v\) vectors are 768/12 = 64-dimensional), and 110 million parameters.
    • BERT Large: 24 layers (transformer blocks), 16 attention heads, a hidden size of 1024 (with 1024/16 = 64-dimensional \(q\), \(k\), and \(v\) vectors per head), and 340 million parameters.
  • By default, BERT (which typically refers to BERT Base) thus produces 768-dimensional word embeddings.
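  • As a sanity check on the numbers above, the two configurations can be written down with Hugging Face’s BertConfig (an assumption for illustration; the original release shipped TensorFlow config files instead):

    from transformers import BertConfig

    base = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
                      intermediate_size=3072)   # ~110M parameters
    large = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16,
                       intermediate_size=4096)  # ~340M parameters

    # Per-head query/key/value size is hidden_size / num_attention_heads (768 / 12 = 64).
    print(base.hidden_size // base.num_attention_heads)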

Sentence Embeddings with BERT

  • To calculate sentence embeddings using BERT, there are multiple strategies, but a simple approach is to average the second-to-last hidden layer across the tokens, producing a single 768-dimensional vector. You can also take a weighted sum of the vectors of the words in the sentence.
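  • One way to implement the averaging strategy above is sketched below, assuming PyTorch and the Hugging Face transformers package (neither of which this primer mandates): request all hidden states and mean-pool the second-to-last layer over the tokens.

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model.eval()

    def sentence_embedding(text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden_states = model(**inputs).hidden_states  # embeddings layer + 12 encoder layers
        second_to_last = hidden_states[-2][0]              # (seq_len, 768)
        return second_to_last.mean(dim=0)                  # average over tokens -> (768,)

    print(sentence_embedding("BERT makes sentence embeddings simple.").shape)  # torch.Size([768])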

BERT’s Encoder Architecture vs. Other Decoder Architectures

  • BERT is based on the Transformer encoder. Unlike BERT, decoder models (GPT, Transformer-XL, XLNet, etc.) are auto-regressive in nature. As an encoder-based architecture, BERT traded off auto-regression and gained the ability to incorporate context on both sides of a word, thereby offering better results.
  • Note that XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
  • More on this in the article on Encoder vs. Decoder Models.

What makes BERT different?

  • BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit.
  • However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
  • Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary.
  • For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
  • A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below (source). BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:

Why Unsupervised Pre-Training?

  • Vaswani et al. employed supervised learning to train the original Transformer models for language translation tasks, which requires pairs of source and target language sentences. For example, a German-to-English translation model needs a training dataset with many German sentences and corresponding English translations. Collecting such text data may involve much work, but it is required to ensure machine translation quality. There is not much else we can do about it, or can we?
  • We actually can use unsupervised learning to tap into many unlabelled corpora. However, before discussing unsupervised learning, let’s look at another problem with supervised representation learning. The original Transformer architecture has an encoder for a source language and a decoder for a target language. The encoder learns task-specific representations, which are helpful for the decoder to perform translation, i.e., from German sentences to English. It sounds reasonable that the model learns representations helpful for the ultimate objective. But there is a catch.
  • If we wanted the model to perform other tasks like question answering and language inference, we would need to modify its architecture and re-train it from scratch. It is time-consuming, especially with a large corpus.
  • Would human brains learn different representations for each specific task? It does not seem so. When kids learn a language, they do not aim for a single task in mind. They would somehow understand the use of words in many situations and acquire how to adjust and apply them for multiple activities.
  • To summarize what we have discussed so far, the question is whether we can train a model with many unlabeled texts to generate representations and adjust the model for different tasks without from-scratch training.
  • The answer is a resounding yes, and that’s exactly what Devlin et al. did with BERT. They pre-trained a model with unsupervised learning to obtain non-task-specific representations helpful for various language model tasks. Then, they added one additional output layer to fine-tune the pre-trained model for each task, achieving state-of-the-art results on eleven natural language processing tasks like GLUE, MultiNLI, and SQuAD v1.1 and v2.0 (question answering).
  • So, the first step in the BERT framework is to pre-train a model on a large amount of unlabeled data, giving many contexts for the model to learn representations in unsupervised training. The resulting pre-trained BERT model is a non-task-specific feature extractor that we can fine-tune quickly to a specific objective.
  • The next question is how they pre-trained BERT using text datasets without labeling.

The Strength of Bidirectionality

  • If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
  • To solve this problem, BERT uses the straightforward technique of masking out some of the words in the input and then conditioning on each word’s bidirectional context to predict the masked words. In other words, BERT uses the neighboring words (i.e., bidirectional context) to predict the current masked word – a setup also known as the Cloze task. For example (image source):

  • While this idea has been around for a very long time, BERT was the first to adopt it to pre-train a deep neural network.
  • BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences \(A\) and \(B\), is \(B\) the actual next sentence that comes after \(A\) in the corpus, or just a random sentence? For example (image source):

  • Thus, BERT has been trained on two main tasks:
    • Masked Language Model (MLM)
    • Next Sentence Prediction (NSP)

Masked Language Model (MLM)

  • Language models (LM) estimate the probability of the next word following a sequence of words in a sentence. One application of LM is to generate texts given a prompt by predicting the most probable word sequence. It is a left-to-right language model.
  • Note: “left-to-right” may imply languages such as English and German. However, other languages use “right-to-left” or “top-to-bottom” sequences. Therefore, “left-to-right” means the natural flow of language sequences (forward), not aiming for specific languages. Also, “right-to-left” means the reverse order of language sequences (backward).
  • Unlike left-to-right language models, the masked language models (MLM) estimate the probability of masked words. We randomly hide (mask) some tokens from the input and ask the model to predict the original tokens in the masked positions. It lets us use unlabeled text data since we hide tokens that become labels (as such, we may call it “self-supervised” rather than “unsupervised,” but I’ll keep the same terminology as the paper for consistency’s sake).
  • The model predicts each hidden token solely based on its context, which is where the self-attention mechanism from the Transformer architecture comes into play. The context of a hidden token originates from both directions since the self-attention mechanism considers all tokens, not just the ones preceding the hidden token. Devlin et al. call such representations bi-directional, in contrast with the uni-directional representations produced by left-to-right language models.
  • Note: the term “bi-directional” is a bit of a misnomer because the self-attention mechanism is not directional at all. However, we should treat the term as the antithesis of “uni-directional”.
  • In the paper, they have a footnote (4) that says:

We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only unidirectional version is referred to as a “Transformer decoder” since it can be used for text generation.

  • So, Devlin et al. trained an encoder (including the self-attention layers) to generate bi-directional representations, which can be richer than uni-directional representations from left-to-right language models for some tasks. Also, it is better than simply concatenating independently trained left-to-right and right-to-left language model representations because bi-directional representations simultaneously incorporate contexts from all tokens.
  • However, many tasks involve understanding the relationship between two sentences, such as Question Answering (QA) and Natural Language Inference (NLI). Language modeling (LM or MLM) does not capture this information.
  • We need another unsupervised representation learning task for multi-sentence relationships.

Next Sentence Prediction (NSP)

  • Next sentence prediction is the task of predicting a binary label (i.e., Yes/No, True/False) to learn the relationship between two sentences. Concretely, given two sentences \(A\) and \(B\), the model predicts whether \(B\) is the actual sentence that follows \(A\); when generating the data, the true next sentence is used for \(B\) half of the time and a randomly chosen sentence the other half. It is easy to generate such a dataset from any monolingual corpus; hence, it is unsupervised learning.
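  • A minimal sketch of how such 50/50 sentence pairs could be generated from a corpus is shown below (plain Python; the function name and corpus format are illustrative, not BERT’s actual data pipeline):

    import random

    def make_nsp_pairs(documents):
        """documents: list of documents, each a list of sentences. Yields (sent_a, sent_b, is_next)."""
        all_sentences = [s for doc in documents for s in doc]
        for doc in documents:
            for i in range(len(doc) - 1):
                if random.random() < 0.5:
                    yield doc[i], doc[i + 1], True                     # the actual next sentence
                else:
                    yield doc[i], random.choice(all_sentences), False  # a random sentence from the corpus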

Pre-training BERT

  • How can we pre-train a model for both MLM and NSP tasks? To understand how a model can accommodate two pre-training objectives, let’s look at how they tokenize inputs.
  • They used WordPiece for tokenization, which has a vocabulary of 30,000 tokens, based on the most frequent sub-words (combinations of characters and symbols). Special tokens such as [CLS] (Classification Token), [SEP] (Separator Token), and [MASK] (Masked Token) are added during the tokenization process. These tokens are added to the input sequences during the pre-processing stage, both for training and inference. Let’s delve into these special tokens one-by-one:
  • [CLS] (Classification Token):
    • The CLS token, short for “Classification Token,” is a special token used in BERT (Bidirectional Encoder Representations from Transformers) and similar models. Its primary purpose is to represent the entire input sequence when BERT is used for various NLP (Natural Language Processing) tasks. Here are key points about the CLS token:
      • Representation of the Whole Sequence: The CLS token is placed at the beginning of the input sequence and serves as a representation of the entire sequence. It encapsulates the contextual information from all the tokens in the input.
      • Classification Task: In downstream NLP tasks like text classification or sentiment analysis, BERT’s final hidden state corresponding to the CLS token is often used as the input to a classifier to make predictions.
      • Position and Usage: The CLS token is added to the input during the pre-processing stage before training or inference. It’s inserted at the beginning of the tokenized input, and BERT’s architecture ensures that it captures contextual information from all tokens in both left and right directions.
  • [SEP] (Separator Token):
    • The SEP token, short for “Separator Token,” is another special token used in BERT. Its primary role is to separate segments or sentences within an input sequence. Here’s what you need to know about the SEP token:
      • Segment Separation: The SEP token is inserted between segments or sentences in the input sequence. It helps BERT distinguish between different parts of the input, especially when multiple sentences are concatenated for training or inference.
      • Position and Usage: Like the CLS token, the SEP token is added during the pre-processing stage. It’s placed between segments, ensuring that BERT understands the boundaries between different sentences.
      • Example: In a question-answering task, where a question and a passage are combined as input, the SEP token is used to indicate the boundary between the question and the passage.
  • [MASK] (Masked Token):
    • The MASK token is used during the pre-training phase of BERT, which involves masking certain tokens in the input for the model to predict. Here’s what you should know about the MASK token:
      • Masked Tokens for Pre-training: During pre-training, some of the tokens in the input sequence are randomly selected to be replaced with the MASK token. The model’s objective is to predict the original identity of these masked tokens.
      • Training Objective: The MASK token is introduced during the training phase to improve BERT’s ability to understand context and relationships between words. The model learns to predict the masked tokens based on the surrounding context.
      • Position and Usage: The MASK token is applied to tokens before feeding them into the model during the pre-training stage. This masking process is part of the unsupervised pre-training of BERT.
  • In summary, the [CLS], [SEP], and [MASK] tokens each play a distinct role in how BERT’s inputs are organized, enhancing BERT’s contextual understanding:
    • [CLS]: Stands for classification and is the first token of every sequence. The final hidden state is the aggregate sequence representation. It thus represents the entire input sequence and is used for classification tasks.
    • [SEP]: A separator token for separating segments within the input, such as sentences, questions, and related passages.
    • [MASK]: Used to mask/hide tokens.
  • For the MLM task, they randomly choose a token to replace with the [MASK] token. They calculate cross-entropy loss between the model’s prediction and the masked token to train the model. Specific details about this masking procedure are as follows:
    • To train a deep bidirectional representation, the researchers mask a certain percentage of the input tokens randomly and then predict those masked tokens. This procedure is known as a “masked LM” (MLM), also commonly referred to as a Cloze task in the academic literature. The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, similar to a standard LM. In their experiments, 15% of all WordPiece tokens in each sequence are masked randomly. This approach differs from denoising auto-encoders in that only the masked words are predicted rather than the entire input being reconstructed.
    • While this method enables the creation of a bidirectional pre-trained model, it introduces a mismatch between pre-training and fine-tuning phases due to the absence of the [MASK] token during fine-tuning. To address this, the masked words are not always replaced with the actual [MASK] token. The training data generator randomly selects 15% of the token positions for prediction. If the \(i^{th}\) token is chosen, it is replaced with the [MASK] token 80% of the time, a random token 10% of the time, or remains unchanged 10% of the time. The goal is to use \(T_i\) to predict the original token with cross-entropy loss.
  • For the NSP task, given sentences \(A\) and \(B\), the model learns to predict if \(B\) follows \(A\). They separated sentences \(A\) and \(B\) using the [SEP] token and used the final hidden state corresponding to [CLS] for binary classification. The output of the [CLS] token tells us how likely the current sentence is to follow the prior sentence; you can think of the output of [CLS] as a probability. The motivation is that the [CLS] embedding should contain a “summary” of both sentences to be able to decide if they follow each other or not. Note that they can’t take any other word from the input sequence, because the output at a word’s position is that word’s representation. So they add a token that has no other purpose than serving as a sentence-level representation for classification. Their final model achieved 97%-98% accuracy on this task.
  • Now you may ask the question – instead of using [CLS]’s output, can we just output a number as probability? Yes, we can do that if the task of predicting next sentence is a separate task. However, BERT has been trained on both the MLM and NSP tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) helps BERT to learn both tasks at the same time and boost its performance.
  • When it comes to classification task (e.g. sentiment classification), the output of [CLS] can be helpful because it contains BERT’s understanding at the sentence-level.
  • Note: there is one more detail when separating two sentences. They added a learned “segment” embedding to every token, indicating whether it belongs to sentence \(A\) or \(B\). It is similar to positional encoding, but at the sentence level. They call these segment embeddings. Figure 2 of the paper shows the various embeddings corresponding to the input (a minimal encoding sketch is shown below).
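  • To see the special tokens and segment ids above in action, here is a small sketch (again assuming the Hugging Face tokenizer for bert-base-uncased, which this primer does not prescribe) that encodes a sentence pair:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")

    # Layout: [CLS] <sentence A tokens> [SEP] <sentence B tokens> [SEP]
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
    # token_type_ids mark segment A (0) vs. segment B (1), mirroring the segment embeddings.
    print(enc["token_type_ids"])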

  • So, Devlin et al. pre-trained BERT using the two unsupervised tasks and empirically showed that pre-trained bi-directional representations could help execute various language tasks involving single text or text pairs.
  • The final step is to conduct supervised fine-tuning to perform specific tasks.

Supervised Fine-Tuning

  • Fine-tuning adjusts all pre-trained model parameters for a specific task, which is a lot faster than from-scratch training. Furthermore, it is more flexible than feature-based training that fixes pre-trained parameters. As a result, we can quickly train a model for each specific task without heavily engineering a task-specific architecture.
  • The pre-trained BERT model can generate representations for single text or text pairs, thanks to the special tokens and the two unsupervised language modeling pre-training tasks. As such, we can plug task-specific inputs and outputs into BERT for each downstream task.
  • For classification tasks, we feed the final [CLS] representation to an output layer. For multi-sentence tasks, the encoder can process a concatenated text pair (separated by [SEP]), which effectively provides bi-directional cross attention between the two sentences. For example, we can use it for the question-passage pair in a question-answering task.
  • By now, it should be clear why and how they repurposed the Transformer architecture, especially the self-attention mechanism through unsupervised pre-training objectives and downstream task-specific fine-tuning.
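  • As a sketch of how the final [CLS] representation feeds a task-specific output layer during fine-tuning, the snippet below uses Hugging Face’s BertForSequenceClassification with PyTorch (one possible setup, assumed for illustration; the label scheme is made up):

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # Adds a randomly initialized classification head (a linear layer on top of the [CLS] vector).
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    inputs = tokenizer("This movie was great!", return_tensors="pt")
    labels = torch.tensor([1])                # 1 = positive (illustrative)

    outputs = model(**inputs, labels=labels)  # all pre-trained weights get updated during fine-tuning
    outputs.loss.backward()                   # a full training loop would follow with an optimizer step
    print(outputs.logits)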

Training with Cloud TPUs

  • Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques.
  • The Transformer model architecture, developed by researchers at Google in 2017, gave BERT the foundation to make it successful. The Transformer is implemented in Google’s open source release, as well as the tensor2tensor library.

Results with BERT

  • To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture.
  • On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:

  • BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:

  • Below are the GLUE test results from table 1 of the paper. They reported results on the two model sizes:
    • The base BERT uses 110M parameters in total:
      • 12 encoder blocks
      • 768-dimensional embedding vectors
      • 12 attention heads
    • The large BERT uses 340M parameters in total:
      • 24 encoder blocks
      • 1024-dimensional embedding vectors
      • 16 attention heads

Google search improvements

  • Google search deployed BERT for understanding search queries in 2019.
  • Given an input query, say “2019 brazil traveler to usa need a visa”, the following image shows the difference in Google’s search results before and after BERT. Based on the image, we can see that BERT (“after”) does a much better job at understanding the query compared to the keyword-based match (“before”). A keyword-based match yields articles related to US citizens traveling to Brazil, whereas the intent behind the query was someone in Brazil looking to travel to the USA.

Making BERT Work for You

  • The models that Google has released can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that Google has released so far are English-only, but they are working on releasing models which have been pre-trained on a variety of languages in the near future.
  • The open source TensorFlow implementation and pointers to pre-trained BERT models can be found here. Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs”.

A Visual Notebook to Using BERT for the First Time

  • This Jupyter notebook by Jay Alammar offers a great intro to using a pre-trained BERT model to carry out sentiment classification using the Stanford Sentiment Treebank (SST2) dataset.

Further Reading

FAQs

In BERT, how do we go from \(Q\), \(K\), and \(V\) at the final transformer block’s output to contextualized embeddings?

  • To understand how the \(Q\), \(K\), and \(V\) matrices contribute to the contextualized embeddings in BERT, let’s dive into the core processes occurring in the final layer of BERT’s transformer encoder stack. Each layer performs self-attention, where the matrices \(Q\), \(K\), and \(V\) interact to determine how each token attends to others in the sequence. Through this mechanism, each token’s embedding is iteratively refined across multiple layers, progressively capturing both its own attributes and its contextual relationships with other tokens.
  • By the time these computations reach the final layer, the output embeddings for each token are highly contextualized. Each token’s embedding now encapsulates not only its individual meaning but also the influence of surrounding tokens, providing a rich representation of the token in context. This final, refined embedding is what BERT ultimately uses to represent each token, balancing individual token characteristics with the nuanced context in which the token appears.
  • Let’s dive deeper into how the \(Q\), \(K\), and \(V\) matrices at each layer ultimately yield embeddings that are contextualized, particularly by looking at what happens in the final layer of BERT’s transformer encoder stack. The core steps involved from self-attention outputs in the last layer to meaningful embeddings per token are:

  • Self-Attention Mechanism Recap:

    • In each layer, BERT computes self-attention across the sequence of tokens. For each token, it generates a query vector \(q\), a key vector \(k\), and a value vector \(v\); stacked across the sequence, these form the \(Q\), \(K\), and \(V\) matrices, which are learned transformations of the token embeddings and encode how each token should attend to other tokens.
    • For each token in the sequence, self-attention calculates attention scores by comparing \(Q\) with \(K\), determining the influence or weight of other tokens relative to the current token.
  • Attention Weights Calculation:

    • For each token, the model computes the similarity of its \(Q\) vector with every other token’s \(K\) vector in the sequence. This similarity score is then normalized (typically through softmax), resulting in attention weights.
    • These weights tell us the degree to which each token should “attend to” (or incorporate information from) other tokens.
  • Weighted Summation of Values (Producing Contextual Embeddings):

    • Using the attention weights, each token creates a weighted sum over the \(V\) vectors of other tokens. This weighted sum serves as the output of the self-attention operation for that token.
    • Each token’s output is thus a combination of other tokens’ values, weighted by their attention scores. This result effectively integrates context from surrounding tokens.
  • Passing Through Multi-Head Attention and Feed-Forward Layers:

    • BERT uses multi-head attention, meaning that it performs multiple attention computations (heads) in parallel with different learned transformations of \(Q\), \(K\), and \(V\).
    • Each head provides a different “view” of the relationships between tokens. The outputs from all heads are concatenated and then passed through a feed-forward layer to further refine each token’s representation.
  • Stacking Layers for Deeper Contextualization:

    • The output from the multi-head attention and feed-forward layer for each token is passed as input to the next layer. Each subsequent layer refines the token embeddings by adding another layer of attention-based contextualization.
    • By the final layer, each token embedding has been repeatedly updated, capturing nuanced dependencies from all tokens in the sequence through multiple self-attention layers.
  • Extracting Final Token Embeddings from the Last Encoder Layer:

    • After the last layer, the output matrix contains a contextualized embedding for each token in the sequence. These embeddings represent the final “meaning” of each token as understood by BERT, based on the entire input sequence.
    • For a sequence with \(n\) tokens, the output from the final layer is a matrix of shape \(n \times d\), where \(d\) is the embedding dimension.
  • Embedding Interpretability and Usage:

    • The embedding for each token in this final matrix is now contextualized; it reflects not just the identity of the token itself but also its role and relationships within the context of the entire sequence.
    • These final embeddings can be used for downstream tasks, such as classification or question answering, where the model uses these embeddings to predict task-specific outputs.
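  • The attention-weight and weighted-sum steps above amount to scaled dot-product attention; a bare-bones NumPy sketch for a single head is shown below (a simplification of BERT’s implementation, with toy dimensions):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q, K, V: (seq_len, d) matrices for one attention head."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # similarity of each query with every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights per token
        return weights @ V                               # weighted sum of values = contextualized outputs

    # Toy example: 4 tokens with a 64-dimensional head (BERT Base uses 768 / 12 = 64 per head).
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64): one contextualized vector per token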

What gets passed on from the output of the previous transformer block to the next in the encoder/decoder?

  • In a transformer-based architecture (such as the vanilla transformer or BERT), the output of each transformer block (or layer) becomes the input to the subsequent layer in the stack. Specifically, here’s what gets passed from one layer to the next:

  • Token Embeddings (Contextualized Representations):

    • The main component passed between layers is a set of token embeddings, which are contextualized representations of each token in the sequence up to that layer.
    • For a sequence of \(n\) tokens, if the embedding dimension is \(d\), the output of each layer is an \(n \times d\) matrix, where each row represents the embedding of a token, now updated with contextual information learned from the previous layer.
    • Each embedding at this point reflects the token’s meaning as influenced by the other tokens it attended to in that layer.
  • Residual Connections:

    • Transformers use residual connections to stabilize training and allow better gradient flow. Each layer’s output is combined with its input via a residual (or skip) connection.
    • In practice, the output of the self-attention and feed-forward operations is added to the input embeddings from the previous layer, preserving information from the initial representation.
  • Layer Normalization:

    • After the residual connection, layer normalization is applied to the summed representation. This normalization helps stabilize training by maintaining consistent scaling of token representations across layers.
    • The layer-normalized output is then what gets passed on as the “input” to the next layer.
  • Positional Information:

    • The positional embeddings (added initially to the token embeddings to account for the order of tokens in the sequence) remain embedded in the representations throughout the layers. No additional positional encoding is added between layers; instead, the attention mechanism itself maintains positional relationships indirectly.
  • Summary of the Process:

    1. Each layer receives an \(n \times d\) matrix (the sequence of token embeddings), which now includes contextual information from previous layers.
    2. The layer performs self-attention and passes the output through a feed-forward network.
    3. The residual connection adds the original input to the output of the feed-forward network.
    4. Layer normalization is applied to this result, and the final matrix is passed on as the input to the next layer.
      • This flow ensures that each successive layer refines the contextual embeddings for each token, building progressively more sophisticated representations of tokens within the context of the entire sequence.
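  • A compact PyTorch sketch of what one encoder layer passes to the next is shown below (a simplification of BERT’s actual implementation; dropout and other details are omitted, and nn.MultiheadAttention stands in for BERT’s own attention module):

    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        def __init__(self, d_model=768, n_heads=12, d_ff=3072):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):                 # x: (batch, seq_len, d_model) from the previous layer
            attn_out, _ = self.attn(x, x, x)  # self-attention over the token representations
            x = self.norm1(x + attn_out)      # residual connection + layer normalization
            x = self.norm2(x + self.ff(x))    # feed-forward, another residual + layer norm
            return x                          # the (batch, seq_len, d_model) matrix handed to the next layer

    x = torch.randn(1, 10, 768)               # 10 tokens, 768-dimensional embeddings
    print(EncoderLayer()(x).shape)            # torch.Size([1, 10, 768])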

References

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledBERT,
  title   = {BERT},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}