Primers • Transformers
 Background: Representation Learning for NLP
 Enter the Transformer
 Transformers vs. Recurrent and Convolutional Architectures: An Overview
 Breaking down the Transformer
 Background
 Onehot encoding
 Dot product
 Matrix multiplication as a series of dot products
 First order sequence model
 Second order sequence model
 Second order sequence model with skips
 Masking features
 From Feature Vectors to Transformers
 Attention as matrix multiplication
 Second order sequence model as matrix multiplications
 Sampling a sequence of output words
 Transformer Core
 Embeddings
 Positional encoding
 Decoding output words / Deembeddings
 Attention
 Why attention? Contextualized Word Embeddings
 Types of Attention: Additive, Multiplicative (Dotproduct), and Scaled
 Attention calculation
 SelfAttention
 Single head attention revisited
 Coding up selfattention
 Averaging is equivalent to uniform attention
 Activation Functions
 Attention in Transformers: What’s new and what’s not?
 Calculating \(Q\), \(K\), and \(V\) matrices in the Transformer architecture
 Optimizing Performance with the KV Cache
 Applications of Attention in Transformers
 MultiHead Attention
 CrossAttention
 Dropout
 Skip connections
 Layer normalization
 Softmax
 Stacking Transformer Layers
 Transformer Encoder and Decoder
 Putting it all together: The Transformer Architecture
 Loss function
 Background
 Implementation details
 The relation between transformers and Graph Neural Networks
 Time complexity: RNNs vs. Transformers
 Lessons Learned
 Transformers: merging the worlds of linguistic theory and statistical NLP using fully connected graphs
 Long term dependencies
 Are Transformers learning neural syntax?
 Why multiple heads of attention? Why attention?
 Benefits of Transformers compared to RNNs/GRUs/LSTMs
 What would we like to fix about the transformer? / Drawbacks of Transformers
 Why is training Transformers so hard?
 Transformers: Extrapolation engines in highdimensional space
 The road ahead for Transformers
 Choosing the right language model for your NLP usecase: key takeaways
 Transformers Learning Recipe
 FAQs
 Did the original Transformer use absolute or relative positional encoding?
 How does the choice of positional encoding method can influence the number of parameters added to the model? Consinder absolute, relative, and rotary positional encoding mechanisms.
 In Transformerbased models, why is RoPE required for context length extension?
 Why is the Transformer Architecture not as susceptible to vanishing gradients compared to RNNs?
 What is the fraction of attention weights relative to feedforward weights in common LLMs?
 Further Reading
 References
 Citation
Background: Representation Learning for NLP
 At a high level, all neural network architectures build representations of input data as vectors/embeddings, which encode useful syntactic and semantic information about the data. These latent or hidden representations can then be used for performing something useful, such as classifying an image or translating a sentence. The neural network learns to build betterandbetter representations by receiving feedback, usually via error/loss functions.
 For Natural Language Processing (NLP), conventionally, Recurrent Neural Networks (RNNs) build representations of each word in a sentence in a sequential manner, i.e., one word at a time. Intuitively, we can imagine an RNN layer as a conveyor belt (as shown in the figure below; source), with the words being processed on it autoregressively from left to right. In the end, we get a hidden feature for each word in the sentence, which we pass to the next RNN layer or use for our NLP tasks of choice. Chris Olah’s legendary blog for recaps on LSTMs and representation learning for NLP is highly recommend to develop a background in this area
 Initially introduced for machine translation, Transformers have gradually replaced RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely, Transformers build features of each word using an attention mechanism (which had also been experimented in the world of RNNs as “Augmented RNNs”) to figure out how important all the other words in the sentence are w.r.t. to the aforementioned word. Knowing this, the word’s updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance (as shown in the figure below; source). Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential–onewordatatime–style of processing text with RNNs. As recommended reading, Lilian Weng’s Attention? Attention! offers a great overview on various attention types and their pros/cons.
Enter the Transformer
 History:
 LSTMs, GRUs and other flavors of RNNs were the essential building blocks of NLP models for two decades since 1990s.
 CNNs were the essential building blocks of vision (and some NLP) models for three decades since the 1980s.
 In 2017, Transformers (proposed in the “Attention Is All You Need” paper) demonstrated that recurrence and/or convolutions are not essential for building highperformance natural language models.
 In 2020, Vision Transformer (ViT) (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) demonstrated that convolutions are not essential for building highperformance vision models.
 The most advanced architectures in use before Transformers gained a foothold in the field were RNNs with LSTMs/GRUs. These architectures, however, suffered from the following drawbacks:
 They struggle with really long sequences (despite using LSTM and GRU units).
 They are fairly slow, as their sequential nature doesn’t allow any kind of parallel computing.
 At the time, LSTMbased recurrent models were the defacto choice for language modeling. Here’s a timeline of some relevant events:
 ELMo (LSTMbased): 2018
 ULMFiT (LSTMbased): 2018
 Initially introduced for machine translation by Vaswani et al. (2017), the vanilla Transformer model utilizes an encoderdecoder architecture, which is able to perform sequence transduction with a sophisticated attention mechanism. As such, compared to prior recurrent architectures, Transformers possess fundamental differences in terms of how they work:
 They work on the entire sequence calculating attention across all wordpairs, which let them learn longrange dependencies.
 Some parts of the architecture can be processed in parallel, making training much faster.
 Owing to their unique selfattention mechanism, transformer models offer a great deal of representational capacity/expressive power.
 These performance and parallelization benefits led to Transformers gradually replacing RNNs in mainstream NLP. The architecture takes a fresh approach to representation learning: Doing away with recurrence entirely, Transformers build features of each word using an attention mechanism to figure out how important all the other words in the sentence are w.r.t. the aforementioned word. As such, the word’s updated features are simply the sum of linear transformations of the features of all the words, weighted by their importance.
 Back in 2017, this idea sounded very radical, because the NLP community was so used to the sequential – onewordatatime – style of processing text with RNNs. The title of the paper probably added fuel to the fire! For a recap, Yannic Kilcher made an excellent video overview.
 However, Transformers did not become a overnight success until GPT and BERT immensely popularized them. Here’s a timeline of some relevant events:
 Attention is all you need: 2017
 Transformers revolutionizing the world of NLP, Speech, and Vision: 2018 onwards
 GPT (Transformerbased): 2018
 BERT (Transformerbased): 2018
 Today, transformers are not just limited to language tasks but are used in vision, speech, and so much more. The following plot (source) shows the transformers family tree with prevalent models:
 And, the plots below (first plot source); (second plot source) show the timeline for prevalent transformer models:
 Lastly, the plot below (source) shows the timeline vs. number of parameters for prevalent transformer models:
Transformers vs. Recurrent and Convolutional Architectures: An Overview
Language
 In a vanilla language model, for example, nearby words would first get grouped together. The transformer, by contrast, runs processes so that every element in the input data connects, or pays attention, to every other element. This is referred to as “selfattention.” This means that as soon as it starts training, the transformer can see traces of the entire data set.
 Before transformers came along, progress on AI language tasks largely lagged behind developments in other areas. Infact, in this deep learning revolution that happened in the past 10 years or so, natural language processing was a latecomer and NLP was, in a sense, behind computer vision, per the computer scientist Anna Rumshisky of the University of Massachusetts, Lowell.
 However, with the arrival of Transformers, the field of NLP has received a muchneeded push and has churned model after model that have beat the stateoftheart in various NLP tasks.
 As an example, to understand the difference between vanilla language models (based on say, a recurrent architecture such as RNNs, LSTMs or GRUs) vs. transformers, consider these sentences: “The owl spied a squirrel. It tried to grab it with its talons but only got the end of its tail.” The structure of the second sentence is confusing: What do those “it”s refer to? A vanilla language model that focuses only on the words immediately around the “it”s would struggle, but a transformer connecting every word to every other word could discern that the owl did the grabbing, and the squirrel lost part of its tail.
Vision
In CNNs, you start off being very local and slowly get a global perspective. A CNN recognizes an image pixel by pixel, identifying features like edges, corners, or lines by building its way up from the local to the global. But in transformers, owing to selfattention, even the very first attention layer models global contextual information, making connections between distant image locations (just as with language). If we model a CNN’s approach as starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.
 CNNs work by repeatedly applying filters on local patches of the input data, generating local feature representations (or “feature maps”) and incrementally increase their receptive field and build up to global feature representations. It is because of convolutions that photo apps can organize your library by faces or tell an avocado apart from a cloud. Prior to the transformer architecture, CNNs were thus considered indispensable to vision tasks.
 With the Vision Transformer (ViT), the architecture of the model is nearly identical to that of the first transformer proposed in 2017, with only minor changes allowing it to analyze images instead of words. Since language tends to be discrete, a lot of adaptations were to discretize the input image to make transformers work with visual input. Exactly mimicing the language approach and performing selfattention on every pixel would be prohibitively expensive in computing time. Instead, ViT divides the larger image into square units, or patches (akin to tokens in NLP). The size is arbitrary, as the tokens could be made larger or smaller depending on the resolution of the original image (the default is 16x16 pixels). But by processing pixels in groups, and applying selfattention to each, the ViT could quickly churn through enormous training data sets, spitting out increasingly accurate classifications.
 In Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al. sought to understand how selfattention powers transformers in visionbased tasks.
Multimodal Tasks
 As discussed in the Enter the Transformer section, other architectures are “one trick ponies” while multimodal learning requires handling of modalities with different patterns within a streamlined architecture with a reasonably high relational inductive bias to even remotely reach humanlike intelligence. In other words, we needs a single versatile architecture that seamlessly transitions between senses like reading/seeing, speaking, and listening.
 The potential to offer a universal architecture that can be adopted for multimodal tasks (that requires simultaneously handling multiple types of data, such as raw images, video and language) is something that makes the transformer architecture unique and popular.
 Because of the siloed approach with earlier architectures where each type of data had its own specialized model, this was a difficult task to accomplish. However, transformers offer an easy way to combine multiple input sources. For example, multimodal networks might power a system that reads a person’s lips in addition to listening to their voice using rich representations of both language and image information.
 With crossattention where the query, key and value vectors are derived from different sources, transformers are able to lend themselves as a powerful tool for multimodal learning.
 The transformer thus offers be a big step toward achieving a kind of “convergence” for neural net architectures, resulting in a universal approach to processing data from multiple modalities.
Breaking down the Transformer
 Before we pop open the hood of the Transformer and go through each component one by one, let’s first setup a background in underlying concepts such as onehot vectors, dot product, matrix multiplication, embedding generation, and attention.
Background
Onehot encoding
Overview
 Computers process numerical data. However, in most practical scenarios, the input data is not naturally numeric, for e.g., images (we model intensity values as pixels), speech (we model the audio signal as an oscillogram/spectrogram). Our first step is to convert all the words to numbers so we can do math on them.
 One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.
Idea
 So, you’re playing with ML models and you encounter this “onehot encoding” term all over the place. You see the sklearn documentation for onehot encoder and it says “encode categorical integer features using a onehot aka oneofK scheme.” To demystify that, let’s look at what onehot encoding actually is, through an example.
Example: Basic Dataset

Suppose the dataset is as follows:
╔════════════╦═════════════════╦════════╗ ║ CompanyName Categoricalvalue ║ Price ║ ╠════════════╬═════════════════╣════════║ ║ VW ╬ 1 ║ 20000 ║ ║ Acura ╬ 2 ║ 10011 ║ ║ Honda ╬ 3 ║ 50000 ║ ║ Honda ╬ 3 ║ 10000 ║ ╚════════════╩═════════════════╩════════╝
 The categorical value represents the numerical value of the entry in the dataset. For example: if there were to be another company in the dataset, it would have been given categorical value as 4. As the number of unique entries increases, the categorical values also proportionally increases.
 The previous table is just a representation. In reality, the categorical values start from 0 goes all the way up to \(N1\) categories.
 As you probably already know, the categorical value assignment can be done using sklearn’s LabelEncoder.

Now let’s get back to onehot encoding: Say we follow instructions as given in the sklearn’s documentation for onehot encoding and follow it with a little cleanup, we end up with the following:
╔════╦══════╦══════╦════════╗ ║ VW ║ Acura║ Honda║ Price ║ ╠════╬══════╬══════╬════════║ ║ 1 ╬ 0 ╬ 0 ║ 20000 ║ ║ 0 ╬ 1 ╬ 0 ║ 10011 ║ ║ 0 ╬ 0 ╬ 1 ║ 50000 ║ ║ 0 ╬ 0 ╬ 1 ║ 10000 ║ ╚════╩══════╩══════╩════════╝
 where
0
indicates nonexistent while1
indicates existent.
 where
 Before we proceed further, could you think of one reason why just label encoding is not sufficient to provide to the model for training? Why do you need onehot encoding?
 Problem with label encoding is that it assumes higher the categorical value, better the category. Specifically, what this form of organization presupposes is
VW > Acura > Honda
based on the categorical values. Say supposing your model internally calculates average, then accordingly we get,1+3 = 4/2 = 2
. This implies that: Average of VW and Honda is Acura. This is definitely a recipe for disaster. This model’s prediction would have a lot of errors.  This is why we use onehot encoder to perform “binarization” of the category and include it as a feature to train the model.
 As another example: Suppose you have
flower
feature which can take valuesdaffodil
,lily
, androse
. One hot encoding convertsflower
feature to three features,is_daffodil
,is_lily
, andis_rose
which all are binary.
Example: NLP
 Inspired by Brandon Rohrer’s Transformers From Scratch, let’s consider another example in the context of natural language processing. Imagine that our goal is to create the computer that processes text, say a Machine Translation system that translates computer commands from one language to another. Such a model would ingest the input text and convert (or transduce) a sequence of sounds to a sequence of words.
 We start by choosing our vocabulary, the collection of symbols that we are going to be working with in each sequence. In our case, there will be two different sets of symbols, one for the input sequence to represent vocal sounds and one for the output sequence to represent words.
 For now, let’s assume we’re working with English. There are tens of thousands of words in the English language, and perhaps another few thousand to cover computerspecific terminology. That would give us a vocabulary size that is the better part of a hundred thousand. One way to convert words to numbers is to start counting at one and assign each word its own number. Then a sequence of words can be represented as a list of numbers.
 For example, consider a tiny language with a vocabulary size of three: files, find, and my. Each word could be swapped out for a number, perhaps
files = 1
,find = 2
, andmy = 3
. Then the sentence “Find my files”, consisting of the word sequence[find, my, files]
could be represented instead as the sequence of numbers[2, 3, 1].
 This is a perfectly valid way to convert symbols to numbers, but it turns out that there’s another format that’s even easier for computers to work with, onehot encoding. In onehot encoding a symbol is represented by an array of mostly zeros, the same length of the vocabulary, with only a single element having a value of one. Each element in the array corresponds to a separate symbol.
 Another way to think about onehot encoding is that each word still gets assigned its own number, but now that number is an index to an array. Here is our example above, in onehot notation.
 So the phrase
find my files
becomes a sequence of onedimensional arrays, which, after squeezing together, looks like a twodimensional array.
 The terms “onedimensional array” and “vector” are typically used interchangeably (in this article and otherwise). Similarly, “twodimensional array” and “matrix” can be interchanged as well.
Dot product
 One really useful thing about the onehot representation is that it lets us compute dot product (also referred to as the inner product, scalar product or cosine similarity).
Algebraic Definition

The dot product of two vectors \(\mathbf{a}=\left[a_{1}, a_{2}, \ldots, a_{n}\right]\) and \(\mathbf{b}=\left[b_{1}, b_{2}, \ldots, b_{n}\right]\) is defined as:
\[\mathbf{a} \cdot \mathbf{b}=\sum_{i=1}^{n} a_{i} b_{i}=a_{1} b_{1}+a_{2} b_{2}+\cdots+a_{n} b_{n}\] where $\Sigma$ denotes summation and $n$ is the dimension of the vector space.

For instance, in threedimensional space, the dot product of vectors \([1, 3, 5]\) and \([4,2,1]\) is:
\[\begin{aligned} {[1,3,5] \cdot[4,2,1] } &=(1 \times 4)+(3 \times2)+(5 \times1) \\ &=46+5 \\ &=3 \end{aligned}\] 
The dot product can also be written as a product of two vectors, as below.
\[\mathbf{a} \cdot \mathbf{b}=\mathbf{a b}^{\top}\] where \(\mathbf{b}^{\top}\) denotes the transpose of \(\mathbf{b}\).

Expressing the above example in this way, a \(1 \times 3\) matrix (row vector) is multiplied by a \(3 \times 1\) matrix (column vector) to get a \(1 \times 1\) matrix that is identified with its unique entry:
\[\left[\begin{array}{lll} 1 & 3 & 5 \end{array}\right]\left[\begin{array}{c} 4 \\ 2 \\ 1 \end{array}\right]=3\] 
Key takeaway:
 In summary, to get the dot product of two vectors, multiply their corresponding elements, then add the results. For a visual example of calculating the dot product for two vectors, check out the figure below.
Geometric Definition

In Euclidean space, a Euclidean vector is a geometric object that possesses both a magnitude and a direction. A vector can be pictured as an arrow. Its magnitude is its length, and its direction is the direction to which the arrow points. The magnitude of a vector a is denoted by \(\mid \mid a \mid \mid\). The dot product of two Euclidean vectors \(\mathbf{a}\) and \(\mathbf{b}\) is defined by,
\[\mathbf{a} \cdot \mathbf{b}=\\mathbf{a}\\\mathbf{b}\ \cos \theta\] where \(\theta\) is the angle between \(\mathbf{a}\) and \(\mathbf{b}\).

The above equation establishes the relation between dot product and cosine similarity.
Properties of the dot product

Dot products are especially useful when we’re working with our onehot word representations owing to it’s properties, some of which are highlighted below.

The dot product of any onehot vector with itself is one.
 The dot product of any onehot vector with another onehot vector is zero.
 The previous two examples show how dot products can be used to measure similarity. As another example, consider a vector of values that represents a combination of words with varying weights. A onehot encoded word can be compared against it with the dot product to show how strongly that word is represented. The following figure shows how a similarity score between two vectors is calculated by way of calculating the dot product.
Matrix multiplication as a series of dot products

The dot product is the building block of matrix multiplication, a very particular way to combine a pair of twodimensional arrays. We’ll call the first of these matrices \(A\) and the second one \(B\). In the simplest case, when \(A\) has only one row and \(B\) has only one column, the result of matrix multiplication is the dot product of the two. The following figure shows the multiplication of a single row matrix and a single column matrix.
 Notice how the number of columns in A and the number of rows in \(B\) needs to be the same for the two arrays to match up and for the dot product to work out.
 When \(A\) and \(B\) start to grow, matrix multiplication starts to increase quadratically in time complexity. To handle more than one row in \(A\), take the dot product of \(B\) with each row separately. The answer will have as many rows as A does. The following figure shows the multiplication of a two row matrix and a single column matrix.
 When \(B\) takes on more columns, take the dot product of each column with \(A\) and stack the results in successive columns. The following figure shows the multiplication of a one row matrix and a two column matrix:
 Now we can extend this to mutliplying any two matrices, as long as the number of columns in \(A\) is the same as the number of rows in \(B\). The result will have the same number of rows as \(A\) and the same number of columns as \(B\). The following figure shows the multiplication of a one three matrix and a two column matrix:
Matrix multiplication as a table lookup
 In the above section, we saw how matrix multiplication acts as a lookup table.
 The matrix \(A\) is made up of a stack of onehot vectors. They have ones in the first column, the fourth column, and the third column, respectively. When we work through the matrix multiplication, this serves to pull out the first row, the fourth row, and the third row of the \(B\) matrix, in that order. This trick of using a onehot vector to pull out a particular row of a matrix is at the core of how transformers work.
First order sequence model
 We can set aside matrices for a minute and get back to what we really care about, sequences of words. Imagine that as we start to develop our natural language computer interface we want to handle just three possible commands:
Show me my directories please.
Show me my files please.
Show me my photos please.
 Our vocabulary size is now seven:
{directories, files, me, my, photos, please, show}
 One useful way to represent sequences is with a transition model. For every word in the vocabulary, it shows what the next word is likely to be. If users ask about photos half the time, files 30% of the time, and directories the rest of the time, the transition model will look like this. The sum of the transitions away from any word will always add up to one. The following figure shows a Markov chain transition model.

This particular transition model is called a Markov chain, because it satisfies the Markov property that the probabilities for the next word depend only on recent words. More specifically, it is a first order Markov model because it only looks at the single most recent word. If it considered the two most recent words it would be a second order Markov model.

Our break from matrices is over. It turns out that Markov chains can be expressed conveniently in matrix form. Using the same indexing scheme that we used when creating onehot vectors, each row represents one of the words in our vocabulary. So does each column. The matrix transition model treats a matrix as a lookup table. Find the row that corresponds to the word you’re interested in. The value in each column shows the probability of that word coming next. Because the value of each element in the matrix represents a probability, they will all fall between zero and one. Because probabilities always sum to one, the values in each row will always add up to one. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a transition matrix:

In the transition matrix here we can see the structure of our three sentences clearly. Almost all of the transition probabilities are zero or one. There is only one place in the Markov chain where branching happens. After
my
, the wordsdirectories
,files
, orphotos
might appear, each with a different probability. Other than that, there’s no uncertainty about which word will come next. That certainty is reflected by having mostly ones and zeros in the transition matrix. 
We can revisit our trick of using matrix multiplication with a onehot vector to pull out the transition probabilities associated with any given word. For instance, if we just wanted to isolate the probabilities of which word comes after
my
, we can create a onehot vector representing the word my and multiply it by our transition matrix. This pulls out the row the relevant row and shows us the probability distribution of what the next word will be. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a transition probability lookup using a transition matrix:
Second order sequence model
 Predicting the next word based on only the current word is hard. That’s like predicting the rest of a tune after being given just the first note. Our chances are a lot better if we can at least get two notes to go on.
 We can see how this works in another toy language model for our computer commands. We expect that this one will only ever see two sentences, in a 40/60 proportion.
Check whether the battery ran down please.
Check whether the program ran please.
 A Markov chain illustrates a first order model for this. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a first order Markov chain transition model.
 Here we can see that if our model looked at the two most recent words, instead of just one, that it could do a better job. When it encounters
battery ran
, it knows that the next word will be down, and when it sees program ran the next word will beplease
. This eliminates one of the branches in the model, reducing uncertainty and increasing confidence. Looking back two words turns this into a second order Markov model. It gives more context on which to base next word predictions. Second order Markov chains are more challenging to draw, but here are the connections that demonstrate their value. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a second order Markov chain.
 To highlight the difference between the two, here is the first order transition matrix,
 Here’s a first order transition matrix:
 … and here is the second order transition matrix:
 Notice how the second order matrix has a separate row for every combination of words (most of which are not shown here). That means that if we start with a vocabulary size of \(N\) then the transition matrix has \(N^2\) rows.
 What this buys us is more confidence. There are more ones and fewer fractions in the second order model. There’s only one row with fractions in it, one branch in our model. Intuitively, looking at two words instead of just one gives more context, more information on which to base a next word guess.
Second order sequence model with skips
 A second order model works well when we only have to look back two words to decide what word comes next. What about when we have to look back further? Imagine we are building yet another language model. This one only has to represent two sentences, each equally likely to occur.
Check the program log and find out whether it ran please.
Check the battery log and find out whether it ran down please.
 In this example, in order to determine which word should come after ran, we would have to look back 8 words into the past. If we want to improve on our second order language model, we can of course consider third and higher order models. However, with a significant vocabulary size this takes a combination of creativity and brute force to execute. A naive implementation of an eighth order model would have \(N^8\) rows, a ridiculous number for any reasonable vocabulary.
 Instead, we can do something sly and make a second order model, but consider the combinations of the most recent word with each of the words that came before. It’s still second order, because we’re only considering two words at a time, but it allows us to reach back further and capture long range dependencies. The difference between this secondorderwithskips and a full umpteenthorder model is that we discard most of the word order information and combinations of preceding words. What remains is still pretty powerful.
 Markov chains fail us entirely now, but we can still represent the link between each pair of preceding words and the words that follow. Here we’ve dispensed with numerical weights, and instead are showing only the arrows associated with nonzero weights. Larger weights are shown with heavier lines. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a second order sequence model with skips feature voting.
 Here’s what it might look like in a second order with skips transition matrix.

This view only shows the rows relevant to predicting the word that comes after
ran
. It shows instances where the most recent word (ran
) is preceded by each of the other words in the vocabulary. Only the relevant values are shown. All the empty cells are zeros. 
The first thing that becomes apparent is that, when trying to predict the word that comes after
ran
, we no longer look at just one line, but rather a whole set of them. We’ve moved out of the Markov realm now. Each row no longer represents the state of the sequence at a particular point. Instead, each row represents one of many features that may describe the sequence at a particular point. The combination of the most recent word with each of the words that came before makes for a collection of applicable rows, maybe a large collection. Because of this change in meaning, each value in the matrix no longer represents a probability, but rather a vote. Votes will be summed and compared to determine next word predictions. 
The next thing that becomes apparent is that most of the features don’t matter. Most of the words appear in both sentences, and so the fact that they have been seen is of no help in predicting what comes next. They all have a value of 0.5. The only two exceptions are
battery
andprogram
. They have some 1 and 0 weights associated with the. The featurebattery
, ran indicates that ran was the most recent word and thatbattery
occurred somewhere earlier in the sentence. This feature has a weight of 1 associated with down and a weight of 0 associated with please. Similarly, the feature program, ran has the opposite set of weights. This structure shows that it is the presence of these two words earlier in the sentence that is decisive in predicting which word comes next. 
To convert this set of wordpair features into a next word estimate, the values of all the relevant rows need to be summed. Adding down the column, the sequence Check the program log and find out whether it ran generates sums of 0 for all the words, except a 4 for
down
and a 5 forplease
. The sequenceCheck the battery log and find out whether it ran
does the same, except with a 5 fordown
and a 4 forplease
. By choosing the word with the highest vote total as the next word prediction, this model gets us the right answer, despite having an eight word deep dependency.
Masking features
 On more careful consideration, this is unsatisfying – the difference between a vote total of 4 and 5 is relatively small. It suggests that the model isn’t as confident as it could be. And in a larger, more organic language model it’s easy to imagine that such a slight difference could be lost in the statistical noise.
 We can sharpen the prediction by weeding out all the uninformative feature votes. With the exception of battery, ran and program, ran. It’s helpful to remember at this point that we pull the relevant rows out of the transition matrix by multiplying it with a vector showing which features are currently active. For this example so far, we’ve been using the implied feature vector shown here. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a feature selection vector.
 It includes a one for each feature that is a combination of ran with each of the words that come before it. Any words that come after it don’t get included in the feature set. (In the next word prediction problem these haven’t been seen yet, and so it’s not fair to use them predict what comes next.) And this doesn’t include all the other possible word combinations. We can safely ignore these for this example because they will all be zero.
 To improve our results, we can additionally force the unhelpful features to zero by creating a mask. It’s a vector full of ones except for the positions you’d like to hide or mask, and those are set to zero. In our case we’d like to mask everything except for
battery
,ran
andprogram
,ran
, the only two features that have been of any help. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a masked feature vector.
 To apply the mask, we multiply the two vectors element by element. Any feature activity value in an unmasked position will be multiplied by one and left unchanged. Any feature activity value in a masked position will be multiplied by zero, and thus forced to zero.
 The mask has the effect of hiding a lot of the transition matrix. It hides the combination of ran with everything except
battery
andprogram
, leaving just the features that matter. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a masked transition matrix.
 After masking the unhelpful features, the next word predictions become much stronger. When the word battery occurs earlier in the sentence, the word after ran is predicted to be down with a weight of 1 and please with a weight of 0. What was a weight difference of 25 percent has become a difference of infinity percent. There is no doubt what word comes next. The same strong prediction occurs for please when program occurs early on.
 This process of selective masking is the attention called out in the title of the original paper on transformers. So far, what we’ve described is a just an approximation of how attention is implemented in the paper.
Generally speaking, an attention function computes the weights that should be assigned to a particular element of the input to generate the output. In the context of the specific attention function called scaled dotproduct attention that Transformers deploy which adopts the querykeyvalue paradigm from the field of information retrieval, an attention function is the mapping between a query and a set of keyvalue pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function (referred to as the “alignment” function in Dazmitry Bahdanau’s original paper from Bengio’s lab that introduced attention) of the query with the corresponding key. While this captures a toplevel overview of the important concepts, the details are discussed in the section on Attention.
Origins of attention
 As mentioned above, the attention mechanism originally introduced in Bahdanau et al. (2015) served as a foundation upon which the selfattention mechanism in the Transformer paper was based on.
 The following slide from Stanford’s CS25 course shows how the attention mechanism was conceived and is a perfect illustration of why AI/ML is an empirical field, built on intuition.
From Feature Vectors to Transformers
 The selectivesecondorderwithskips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI’s GPT3 are doing. It doesn’t tell the complete story, but it represents the central gist of it.
 The next sections cover more of the gap between this intuitive explanation and how transformers are implemented. These are largely driven by three practical considerations:
 Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications, with CPUs being good at matrix multiplications owing to it being modeled as a multithreaded algorithm, GPUs being even faster at it owing to them have massively parallelizable/multithreaded dedicated cores onchip that are especially suited. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient. It’s a bullet train. If you can get your baggage into it, it will get you where you want to go real fast.
 Each step needs to be differentiable. So far we’ve just been working with toy examples, and have had the luxury of handpicking all the transition probabilities and mask values—the model parameters. In practice, these have to be learned via backpropagation, which depends on each computation step being differentiable. This means that for any small change in a parameter, we can calculate the corresponding change in the model error or loss.
 The gradient needs to be smooth and well conditioned. The combination of all the derivatives for all the parameters is the loss gradient. In practice, getting backpropagation to behave well requires gradients that are smooth, that is, the slope doesn’t change very quickly as you make small steps in any direction. They also behave much better when the gradient is well conditioned, that is, it’s not radically larger in one direction than another. If you picture a loss function as a landscape, The Grand Canyon would be a poorly conditioned one. Depending on whether you are traveling along the bottom, or up the side, you will have very different slopes to travel. By contrast, the rolling hills of the classic Windows screensaver would have a well conditioned gradient. If the science of architecting neural networks is creating differentiable building blocks, the art of them is stacking the pieces in such a way that the gradient doesn’t change too quickly and is roughly of the same magnitude in every direction.
Attention as matrix multiplication
 Feature weights could be straightforward to build by counting how often each word pair/next word transition occurs in training, but attention masks are not. Up to this point, we’ve pulled the mask vector out of thin air. How transformers find the relevant mask matters. It would be natural to use some sort of lookup table, but now we are focusing hard on expressing everything as matrix multiplications.
 We can use the same lookup method we introduced above by stacking the mask vectors for every word into a matrix and using the onehot representation of the most recent word to pull out the relevant mask.
 In the matrix showing the collection of mask vectors, we’ve only shown the one we’re trying to pull out, for clarity.
 We’re finally getting to the point where we can start tying into the paper. This mask lookup is represented by the \(QK^T\) term in the attention equation (below), described in the details are discussed in the section on Single Head Attention Revisited.
 The query \(Q\) represents the feature of interest and the matrix \(K\) represents the collection of masks. Because it’s stored with masks in columns, rather than rows, it needs to be transposed (with the \(T\) operator) before multiplying. By the time we’re all done, we’ll make some important modifications to this, but at this level it captures the concept of a differentiable lookup table that transformers make use of.
 More in the section on Attention below.
Second order sequence model as matrix multiplications

Another step that we have been hand wavy about so far is the construction of transition matrices. We have been clear about the logic, but not about how to do it with matrix multiplications.

Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.

To see how a neural network layer can create these pairs, we’ll hand craft one. It will be artificially clean and stylized, and its weights will bear no resemblance to the weights in practice, but it will demonstrate how the neural network has the expressivity necessary to build these two word pair features. To keep it small and clean, will focus on just the three attended words from this example,
battery
,program
,ran
. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a neural network layer for creating multiword features.
 In the layer diagram above, we can see how the weights act to combine the presence and absence of each word into a collection of features. This can also be expressed in matrix form. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a weight matrix for creating multi word features.
 And it can be calculated by a matrix multiplication with a vector representing the collection of words seen so far. The following diagram from Brandon Rohrer’s Transformers From Scratch shows the calculation of the
battery, ran
feature.
 The battery and ran elements are 1 and the program element is 0. The bias element is always 1, a feature of neural networks. Working through the matrix multiplication gives a 1 for the element representing
battery, ran
and a 1 for the element representingprogram, ran
. The results for the other case are similar. The following diagram from Brandon Rohrer’s Transformers From Scratch shows the calculation of theprogram, ran
feature.

The final step in calculating these word combo features is to apply a rectified linear unit (ReLU) nonlinearity. The effect of this is to substitute any negative value with a zero. This cleans up both of these results so they represent the presence (with a 1) or absence (with a 0) of each word combination feature.
 With those gymnastics behind us, we finally have a matrix multiplication based method for creating multiword features. Although I originally claimed that these consist of the most recent word and one earlier word, a closer look at this method shows that it can build other features too. When the feature creation matrix is learned, rather than hard coded, other structures can be learned. Even in this toy example, there’s nothing to stop the creation of a threeword combination like
battery, program, ran
. If this combination occurred commonly enough it would probably end up being represented. There wouldn’t be any way to indicated what order the words occurred in (at least not yet), but we could absolutely use their cooccurrence to make predictions. It would even be possible to make use of word combos that ignored the most recent word, likebattery, program
. These and other types of features are probably created in practice, exposing the oversimiplification we made when we claimed that transformers are a selectivesecondorderwithskips sequence model. There’s more nuance to it than that, and now you can see exactly what that nuance is. This won’t be the last time we’ll change the story to incorporate more subtlety.  In this form, the multiword feature matrix is ready for one more matrix multiplication, the second order sequence model with skips we developed above. All together, the following sequence of feedforward processing steps get applied after attention is applied:
 Feature creation matrix multiplication,
 ReLU nonlinearity, and
 Transition matrix multiplication.
 The following equation from the paper shows the steps behind the Feed Forward block in a concise mathematical formulation.
 The architecture diagram (below) from the Transformers paper shows these lumped together as the Feed Forward block.
Sampling a sequence of output words
Generating words as a probability distribution over the vocabulary
 So far we’ve only talked about next word prediction. There are a couple of pieces we need to add to get our decoder to generate a long sequence. The first is a prompt, some example text to give the transformer running start and context on which to build the rest of the sequence. It gets fed in to decoder, the column on the right in the image above, where it’s labeled “Outputs (shifted right)”. Choosing a prompt that gives interesting sequences is an art in itself, called prompt engineering. It’s also a great example of humans modifying their behavior to support algorithms, rather than the other way around.
 The decoder is fed a
<START>
token to generate the first word. The token serves as a signal to tell itself to start decoding using the compact representation of collected information from the encoder (more on this in the section on CrossAttention). The following animation from Jay Alammar’s The Illustrated Transformer shows: (i) the process of parallel ingestion of tokens by the encoder (leading to the generation of key and value matrices from the last encoder layer), and (ii) the decoder producing the first token (the<START>
token is missing from the animation).
 Once the decoder has a partial sequence in the form of a prompt (or the start token) to get started with, it takes a forward pass. The end result is a set of predicted probability distributions of words, one probability distribution for each position in the sequence. The process of deembedding/decoding which involves going from a vector produced as the output of the decoder stack (bottom) to a series of logits (at the output of the linear layer) to a probability distribution (at the output of the softmax layer) and finally, to an output word as shown below in the below illustration from Jay Alammar’s The Illustrated Transformer.
Role of the Final Linear and Softmax Layer
 The linear layer is a simple fully connected layer that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
 A typical NLP model knows about 40,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 40,000 dimensional, with each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the linear layer.
 The Softmax layer then turns those scores from (unnormalized) logits/energy values into (normalized) probabilities, which effectively imposes the following constraints on the output: (i) all values are nonnegative, i.e, \(\in [0, 1]\), and (ii) all values add up to 1.0.
 At each position, the distribution shows the predicted probabilities for each next word in the vocabulary. We don’t care about predicted probabilities for existing words in the sequence (since they’re already established) – what we really care about are the predicted probabilities for the next word after the end of the prompt.
 The cell with the highest probability is chosen (more on this in the section on Greedy Decoding), and the word associated with it is produced as the output for this time step.
Greedy Decoding
 There are several ways to go about choosing what that word should be, but the most straightforward is called greedy decoding, which involves picking the word with the highest probability.
 The new next word then gets added to the sequence and fed in as input to the decoder, and the process is repeated until you either reach an
<EOS>
token or once you sample a fixed number of tokens. The following animation from Jay Alammar’s The Illustrated Transformer shows the decoder autoregressively generating the next token (by absorbing the previous tokens).
 The one piece we’re not quite ready to describe in detail is yet another form of masking, ensuring that when the transformer makes predictions it only looks behind, not ahead. It’s applied in the block labeled “Masked MultiHead Attention”. We’ll revisit this later in the section on Single Head Attention Revisited to understand how it is implemented.
Transformer Core
Embeddings
 As we’ve described them so far, transformers are too big. For a vocabulary size N of say 50,000, the transition matrix between all pairs of words and all potential next words would have 50,000 columns and 50,000 squared (2.5 billion) rows, totaling over 100 trillion elements. That is still a stretch, even for modern hardware.
 It’s not just the size of the matrices that’s the problem. In order to build a stable transition language model, we would have to provide training data illustrating every potential sequence several times at least. That would far exceed the capacity of even the most ambitious training data sets.
 Fortunately, there is a workaround for both of these problems, embeddings.
 In a onehot representation of a language, there is one vector element for each word. For a vocabulary of size \(N\) that vector is an Ndimensional space. Each word represents a point in that space, one unit away from the origin along one of the many axes. A crude representation of a high dimensional space is as below.
 In an embedding, those word points are all taken and rearranged into a lowerdimensional space. In linear algebra terminology, this refers to the projecting data points. The picture above shows what they might look like in a 2dimensional space for example. Now, instead of needing \(\)N numbers to specify a word, we only need 2. These are the \((x, y)\) coordinates of each point in the new space. Here’s what a 2dimensional embedding might look like for our toy example, together with the coordinates of a few of the words.

A good embedding groups words with similar meanings together. A model that works with an embedding learns patterns in the embedded space. That means that whatever it learns to do with one word automatically gets applied to all the words right next to it. This has the added benefit of reducing the amount of training data needed. Each example gives a little bit of learning that gets applied across a whole neighborhood of words.

The illustration shows that by putting important components in one area (
battery
,log
,program
), prepositions in another (down
,out
), and verbs near the center (check
,find
,ran
). In an actual embedding the groupings may not be so clear or intuitive, but the underlying concept is the same. Distance is small between words that behave similarly. 
An embedding reduces the number of parameters needed by a tremendous amount. However, the fewer the dimensions in the embedded space, the more information about the original words gets discarded. The richness of a language still requires quite a bit of space to lay out all the important concepts so that they don’t step on each other’s toes. By choosing the size of the embedded space, we get to trade off computational load for model accuracy.

It will probably not surprise you to learn that projecting words from their onehot representation to an embedded space involves a matrix multiplication. Projection is what matrices do best. Starting with a onehot matrix that has one row and \(N\) columns, and moving to an embedded space of two dimensions, the projection matrix will have \(N\) rows and two columns, as shown here. The following diagram from Brandon Rohrer’s Transformers From Scratch shows a projection matrix describing an embedding.

This example shows how a onehot vector, representing for example battery, pulls out the row associated with it, which contains the coordinates of the word in the embedded space. In order to make the relationship clearer, the zeros in the onehot vector are hidden, as are all the other rows that don’t get pulled out of the projection matrix. The full projection matrix is dense, each row containing the coordinates of the word it’s associated with.

Projection matrices can convert the original collection of onehot vocabulary vectors into any configuration in a space of whatever dimensionality you want. The biggest trick is finding a useful projection, one that has similar words grouped together, and one that has enough dimensions to spread them out. There are some decent precomputed embeddings for common langauges, like English. Also, like everything else in the transformer, it can be learned during training.

The architecture diagram from the Transformers paper shows where the embeddings are generated:
Positional encoding
In contrast to recurrent and convolutional neural networks, the Transformer architecture does not explicitly model relative or absolute position information in its structure.

Up to this point, we’ve assumed that the positions of words are ignored, at least for any words coming before the very most recent word. Now we get to fix that using positional embeddings, which offer a gateway to embed spatial information as an input to the transformer.
 There are several ways that position information could be introduced into our embedded representation of words, but the way it was done in the original transformer was to add a circular wiggle by using sinusoidal positional embeddings. Newer positional encoding schemes that utilize sophisticated schemes such as Rotary Position Embeddings (RoPE), which encode absolute positional information with a rotation matrix and naturally incorporate explicit relative position dependency in the selfattention formulation, have been recently proposed. Rotary Embeddings: A Relative Revolution by Eleuther AI offers a detailed discourse into the mathematical details of RoPE.
 The following diagram from Brandon Rohrer’s Transformers From Scratch shows that positional encoding introduces a circular wiggle, owing to the addition of sinusoidal positional embeddings:

The position of the word in the embedding space acts as the center of a circle. A perturbation is added to it, depending on where it falls in the order of the sequence of words. For each position, the word is moved the same distance but at a different angle, resulting in a circular pattern as you move through the sequence. Words that are close to each other in the sequence have similar perturbations, but words that are far apart are perturbed in different directions.

Since a circle is a two dimensional figure, representing a circular wiggle requires modifying two dimensions of the embedding space. If the embedding space consists of more than two dimensions (which it almost always does), the circular wiggle is repeated in all the other pairs of dimensions, but with different angular frequency, that is, it sweeps out a different number of rotations in each case. In some dimension pairs, the wiggle will sweep out many rotations of the circle. In other pairs, it will only sweep out a small fraction of a rotation. The combination of all these circular wiggles of different frequencies gives a good representation of the absolute position of a word within the sequence.

In the architecture diagram from the Transformers paper, these blocks show the generation of the position encoding and its addition to the embedded words:
Why sinusoidal positional embeddings work
 It seems to add position information into the mix in a way that doesn’t disrupt the learned relationships between words and attention. For a deeper dive into the math and implications, Amirhossein Kazemnejad’s positional encoding tutorial is recommended.
Decoding output words / Deembeddings

Embedding words makes them vastly more efficient to work with, but once the party is over, they need to be converted back to words from the original vocabulary. Deembedding is done the same way embeddings are done, with a projection from one space to another, that is, a matrix multiplication.

The deembedding matrix is the same shape as the embedding matrix, but with the number of rows and columns flipped. The number of rows is the dimensionality of the space we’re converting from. In the example we’ve been using, it’s the size of our embedding space, two. The number of columns is the dimensionality of the space we’re converting to — the size of the onehot representation of the full vocabulary, 13 in our example. The following diagram shows the deembedding transform:
 The values in a good deembedding matrix aren’t as straightforward to illustrate as those from the embedding matrix, but the effect is similar. When an embedded vector representing, say, the word
program
is multiplied by the deembedding matrix, the value in the corresponding position is high. However, because of how projection to higher dimensional spaces works, the values associated with the other words won’t be zero. The words closest toprogram
in the embedded space will also have mediumhigh values. Other words will have near zero value. And there will likely be a lot of words with negative values. The output vector in vocabulary space will no longer be onehot or sparse. It will be dense, with nearly all values nonzero. The following diagram shows the representative dense result vector from deembedding:

We can recreate the onehot vector by choosing the word associated with the highest value. This operation is also called argmax, the argument (element) that gives the maximum value. This is how to do greedy sequence completion, as mentioned in the section on sampling a sequence of output words. It’s a great first pass, but we can do better.

If an embedding maps very well to several words, we might not want to choose the best one every time. It might be only a tiny bit better choice than the others, and adding a touch of variety can make the result more interesting. Also, sometimes it’s useful to look several words ahead and consider all the directions the sentence might go before settling on a final choice. In order to do these, we have to first convert our deembedding results to a probability distribution.
Attention
 Now that we’ve made peace with the concepts of projections (matrix multiplications) and spaces (vector sizes), we can revisit the core attention mechanism with renewed vigor. It will help clarify the algorithm if we can be more specific about the shape of our matrices at each stage. There is a short list of important numbers for this.
 \(N\): vocabulary size; 13 in our example. Typically in the tens of thousands.
 \(n\): maximum sequence length; 12 in our example. Something like a few hundred in the paper (they don’t specify.) 2048 in GPT3.
 \(d_{model}\): number of dimensions in the embedding space used throughout the model (512 in the paper).
 The original input matrix is constructed by getting each of the words from the sentence in their onehot representation, and stacking them such that each of the onehot vectors is its own row. The resulting input matrix has \(n\) rows and \(N\) columns, which we can abbreviate as \([n \times N]\).
 As we illustrated before, the embedding matrix has \(N\) rows and \(d_{model}\) columns, which we can abbreviate as \([N \times d_{model}]\). When multiplying two matrices, the result takes its number of rows from the first matrix, and its number of columns from the second. That gives the embedded word sequence matrix a shape of \([n \times d_{model}]\).
 We can follow the changes in matrix shape through the transformer as a way to track what’s going on (c.f. figure below; source). After the initial embedding, the positional encoding is additive, rather than a multiplication, so it doesn’t change the shape of things. Then the embedded word sequence goes into the attention layers, and comes out the other end in the same shape. (We’ll come back to the inner workings of these in a second.) Finally, the deembedding restores the matrix to its original shape, offering a probability for every word in the vocabulary at every position in the sequence.
Why attention? Contextualized Word Embeddings
History
 Bag of words was the first technique invented to create a machinerepresentation of text. By counting the frequency of words in a piece of text, one could extract its “characteristics”. The following table (source) shows an example of the data samples (reviews) per row and the vocabulary of the model (unique words) across columns.
 However, this suggests that when all words are considered equally important, significant words like “crisis” which carry important meaning in the text can be drowned out by insignificant words like “and”, “for”, or “the” which add little information but are commonly used in all types of text.
 To address this issue, TFIDF (Term FrequencyInverse Document Frequency) assigns weights to each word based on its frequency across all documents. The more frequent the word is across all documents, the less weight it carries.
 However, this method is limited in that it treats each word independently and does not account for the fact that the meaning of a word is highly dependent on its context. As a result, it can be difficult to accurately capture the meaning of the text. This limitation was addressed with the use of deep learning techniques.
Enter Word2Vec: Neural Word Embeddings
 Word2Vec revolutionized embeddings by using a neural network to transform texts into vectors.
 Two popular approaches are the Continuous Bag of Words (CBOW) and Skipgram models, which are trained using raw text data in an unsupervised manner. These models learn to predict the center word given context words or the context words given the center word, respectively. The resulting trained weights encode the meaning of each word relative to its context.
 The following figure (source) visualizes CBOW where the target word is predicted based on the context using a neural network:
 However, Word2Vec and similar techniques (such as GloVe, FastText, etc.) have their own limitations. After training, each word is assigned a unique embedding. Thus, polysemous words (i.e, words with multiple distinct meanings in different contexts) cannot be accurately encoded using this method. As an example:
“The man was accused of robbing a bank.” “The man went fishing by the bank of the river.”
 As another example:
“Time flies like an arrow.” “Fruit flies like a banana.”
 This limitation gave rise to contextualized word embeddings.
Contextualized Word Embeddings
 Transformers, owing to their selfattention mechanism, are able to encode a word using its context. This, in turn, offers the ability to learn contextualized word embeddings.
 Note that while Transformerbased architectures (e.g., BERT) learn contextualized word embeddings, prior work (ELMo) originally proposed this concept.
 As indicated in the prior section, contextualized word embeddings help distinguish between multiple meanings of the same word, in case of polysemous words.
 The process begins by encoding each word as an embedding (i.e., a vector that represents the word and that LLMs can operate with). A basic one is onehot encoding, but we typically use embeddings that encode meaning (the Transformer architecture begins with a randomlyinitialized
nn.Embedding
instance that is learnt during the course of training). However, note that the embeddings at this stage are noncontextual, i.e., they are fixed per word and do not incorporate context surrounding the word.  As we will see in the section on Single Head Attention Revisited, selfattention transforms the embedding to a weighted combination of the embeddings of all the other words in the text. This represents the contextualized embedding that packs in the context surrounding the word.
 Considering the example of the word bank above, the embedding for bank in the first sentence would have contributions (and would thus be influenced significantly) from words like “accused”, “robbing”, etc. while the one in the second sentence would utilize the embeddings for “fishing”, “river”, etc. In case of the word flies, the embedding for flies in the first sentence will have contributions from words like “go”, “soars”, “pass”, “fast”, etc. while the one in the second sentence would depend on contributions from “insect”, “bug”, etc.
 The following figure (source) shows an example for the word flies, and computing the new embeddings involves a linear combination of the representations of the other words, with the weight being proportional to the relationship (say, similarity) of other words compared to the current word. In other words, the output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key (also called the “alignment” function in Bengio’s original paper that introduced attention in the context of neural networks).
Types of Attention: Additive, Multiplicative (Dotproduct), and Scaled
 The Transformer is based on “scaled dotproduct attention”.
 The two most commonly used attention functions are additive attention (proposed by Bahdanau et al. (2015) in Neural Machine Translation by Jointly Learning to Align and Translate), and dotproduct (multiplicative) attention. The scaled dotproduct attention proposed in the Transformer paper is identical to dotproduct attention, except for the scaling factor of \(\frac{1}{\sqrt{d_{k}}}\). Additive attention computes the compatibility function using a feedforward network with a single hidden layer. While the two are similar in theoretical complexity, dotproduct attention is much faster and more spaceefficient in practice, since it can be implemented using highly optimized matrix multiplication code.
 While for small values of \(d_{k}\) the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of \(d_{k}\) (Massive Exploration of Neural Machine Translation Architectures). We suspect that for large values of \(d_{k}\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (To illustrate why the dot products get large, assume that the components of \(q\) and \(k\) are independent random variables with mean 0 and variance 1. Then their dot product, \(q \cdot k=\sum_{i=1}^{d_{k}} q_{i} k_{i}\), has mean 0 and variance \(d_{k}\).). To counteract this effect, we scale the dot products by \(\frac{1}{\sqrt{d_{k}}}\).
Attention calculation
 Let’s develop an intuition about the architecture using the language of mathematical symbols and vectors.

We update the hidden feature \(h\) of the \(i^{th}\) word in a sentence \(\mathcal{S}\) from layer \(\ell\) to layer \(\ell+1\) as follows:
\[h_{i}^{\ell+1}=\text { Attention }\left(Q^{\ell} h_{i}^{\ell}, K^{\ell} h_{j}^{\ell}, V^{\ell} h_{j}^{\ell}\right)\] i.e.,
 where \(j \in \mathcal{S}\) denotes the set of words in the sentence and \(Q^{\ell}, K^{\ell}, V^{\ell}\) are learnable linear weights (denoting the Query, Key and Value for the attention computation, respectively).
 From Eugene Yan’s Some Intuition on Attention and the Transformer blog, to build intuition around the concept of attention, let’s draw a parallel from a real life scenario and reason about the concept of keyvalue attention:
Imagine yourself in a library. You have a specific question (query). Books on the shelves have titles on their spines (keys) that suggest their content. You compare your question to these titles to decide how relevant each book is, and how much attention to give each book. Then, you get the information (value) from the relevant books to answer your question.
 We can understand the attention mechanism better through the following pipeline (source):
 Taking in the features of the word \(h_{i}^{\ell}\) and the set of other words in the sentence \(\left{h_{j}^{\ell} \forall j \in \mathcal{S}\right}\), we compute the attention weights \(w_{i j}\) for each pair \((i, j)\) through the dotproduct, followed by a softmax across all \(j\)’s.
 Finally, we produce the updated word feature \(h_{i}^{\ell+1}\) for word \(i\) by summing over all \(\left{h_{j}^{\ell}\right}\)’s weighted by their corresponding \(w_{i j}\). Each word in the sentence parallelly undergoes the same pipeline to update its features.
 For more details on attention (including an overview of the various types and mathematical formulation of each), please refer the Attention primer.
SelfAttention
 Since the queries, keys, and values are all drawn from the same source, we refer to this as selfattention (we use “attention” and “selfattention” interchangeably in this primer). Selfattention forms the core component of Transformers. Also, given the use of the dotproduct to ascertain similarity between the query and key vectors, the attention mechanism is also called dotproduct selfattention.
 Note that one of the benefits of selfattention over recurrence is that it’s highly parallelizable. In other words, the attention mechanism is performed in parallel for each word in the sentence to obtain their updated features in one shot. This is a big advantage for Transformers over RNNs, which update features wordbyword. In other words, Transformerbased deep learning models don’t require sequential data to be processed in order, allowing for much more parallelization and reduced training time on GPUs than RNNs.
Single head attention revisited

We already walked through a conceptual illustration of attention in Attention As Matrix Multiplication above. The actual implementation is a little messier, but our earlier intuition is still helpful. The queries and the keys are no longer easy to inspect and interpret because they are all projected down onto their own idiosyncratic subspaces. In our conceptual illustration, one row in the queries matrix represents one point in the vocabulary space, which, thanks the onehot representation, represents one and only one word. In their embedded form, one row in the queries matrix represents one point in the embedded space, which will be near a group of words with similar meanings and usage. The conceptual illustration mapped one query word to a set of keys, which in turn filtered out all the values that are not being attended to. Each attention head in the actual implempentation maps a query word to a point in yet another lowerdimensional embedded space. The result of this that that attention becomes a relationship between word groups, rather than between individual words. It takes advantage of semantic similarities (closeness in the embedded space) to generalize what it has learned about similar words.

Following the shape of the matrices through the attention calculation helps to track what it’s doing (source):
 The queries and keys matrices, \(Q\) and \(K\), both come in with shape \([n \times d_k]\). Thanks to \(K\) being transposed before multiplication, the result of \(Q K^T\) gives a matrix of \([n \times d_k] * [d_k \times n] = [n \times n]\). Dividing every element of this matrix by the square root of \(d_k\) has been shown to keep the magnitude of the values from growing wildly, and helps backpropagation to perform well. The softmax, as we mentioned, shoehorns the result into an approximation of an argmax, tending to focus attention one element of the sequence more than the rest. In this form, the \([n \times n]\) attention matrix roughly maps each element of the sequence to one other element of the sequence, indicating what it should be watching in order to get the most relevant context for predicting the next element. It is a filter that finally gets applied to the values matrix \(V\), leaving only a collection of the attended values. This has the effect of ignoring the vast majority of what came before in the sequence, and shines a spotlight on the one prior element that is most useful to be aware of.
The attention function can be described as mapping a query and a set of keyvalue pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key (also called the “alignment” function in Bengio’s original paper that introduced attention in the context of neural networks).

One tricky part about understanding this set of calculations is keeping in mind that it is calculating attention for every element of our input sequence, for every word in our sentence, not just the most recent word. It’s also calculating attention for earlier words. We don’t really care about these because their next words have already been predicted and established. It’s also calculating attention for future words. These don’t have much use yet, because they are too far out and their immediate predecessors haven’t yet been chosen. But there are indirect paths through which these calculations can effect the attention for the most recent word, so we include them all. It’s just that when we get to the end and calculate word probabilities for every position in the sequence, we throw away most of them and only pay attention to the next word.

The masking block enforces the constraint that, at least for this sequence completion task, we can’t look into the future. It avoids introducing any weird artifacts from imaginary future words. It is crude and effective – manually set the attention paid to all words past the current position to negative infinity to prevent attention to subsequent positions. In The Annotated Transformer, an immeasurably helpful companion to the paper showing line by line Python implementation, the mask matrix is visualized. Purple cells show where attention is disallowed. Each row corresponds to an element in the sequence. The first row is allowed to attend to itself (the first element), but to nothing after. The last row is allowed to attend to itself (the final element) and everything that comes before. The Mask is an \([n \times n]\) matrix. It is applied not with a matrix multiplication, but with a more straightforward elementbyelement multiplication. This has the effect of manually going in to the attention matrix and setting all of the purple elements from the mask to negative infinity. The following diagram shows an attention mask for sequence completion (source):
 Another important difference in how attention is implemented is that it makes use of the order in which words are presented to it in the sequence, and represents attention not as a wordtoword relationship, but as a positiontoposition relationship. This is evident in its \([n \times n]\) shape. It maps each element from the sequence, indicated by the row index, to some other element(s) of the sequence, indicated by the column index. This helps us to visualize and interpret what it is doing more easily, since it is operating in the embedding space. We are spared the extra step of finding nearby word in the embedding space to represent the relationships between queries and keys.
Putting it all together
 The following infographic (source) provides a quick overview of the constituent steps to calculate attention.
 As indicated in the section on Contextualized Word Embeddings, Attention enables contextualized word embeddings by allowing the model to selectively focus on different parts of the input sequence when making predictions. Put simply, the attention mechanism allows the transformer to dynamically weigh the importance of different parts of the input sequence based on the current task and context.
 In an attentionbased model like the transformer, the word embeddings are combined with attention weights that are learned during training. These weights indicate how much attention should be given to each word in the input sequence when making predictions. By dynamically adjusting the attention weights, the model can focus on different parts of the input sequence and better capture the context in which a word appears. As the paper states, the attention mechanism is what has revolutionized Transformers to what we see them to be today.
 Upon encoding a word as an embedding vector, we can also encode the position of that word in the input sentence as a vector (positional embeddings), and add it to the word embedding. This way, the same word at a different position in a sentence is encoded differently.
 The attention mechanism works with the inclusion of three vectors: key, query, value. Attention is the mapping between a query and a set of keyvalue pairs to an output. We start off by taking a dot product of query and key vectors to understand how similar they are. Next, the Softmax function is used to normalize the similarities of the resulting querykey vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. (source)
 Thus, the basis behind the concept of attention is: “How much attention a word should pay to another word in the input to understand the meaning of the sentence?”
 As indicated in the section on Attention Calculation, one of the benefits of selfattention over recurrence is that it’s highly parallelizable. In other words, the attention mechanism is performed in parallel for each word in the sentence to obtain their updated features in one shot. Furthermore, learning longterm/longrange dependencies in sequences is another benefit.
Coding up selfattention
Single Input
 To ensure that the matrix multiplications in the scaled dotproduct attention function are valid, we need to add assertions to check the shapes of \(Q\), \(K\), and \(V\). Specifically, after transposing \(K\), the last dimension of \(Q\) should match the first dimension of \(K^T\) for the multiplication \(Q * K^T\) to be valid. Similarly, for the multiplication of the attention weights and \(V\), the last dimension of the attention weights should match the first dimension of \(V\).
 Here’s the updated code with these assertions:
import numpy as np
from scipy.special import softmax
def scaled_dot_product_attention_single(Q: np.ndarray, K: np.ndarray, V: np.ndarray) > np.ndarray:
"""
Implements scaled dotproduct attention for a single input using NumPy.
Includes shape assertions for valid matrix multiplications.
Parameters:
Q (np.ndarray): Query array of shape [seq_len, d_q].
K (np.ndarray): Key array of shape [seq_len, d_k].
V (np.ndarray): Value array of shape [seq_len, d_v].
Returns:
np.ndarray: Output array of the attention mechanism.
"""
# Ensure the last dimension of Q matches the first dimension of K^T
assert Q.shape[1] == K.shape[1], "The last dimension of Q must match the first dimension of K^T"
# Ensure the last dimension of attention weights matches the first dimension of V
assert K.shape[0] == V.shape[0], "The last dimension of K must match the first dimension of V"
d_k = Q.shape[1] # Dimension of the key vectors
# Calculate dot products of Q with K^T and scale
scores = np.matmul(Q, K^T) / np.sqrt(d_k)
# Apply softmax to get attention weights
attn_weights = softmax(scores, axis=1)
# Multiply by V to get output
output = np.matmul(attn_weights, V)
return output
# Test with sample input
def test_with_sample_input():
# Sample inputs
Q = np.array([[1, 0], [0, 1]])
K = np.array([[1, 0], [0, 1]])
V = np.array([[1, 2], [3, 4]])
# Function output
output = scaled_dot_product_attention_single(Q, K, V)
# Manually calculate expected output
d_k = Q.shape[1]
scores = np.matmul(Q, K^T) / np.sqrt(d_k)
attn_weights = softmax(scores, axis=1)
expected_output = np.matmul(attn_weights, V)
 Explanation:
 Two assertions are added:
 \(Q\) and \(K^T\) Multiplication: Checks that the last dimension of \(Q\) matches the first dimension of \(K^T\) (or the last dimension of \(K\)).
 Attention Weights and \(V\) Multiplication: Ensures that the last dimension of \(K\) (or \(K^T\)) matches the first dimension of \(V\), as the shape of the attention weights will align with the shape of \(K^T\) after softmax.
 Note that these shape checks are critical for the correctness of matrix multiplications involved in the attention mechanism. By adding these assertions, we ensure the function handles inputs with appropriate dimensions, avoiding runtime errors due to invalid matrix multiplications.
 Two assertions are added:
Batch Input
 In the batched version, the inputs \(Q\), \(K\), and \(V\) will have shapes
[batch_size, seq_len, feature_size]
. The function then needs to perform operations on each item in the batch independently.
import numpy as np
from scipy.special import softmax
def scaled_dot_product_attention_batch(Q: np.ndarray, K: np.ndarray, V: np.ndarray) > np.ndarray:
"""
Implements scaled dotproduct attention for batch input using NumPy.
Includes shape assertions for valid matrix multiplications.
Parameters:
Q (np.ndarray): Query array of shape [batch_size, seq_len, d_q].
K (np.ndarray): Key array of shape [batch_size, seq_len, d_k].
V (np.ndarray): Value array of shape [batch_size, seq_len, d_v].
Returns:
np.ndarray: Output array of the attention mechanism.
"""
# Ensure batch dimensions of Q, K, V match
assert Q.shape[0] == K.shape[0] == V.shape[0], "Batch dimensions of Q, K, V must match"
# Ensure the last dimension of Q matches the last dimension of K
assert Q.shape[1] == K.shape[1], "The last dimension of Q must match the last dimension of K"
# Ensure the last dimension of K matches the last dimension of V
assert K.shape[1] == V.shape[1], "The first dimension of K must match the first dimension of V"
d_k = Q.shape[1]
# Calculate dot products of Q with K^T for each batch and scale
scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
# Apply softmax to get attention weights for each batch
attn_weights = softmax(scores, axis=1)
# Multiply by V to get output for each batch
output = np.matmul(attn_weights, V)
return output
# Example test case for batched input
def test_with_batch_input():
batch_size, seq_len, feature_size = 2, 3, 4
Q_batch = np.random.randn(batch_size, seq_len, feature_size)
K_batch = np.random.randn(batch_size, seq_len, feature_size)
V_batch = np.random.randn(batch_size, seq_len, feature_size)
output = scaled_dot_product_attention_batch(Q_batch, K_batch, V_batch)
assert output.shape == (batch_size, seq_len, feature_size), "Output shape is incorrect for batched input"
 Explanation:
 The function now expects inputs with an additional batch dimension at the beginning.
 The shape assertions are updated to ensure that the batch dimensions of \(Q\), \(K\), and \(V\) match, and the feature dimensions are compatible for matrix multiplication.
 Matrix multiplications (
np.matmul
) and the softmax operation are performed independently for each item in the batch.  The test case
test_with_batch_input
demonstrates how to use the function with batched input and checks if the output shape is correct.
Averaging is equivalent to uniform attention
 On a side note, it is worthwhile noting that the averaging operation is equivalent to uniform attention with the weights being all equal to \(\frac{1}{n}\), where \(n\) is the number of words in the input sequence. In other words, averaging is simply a special case of attention.
Activation Functions
 The transformer does not use an activation function following the multihead attention layer, but does use the ReLU activation sandwiched between the two positionwise fullyconnected layers that form the feedforward network. Put simply, the a fully connected feedforward network in the transformer blocks consists of two linear transformations with a ReLU activation in between.
 The reason behind this goes back to the purpose of selfattention. The measure between wordvectors is generally computed through cosinesimilarity because in the dimensions word tokens exist, it’s highly unlikely for two words to be collinear even if they are trained to be closer in value if they are similar. However, two trained tokens will have higher cosinesimilarity if they are semantically closer to each other than two completely unrelated words.
 This fact is exploited by the selfattention mechanism; after several of these matrix multiplications, the dissimilar words will zero out or become negative due to the dot product between them, and the similar words will stand out in the resulting matrix.
 Thus, selfattention can be viewed as a weighted average, where less similar words become averaged out faster (toward the zero vector, on average), thereby achieving groupings of important and unimportant words (i.e. attention). The weighting happens through the dot product. If input vectors were normalized, the weights would be exactly the cosine similarities.
 The important thing to take into consideration is that within the selfattention mechanism, there are no inherent parameters; those linear operations are just there to capture the relationship between the different vectors by using the properties of the vectors used to represent them, leading to attention weights.
Attention in Transformers: What’s new and what’s not?
 The seq2seq encoderdecoder architecture that Vaswani et al. used is an idea adopted from one of Bengio’s papers, Neural Machine Translation by Jointly Learning to Align and Translate.
 Further, Transformers use scaled dotproduct attention (based on Query, Key and Value matrices) which is a concept inspired from the field of information retrieval (note that Bengio’s seq2seq architecture group used Bahdanau attention in Neural Machine Translation by Jointly Learning to Align and Translate which is a more relatively basic form of attention compared to what Transformers use).
 However, what’s novel about Transformers is that Vaswani et al. applied attention to the encoder as well (along with applying (cross)attention at the decoder, similar to how Bengio’s group did it in Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation – thereby leading to the concept of “selfattention”, which is unique to Transformers.
Calculating \(Q\), \(K\), and \(V\) matrices in the Transformer architecture
 Each word is embedded into a vector of size 512 and is fed into the bottommost encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that is directly below. The size of this list is hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
 In the selfattention layers, multiplying the input vector (which is the word embedding for the first block of the encoder/decoder stack, while the output of the previous block for subsequent blocks) by the attention weights matrix (which are the \(Q\), \(K\), and \(V\) matrices stacked horizontally) and adding a bias vector afterwards results in a concatenated key, value, and query vector for this token. This long vector is split to form the \(q\), \(k\), and \(v\) vectors for this token (which actually represent the concatenated output for multiple attention heads and is thus, further reshaped into \(q\), \(k\), and \(v\) outputs for each attention head — more on this in the section on Multihead Attention). From Jay Alammar’s: The Illustrated GPT2:
Optimizing Performance with the KV Cache
 Using a KV cache is one of the most commonly used tricks for speeding up inference with Transformerbased models, particularly employed with LLMs. Let’s unveil its inner workings.
 Autoregressive decoding process: When we perform inference with an LLM, it follows an autoregressive decoding process. Put simply, this means that we (i) start with a sequence of textual tokens, (ii) predict the next token, (iii) add this token to our input, and (iv) repeat until generation is finished.
 Causal selfattention: Selfattention within a language model is causal, meaning that each token only considers itself and prior tokens when computing its representation (i.e., NOT future tokens). As such, representations for each token do not change during autoregressive decoding! We need to compute the representation for each new token, but other tokens remain fixed (i.e., because they don’t depend on tokens that follow them).
 Caching selfattention values: When we perform selfattention, we project our sequence of tokens using three separate, linear projections: key projection, value projection, and query projection. Then, we execute selfattention using the resulting matrices. The KVcache simply stores the results of the key and value projections for future decoding iterations so that we don’t recompute them every time!
 Why not cache the query? So why are the key and value projections cached, but not the query? This is simply because the entries in the query matrix are only needed to compute the representations of prior tokens in the sequence (whose key and value representations are already stored in the KVCache). At each timestep, the new query input consists of the token at that timestep and all prior tokens (i.e., the entire sequence up to that point). For computing the representation of query representation for the most recent token, we only need access to the most recent row in the query matrix.
 Updates to the KV cache: Throughout autoregressive decoding, we have the key and value projections cached. Each time we get a new token in our input, we simply compute the new rows as part of selfattention and add them to the KV cache. Then, we can use the query projection for the new token and the updated key and value projections to perform the rest of the forward pass.
 Latency optimization: KVcaching decreases the latency to the next token in an autoregressive setting starting from the second token. Since the prompt tokens are not cached at the beginning of the generation, time to the first token is high, but as KVcaching kicks in for further generation, latency reduces. In other words, KVcaching is the reason why the latency of the first token’s generation (from the time the input prompt is fed in) is higher than that of consecutive tokens.
 Scaling to Multihead Selfattention: Here, we have considered singlehead selfattention for simplicity. However, it’s important to note that the same exact process applies to the multihead selfattention used by LLMs (detailed in the MultiHead Attention section below). We just perform the exact same process in parallel across multiple attention heads.
 More on the KV cache in the Model Acceleration primer.
Applications of Attention in Transformers
 From the paper, the Transformer uses multihead attention in three different ways:
 The encoder contains selfattention layers. In a selfattention layer, all of the keys, values, and queries are derived from the same source, which is the word embedding for the first block of the encoder stack, while the output of the previous block for subsequent blocks. Each position in the encoder can attend to all positions in the previous block of the encoder.
 Similarly, selfattention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the autoregressive property. We implement this inside of scaled dotproduct attention by masking out all values (by setting to a very low value, such as \(−\infty\)) in the input of the softmax which correspond to illegal connections.
 In “encoderdecoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoderdecoder attention mechanisms in sequencetosequence models such as Neural Machine Translation by Jointly Learning to Align and Translate, Google’s neural machine translation system: Bridging the gap between human and machine translation, and Convolutional Sequence to Sequence Learning.
MultiHead Attention
 Let’s confront some of the simplistic assumptions we made during our first pass through explaining the attention mechanism. Words are represented as dense embedded vectors, rather than onehot vectors. Attention isn’t just 1 or 0, on or off, but can also be anywhere in between. To get the results to fall between 0 and 1, we use the softmax trick again. It has the dual benefit of forcing all the values to lie in our [0, 1] attention range, and it helps to emphasize the highest value, while aggressively squashing the smallest. It’s the differential almostargmax behavior we took advantage of before when interpreting the final output of the model.
 An complicating consequence of putting a softmax function in attention is that it will tend to focus on a single element. This is a limitation we didn’t have before. Sometimes it’s useful to keep several of the preceding words in mind when predicting the next, and the softmax just robbed us of that. This is a problem for the model.
 To address the above issues, the Transformer paper refined the selfattention layer by adding a mechanism called “multihead” attention. This improves the performance of the attention layer in two ways:
 It expands the model’s ability to focus on different positions. It would be useful if we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, we would want to know which word “it” refers to.
 It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multihead attention we have not only one, but multiple sets of \(Q, K, V\) weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
 Further, getting the straightforward dotproduct attention mechanism to work can be tricky. Bad random initializations of the learnable weights can destabilize the training process.
 Multiple heads lets the the transformer consider several previous words simultaneously when predicting the next. It brings back the power we had before we pulled the softmax into the picture.
 To fix the aforementioned issues, we can run multiple ‘heads’ of attention in parallel and concatenate the result (with each head now having separate learnable weights).
 To accomplish multihead attention, selfattention is simply conducted multiple times on different parts of the \(Q, K, V\) matrices (each part corresponding to each attention head). Each \(q\), \(k\), and \(v\) vector generated at the output contains concatenated output corresponding to contains each attention head. To obtain the output corresponding to each attention heads, we simply reshape the long \(q\), \(k\), and \(v\) selfattention vectors into a matrix (with each row corresponding to the output of each attention head). From Jay Alammar’s: The Illustrated GPT2:

Mathematically,
\[\begin{array}{c} h_{i}^{\ell+1}=\text {Concat }\left(\text {head }_{1}, \ldots, \text { head}_{K}\right) O^{\ell} \\ \text { head }_{k}=\text {Attention }\left(Q^{k, \ell} h_{i}^{\ell}, K^{k, \ell} h_{j}^{\ell}, V^{k, \ell} h_{j}^{\ell}\right) \end{array}\] where \(Q^{k, \ell}, K^{k, \ell}, V^{k, \ell}\) are the learnable weights of the \(k^{\prime}\)th attention head and \(O^{\ell}\) is a downprojection to match the dimensions of \(h_{i}^{\ell+1}\) and \(h_{i}^{\ell}\) across layers.

Multiple heads allow the attention mechanism to essentially ‘hedge its bets’, looking at different transformations or aspects of the hidden features from the previous layer. More on this in the section on Why Multiple Heads of Attention? Why Attention?.
Managing computational load due to multihead attention
 Unfortunately, multihead attention really increases the computational load. Computing attention was already the bulk of the work, and we just multiplied it by however many heads we want to use. To get around this, we can reuse the trick of projecting everything into a lowerdimensional embedding space. This shrinks the matrices involved which dramatically reduces the computation time.
 To see how this plays out, we can continue looking at matrix shapes. Tracing the matrix shape through the branches and weaves of the multihead attention blocks requires three more numbers.

The \([n \times d_{model}]\) sequence of embedded words serves as the basis for everything that follows. In each case there is a matrix, \(W_v\), \(W_q\),, and \(W_k\), (all shown unhelpfully as “Linear” blocks in the architecture diagram) that transforms the original sequence of embedded words into the values matrix, \(V\), the queries matrix, \(Q\), and the keys matrix, \(K\). \(K\) and \(Q\) have the same shape, \([n \times d_k]\), but \(V\) can be different, \([n \times d_v]\). It confuses things a little that \(d_k\) and \(d_v\) are the same in the paper, but they don’t have to be. An important aspect of this setup is that each attention head has its own \(W_v\), \(W_q\), and \(W_k\) transforms. That means that each head can zoom in and expand the parts of the embedded space that it wants to focus on, and it can be different than what each of the other heads is focusing on.

The result of each attention head has the same shape as \(V\). Now we have the problem of h different result vectors, each attending to different elements of the sequence. To combine these into one, we exploit the powers of linear algebra, and just concatenate all these results into one giant \([n \times h * d_v]\) matrix. Then, to make sure it ends up in the same shape it started, we use one more transform with the shape \([h * d_v \times d_{model}]\).

Here’s all of the that from the paper, stated tersely.
\[\begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\operatorname{head}_{1}, \ldots, \text { head }_{\mathrm{h}}\right) W^{O} \\ \text { where head } &=\operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}\] where the projections are parameter matrices \(W_{i}^{Q} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, W_{i}^{K} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, W_{i}^{V} \in \mathbb{R}^{d_{\text {model }} \times d_{v}}\) and \(W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text {model }}}\).
Why have multiple attention heads?
 Per Eugene Yan’s Some Intuition on Attention and the Transformer blog, multiple heads lets the model consider multiple words simultaneously. Because we use the softmax function in attention, it amplifies the highest value while squashing the lower ones. As a result, each head tends to focus on a single element.
 Consider the sentence: “The chicken crossed the road carelessly”. The following words are relevant to “crossed” and should be attended to:
 The “chicken” is the subject doing the crossing.
 The “road” is the object being crossed.
 The crossing is done “carelessly”.
 If we had a single attention head, we might only focus on a single word, either “chicken”, “road”, or “crossed”. Multiple heads let us attend to several words. It also provides redundancy, where if any single head fails, we have the other attention heads to rely on.
CrossAttention

The final step in getting the full transformer up and running is the connection between the encoder and decoder stacks, the cross attention block. We’ve saved it for last and, thanks to the groundwork we’ve laid, there’s not a lot left to explain.

Crossattention works just like selfattention with the exception that the key matrix \(K\) and value matrix \(V\) are based on the output of the encoder stack (i.e., the final encoder layer), rather than the output of the previous decoder layer. The query matrix \(Q\) is still calculated from the results of the previous decoder layer. This is the channel by which information from the source sequence makes its way into the target sequence and steers its creation in the right direction. It’s interesting to note that the same embedded source sequence (output from the final layer in the encoder stack) is provided to every layer of the decoder, supporting the notion that successive layers provide redundancy and are all cooperating to perform the same task. The following diagram highlights the crossattention piece within the transformer architecture.
Dropout
 Per the original Transformer paper, dropout is applied to the output of each “sublayer” (where a “sublayer” refers to the self/cross multihead attention layers as well as the positionwise feedfoward networks.), before it is added to the sublayer input and normalized. In addition, it is also applied dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, the original Transformer use a rate of \(P_{drop} = 0.1\).
 Thus, from a code perspective, the sequence of actions can be summarized as follows:
x2 = SubLayer(x)
x2 = torch.nn.dropout(x2, p=0.1)
x = nn.LayerNorm(x2 + x)
 For more details, please refer The Annotated Transformer.
Skip connections

Attention is the most fundamental part of what transformers do. It’s the core mechanism, and we have now traversed it had a pretty concrete level. Everything from here on out is the plumbing necessary to make it work well. It’s the rest of the harness that lets attention pull our heavy workloads.

One piece we haven’t explained yet are skip connections. These occur around the MultiHead Attention blocks, and around the element wise Feed Forward blocks in the blocks labeled “Add and Norm”. In skip connections, a copy of the input is added to the output of a set of calculations. The inputs to the attention block are added back in to its output. The inputs to the elementwise feed forward block are added to its outputs. The following diagram shows the Transformer architecture showing add and norm blocks.
 Skip connections serve two purposes:
 They help keep the gradient smooth, which is a big help for backpropagation. Attention is a filter, which means that when it’s working correctly it will block most of what tries to pass through it. The result of this is that small changes in a lot of the inputs may not produce much change in the outputs if they happen to fall into channels that are blocked. This produces dead spots in the gradient where it is flat, but still nowhere near the bottom of a valley. These saddle points and ridges are a big tripping point for backpropagation. Skip connections help to smooth these out. In the case of attention, even if all of the weights were zero and all the inputs were blocked, a skip connection would add a copy of the inputs to the results and ensure that small changes in any of the inputs will still have noticeable changes in the result. This keeps gradient descent from getting stuck far away from a good solution. Skip connections have become popular because of how they improve performance since the days of the ResNet image classifier. They are now a standard feature in neural network architectures. The figure below (source) shows the effect that skip connections have by comparing a ResNet with and without skip connections. The slopes of the loss function hills are are much more moderate and uniform when skip connections are used. If you feel like taking a deeper dive into how the work and why, there’s a more indepth treatment in this post. The following diagram shows the comparison of loss surfaces with and without skip connections.
 The second purpose of skip connections is specific to transformers — preserving the original input sequence. Even with a lot of attention heads, there’s no guarantee that a word will attend to its own position. It’s possible for the attention filter to forget entirely about the most recent word in favor of watching all of the earlier words that might be relevant. A skip connection takes the original word and manually adds it back into the signal, so that there’s no way it can be dropped or forgotten. This source of robustness may be one of the reasons for transformers’ good behavior in so many varied sequence completion tasks.
Why have skip connections?
 Per Eugene Yan’s Some Intuition on Attention and the Transformer blog, because attention acts as a filter, it blocks most information from passing through. As a result, a small change to the inputs of the attention layer may not change the outputs, if the attention score is tiny or zero. This can lead to flat gradients or local optima.
 Skip connections help dampen the impact of poor attention filtering. Even if an input’s attention weight is zero and the input is blocked, skip connections add a copy of that input to the output. This ensures that even small changes to the input can still have noticeable impact on the output. Furthermore, skip connections preserve the input sentence: There’s no guarantee that a context word will attend to itself in a transformer. Skip connections ensure this by taking the context word vector and adding it to the output.
Layer normalization
 Normalization is a step that pairs well with skip connections. There’s no reason they necessarily have to go together, but they both do their best work when placed after a group of calculations, like attention or a feed forward neural network.
 The short version of layer normalization is that the values of the matrix are shifted to have a mean of zero and scaled to have a standard deviation of one. The following diagram shows several distributions being normalized.
 The longer version is that in systems like transformers, where there are a lot of moving pieces and some of them are something other than matrix multiplications (such as softmax operators or rectified linear units), it matters how big values are and how they’re balanced between positive and negative. If everything is linear, you can double all your inputs, and your outputs will be twice as big, and everything will work just fine. Not so with neural networks. They are inherently nonlinear, which makes them very expressive but also sensitive to signals’ magnitudes and distributions. Normalization is a technique that has proven useful in maintaining a consistent distribution of signal values each step of the way throughout manylayered neural networks. It encourages convergence of parameter values and usually results in much better performance.
 To understand the different types of normalization techniques, please refer Normalization Methods which includes batch normalization, a close cousin of the layer normalization used in transformers.
Softmax

The argmax function is “hard” in the sense that the highest value wins, even if it is only infinitesimally larger than the others. If we want to entertain several possibilities at once, it’s better to have a “soft” maximum function, which we get from softmax. To get the softmax of the value \(x\) in a vector, divide the exponential of \(x\), \(e^x\), by the sum of the exponentials of all the values in the vector. This converts the (unnormalized) logits/energy values into (normalized) probabilities \(\in [0, 1]\), with all summing up to 1.

The softmax is helpful here for three reasons. First, it converts our deembedding results vector from an arbitrary set of values to a probability distribution. As probabilities, it becomes easier to compare the likelihood of different words being selected and even to compare the likelihood of multiword sequences if we want to look further into the future.

Second, it thins the field near the top. If one word scores clearly higher than the others, softmax will exaggerate that difference (owing to the “exponential” operation), making it look almost like an argmax, with the winning value close to one and all the others close to zero. However, if there are several words that all come out close to the top, it will preserve them all as highly probable, rather than artificially crushing close second place results, which argmax is susceptible to. You might be thinking what the difference between standard normalization and softmax is – after all, both rescale the logits between 0 and 1. By using softmax, we are effectively “approximating” argmax as indicated earlier while gaining differentiability. Rescaling doesn’t weigh the max significantly higher than other logits, whereas softmax does due to its “exponential” operation. Simply put, softmax is a “softer” argmax.

Third, softmax is differentiable, meaning we can calculate how much each element of the results will change, given a small change in any of the input elements. This allows us to use it with backpropagation to train our transformer.

Together the deembedding transform (shown as the Linear block below) and a softmax function complete the deembedding process. The following diagram shows the deembedding steps in the architecture diagram (source: Transformers paper).
Stacking Transformer Layers
 While we were laying the foundations above, we showed that an attention block and a feed forward block with carefully chosen weights were enough to make a decent language model. Most of the weights were zeros in our examples, a few of them were ones, and they were all hand picked. When training from raw data, we won’t have this luxury. At the beginning the weights are all chosen randomly, most of them are close to zero, and the few that aren’t probably aren’t the ones we need. It’s a long way from where it needs to be for our model to perform well.
 Stochastic gradient descent through backpropagation can do some pretty amazing things, but it relies a lot on trialanderror. If there is just one way to get to the right answer, just one combination of weights necessary for the network to work well, then it’s unlikely that it will find its way. But if there are lots of paths to a good solution, chances are much better that the model will get there.
 Having a single attention layer (just one multihead attention block and one feed forward block) only allows for one path to a good set of transformer parameters. Every element of every matrix needs to find its way to the right value to make things work well. It is fragile and brittle, likely to get stuck in a farfromideal solution unless the initial guesses for the parameters are very very lucky.
 The way transformers sidestep this problem is by having multiple attention layers, each using the output of the previous one as its input. The use of skip connections make the overall pipeline robust to individual attention blocks failing or giving wonky results. Having multiples means that there are others waiting to take up the slack. If one should go off the rails, or in any way fail to live up to its potential, there will be another downstream that has another chance to close the gap or fix the error. The paper showed that more layers resulted in better performance, although the improvement became marginal after 6.
 Another way to think about multiple layers is as a conveyor belt assembly line. Each attention block and feedforward block has the chance to pull inputs off the line, calculate useful attention matrices and make next word predictions. Whatever results they produce, useful or not, get added back onto the conveyer, and passed to the next layer. The following diagram shows the transformer redrawn as a conveyor belt:
 This is in contrast to the traditional description of manylayered neural networks as “deep”. Thanks to skip connections, successive layers don’t provide increasingly sophisticated abstraction as much as they provide redundancy. Whatever opportunities for focusing attention and creating useful features and making accurate predictions were missed in one layer can always be caught by the next. Layers become workers on the assembly line, where each does what it can, but doesn’t worry about catching every piece, because the next worker will catch the ones they miss.
Why have multiple attention layers?
 Per Eugene Yan’s Some Intuition on Attention and the Transformer blog, multiple attention layers builds in redundancy (on top of having multiple attention heads). If we only had a single attention layer, that attention layer would have to do a flawless job—this design could be brittle and lead to suboptimal outcomes. We can address this via multiple attention layers, where each one uses the output of the previous layer with the safety net of skip connections. Thus, if any single attention layer messed up, the skip connections and downstream layers can mitigate the issue.
 Stacking attention layers also broadens the model’s receptive field. The first attention layer produces context vectors by attending to interactions between pairs of words in the input sentence. Then, the second layer produces context vectors based on pairs of pairs, and so on. With more attention layers, the Transformer gains a wider perspective and can attend to multiple interaction levels within the input sentence.
Transformer Encoder and Decoder
 The Transformer model has two parts: encoder and decoder. Both encoder and decoder are mostly identical (with a few differences) and are comprised of a stack of transformer blocks. Each block is comprised of a combination of multihead attention blocks, positional feedforward layers, residual connections and layer normalization blocks.
 The attention layers from the encoder and decoder have the following differences:
 The encoder only has selfattention blocks while the decoder has a crossattention encoderdecoder layer sandwiched between the selfattention layer and the feedforward neural network.
 Also, the selfattention blocks are masked to ensure causal predictions (i.e., the prediction of token \(N\) only depends on the previous \(N  1\) tokens, and not on the future ones).
 Each of the encoding/decoding blocks contains many stacked encoders/decoder transformer blocks. The Transformer encoder is a stack of six encoders, while the decoder is a stack of six decoders. The initial layers capture more basic patterns (broadly speaking, basic syntactic patterns), whereas the last layers can detect more sophisticated ones, similar to how convolutional networks learn to look for lowlevel features such as edges and blobs of color in the initial layers while the mid layers focus on learning highlevel features such as object shapes and textures the later layers focus on detecting the entire objects themselves (using textures, shapes and patterns learnt from earlier layers as building blocks).
 The six encoders and decoders are identical in structure but do not share weights. Check weights shared by different parts of a transformer model for a detailed discourse on weight sharing opportunities within the Transformer layers.
 For more on the pros and cons of the encoder and decoder stack, refer Autoregressive vs. Autoencoder Models.
Decoder stack
The decoder, which follows the autoregressive property, i.e., consumes the tokens generated so far to generate the next one, is used standalone for generation tasks, such as tasks in the domain of natural language generation (NLG), for e.g., such as summarization, translation, or abstractive question answering. Decoder models are typically trained with an objective of predicting the next token, i.e., “autoregressive blank infilling”.
 As we laid out in the section on Sampling a Sequence of Output Words, the decoder can complete partial sequences and extend them as far as you want. OpenAI created the generative pretraining (GPT) family of models to do just this, by training on a predictingthenexttoken objective. The architecture they describe in this report should look familiar. It is a transformer with the encoder stack and all its connections surgically removed. What remains is a 12 layer decoder stack. The following diagram from the GPT1 paper Improving Language Understanding by Generative PreTraining shows the architecture of the GPT family of models:
 Any time you come across a generative/autoregressive model, such as GPTX, LLaMA, Copilot, etc., you’re probably seeing the decoder half of a transformer in action.
Encoder stack
The encoder, is typically used standalone for content understanding tasks, such as tasks in the domain of natural language understanding (NLU) that involve classification, for e.g., sentiment analysis, or extractive question answering. Encoder models are typically trained with a “fill in the blanks”/”blank infilling” objective – reconstructing the original data from masked/corrupted input (i.e., by randomly sampling tokens from the input and replacing them with
[MASK]
elements, or shuffling sentences in random order if it’s the next sentence prediction task). In that sense, an encoder can be thought of as an autoencoder which seeks to denoise a partially corrupted input, i.e., “Denoising Autoencoder” (DAE) and aim to recover the original undistorted input.
 Almost everything we’ve learned about the decoder applies to the encoder too. The biggest difference is that there’s no explicit predictions being made at the end that we can use to judge the rightness or wrongness of its performance. Instead, the end product of an encoder stack is an abstract representation in the form of a sequence of vectors in an embedded space. It has been described as a pure semantic representation of the sequence, divorced from any particular language or vocabulary, but this feels overly romantic to me. What we know for sure is that it is a useful signal for communicating intent and meaning to the decoder stack.

Having an encoder stack opens up the full potential of transformers instead of just generating sequences, they can now translate (or transform) the sequence from one language to another. Training on a translation task is different than training on a sequence completion task. The training data requires both a sequence in the language of origin, and a matching sequence in the target language. The full language of origin is run through the encoder (no masking this time, since we assume that we get to see the whole sentence before creating a translation) and the result, the output of the final encoder layer is provided as an input to each of the decoder layers. Then sequence generation in the decoder proceeds as before, but this time with no prompt to kick it off.

Any time you come across an encoder model that generates semantic embeddings, such as BERT, ELMo, etc., you’re likely seeing the encoder half of a transformer in action.
Putting it all together: The Transformer Architecture
 The Transformer architecture combines the individual encoder/decoder models. The encoder takes the input and encodes it into fixedlength query, key, and vector tensors (analogous to the fixedlength context vector in the original paper by Bahdanau et al. (2015)) that introduced attention. These tensors are passed onto the decoder which decodes it into the output sequence.

The encoder (left) and decoder (right) of the transformer is shown below:

Note that the multihead attention in the encoder is the scaled dotproduct multihead self attention, while that in the initial layer in the decoder is the masked scaled dotproduct multihead self attention and the middle layer (which enables the decoder to attend to the encoder) is the scaled dotproduct multihead cross attention.

Redrawn vectorized versions from DAIR.AI are as follows:

 The full model architecture of the transformer – from fig. 1 and 2 in Vaswani et al. (2017) – is as follows:
 Here is an illustrated version of the overall Transformer architecture from Abdullah Al Imran:
 As a walkthrough exercise, the following diagram (source: CS330 slides) shows an sample input sentence “Joe Biden is the US President” being fed in as input to the Transformer. The various transformations that occur as the input vector is processed are:
 Input sequence: \(I\) = “Joe Biden is the US President”.
 Tokenization: \(I \in {\mid \text { vocab } \mid}^{T}\).
 Input embeddings lookup: \(E \in \mathbb{R}^{T \times d}\).
 Inputs to Transformer block: \(X \in \mathbb{R}^{T \times d}\).
 Obtaining three separate linear projections of input \(X\) (queries, keys, and values): \(X_Q=X W_Q, \quad X_K=X W_K, \quad X_V=X W_V\).
 Calculating selfattention: \(A=\operatorname{sm}\left(X_Q X_K^{\top}\right) X_V\) (the scaling part is missing in the figure below – you can reference the section on Types of Attention: Additive, Multiplicative (Dotproduct), and Scaled for more).
 This is followed by a residual connection and LayerNorm.
 Feedforward (MLP) layers which perform two linear transformations/projections of the input with a ReLU activation in between: \(\operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2\)
 This is followed by a residual connection and LayerNorm.
 Output of the Transformer block: \(O \in \mathbb{R}^{T \times d}\).
 Project to vocabulary size at time \(t\): \(p_\theta^t(\cdot) \in \mathbb{R}^{\mid \text {vocab } \mid}\).
Loss function
 The encoder and decoder are jointly trained (“endtoend”) to minimize the crossentropy loss between the predicted probability matrix of shape
output sequence length
$\times$vocab
(right before taking the argmax on the output of the softmax to ascertain the next token to output), and theoutput sequence length
sized output vector of token IDs as the true label.  Effectively, the crossentropy loss “pulls” the predicted probability of the correct class towards 1 during training. This is accomplished by calculating gradients of the loss function w.r.t. the model’s weights; with the model’s sigmoid/softmax output (in case of binary/multiclass classification) serving as the prediction (i.e., the preargmax output is utilized since argmax is not differentiable).
Implementation details
Tokenizing

We made it all the way through the transformer! We covered it in enough detail that there should be no mysterious black boxes left. There are a few implementation details that we didn’t dig into. You would need to know about them in order to build a working version for yourself. These last few tidbits aren’t so much about how transformers work as they are about getting neural networks to behave well. The Annotated Transformer will help you fill in these gaps.

In the section on Onehot encoding, we discussed that a vocabulary could be represented by a high dimensional onehot vector, with one element associated with each word. In order to do this, we need to know exactly how many words we are going to be representing and what they are.

A naïve approach is to make a list of all possible words, like we might find in Webster’s Dictionary. For the English language this will give us several tens of thousands, the exact number depending on what we choose to include or exclude. But this is an oversimplification. Most words have several forms, including plurals, possessives, and conjugations. Words can have alternative spellings. And unless your data has been very carefully cleaned, it will contain typographical errors of all sorts. This doesn’t even touch on the possibilities opened up by freeform text, neologisms, slang, jargon, and the vast universe of Unicode. An exhaustive list of all possible words would be infeasibly long.

A reasonable fallback position would be to have individual characters serve as the building blocks, rather than words. An exhaustive list of characters is well within the capacity we have to compute. However there are a couple of problems with this. After we transform data into an embedding space, we assume the distance in that space has a semantic interpretation, that is, we assume that points that fall close together have similar meanings, and points that are far away mean something very different. That allows us to implicitly extend what we learn about one word to its immediate neighbors, an assumption we rely on for computational efficiency and from which the transformer draws some ability to generalize.

At the individual character level, there is very little semantic content. There are a few one character words in the English language for example, but not many. Emoji are the exception to this, but they are not the primary content of most of the data sets we are looking at. That leaves us in the unfortunate position of having an unhelpful embedding space.

It might still be possible to work around this theoretically, if we could look at rich enough combinations of characters to build up semantically useful sequences like words, words stems, or word pairs. Unfortunately, the features that transformers create internally behave more like a collection of input pairs than an ordered set of inputs. That means that the representation of a word would be a collection of character pairs, without their order strongly represented. The transformer would be forced to continually work with anagrams, making its job much harder. And in fact experiments with character level representations have shown the transformers don’t perform very well with them.
Per OpenAI’s Tokenizer Platform page, a helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly \(\frac{3}{4}\) of a word (so 100 tokens ~= 75 words).
Byte pair encoding (BPE)
 Fortunately, there is an elegant solution to this called byte pair encoding, which is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data.
 Starting with the character level representation, each character is assigned a code, its own unique byte. Then after scanning some representative data, the most common pair of bytes is grouped together and assigned a new byte, a new code. This new code is substituted back into the data, and the process is repeated.
Example
 As an example (credit: Wikipedia: Byte pair encoding), suppose the data to be encoded is:
aaabdaaabac
 The byte pair “aa” occurs most often, so it will be replaced by a byte that is not used in the data, “Z”. Now there is the following data and replacement table:
ZabdZabac
Z=aa
 Then the process is repeated with byte pair “ab”, replacing it with Y:
ZYdZYac
Y=ab
Z=aa
 The only literal byte pair left occurs only once, and the encoding might stop here. Or the process could continue with recursive byte pair encoding, replacing “ZY” with “X”:
XdXac
X=ZY
Y=ab
Z=aa
 This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.
 To decompress the data, simply perform the replacements in the reverse order.
Applying BPE to learn new, rare, and misspelled words

Codes representing pairs of characters can be combined with codes representing other characters or pairs of characters to get new codes representing longer sequences of characters. There’s no limit to the length of character sequence a code can represent. They will grow as long as they need to in order to represent commonly repeated sequences. The cool part of byte pair encoding is that in infers which long sequences of characters to learn from the data, as opposed to dumbly representing all possible sequences. it learns to represent long words like transformer with a single byte code, but would not waste a code on an arbitrary string of similar length, such as
ksowjmckder
. And because it retains all the byte codes for its single character building blocks, it can still represent weird misspellings, new words, and even foreign languages. 
When you use byte pair encoding, you get to assign it a vocabulary size, ad it will keep building new codes until reaches that size. The vocabulary size needs to be big enough, that the character strings get long enough to capture the semantic content of the the text. They have to mean something. Then they will be sufficiently rich to power transformers.

After a byte pair encoder is trained or borrowed, we can use it to preprocess out data before feeding it into the transformer. This breaks it the unbroken stream of text into a sequence of distinct chunks, (most of which are hopefully recognizable words) and provides a concise code for each one. This is the process called tokenization.
Teacher Forcing
 Teacher forcing is a common training technique for sequencetosequence models where, during training, the model is fed with the ground truth (true) target sequence at each time step as input, rather than the model’s own predictions. This helps the model learn faster and more accurately during training because it has access to the correct information at each step.
 Pros: Teacher forcing is essential because it accelerates training convergence and stabilizes learning. By using correct previous tokens as input during training, it ensures the model learns to predict the next token accurately. If we do not use teacher forcing, the hidden states of the model will be updated by a sequence of wrong predictions, errors will accumulate, making it difficult for the model to learn. This method effectively guides the model in learning the structure and nuances of language (especially during early stages of training when the predictions of the model lack coherence), leading to more coherent and contextually accurate text generation.
 Cons: With teacher forcing, when the model is deployed for inference (generating sequences), it typically does not have access to ground truth information and must rely on its own predictions, which can be less accurate. Put simply, during inference, since there is usually no ground truth available, the model will need to feed its own previous prediction back to itself for the next prediction. This discrepancy between training and inference can potentially lead to poor model performance and instability. This is known as “exposure bias” in literature, which can be mitigated using scheduled sampling.
 For more, check out What is Teacher Forcing for Recurrent Neural Networks? and What is Teacher Forcing?.
Scheduled Sampling
 Scheduled sampling is a technique used in sequencetosequence models, particularly in the context of training recurrent neural networks (RNNs) and sequencetosequence models like LSTMs and Transformers. Its primary goal is to address the discrepancy between the training and inference phases that arises due to teacher forcing, and it helps mitigate the exposure bias generated by teacher forcing.
 Scheduled sampling is thus introduced to bridge this “traintest discrepancy” gap between training and inference by gradually transitioning from teacher forcing to using the model’s own predictions during training. Here’s how it works:
 Teacher Forcing Phase:
 In the early stages of training, scheduled sampling follows a schedule where teacher forcing is dominant. This means that the model is mostly exposed to the ground truth target sequence during training.
 At each time step, the model has a high probability of receiving the true target as input, which encourages it to learn from the correct data.
 Transition Phase:
 As training progresses, scheduled sampling gradually reduces the probability of using the true target as input and increases the probability of using the model’s own predictions.
 This transition phase helps the model get accustomed to generating its own sequences and reduces its dependence on the ground truth data.
 Inference Phase:
 During inference (when the model generates sequences without access to the ground truth), scheduled sampling is typically turned off. The model relies entirely on its own predictions to generate sequences.
 Teacher Forcing Phase:
 By implementing scheduled sampling, the model learns to be more robust and capable of generating sequences that are not strictly dependent on teacherforced inputs. This mitigates the exposure bias problem, as the model becomes more capable of handling realworld scenarios where it must generate sequences autonomously.
 In summary, scheduled sampling is a training strategy for sequencetosequence models that gradually transitions from teacher forcing to using the model’s own predictions, helping to bridge the gap between training and inference and mitigating the bias generated by teacher forcing. This technique encourages the model to learn more robust and accurate sequence generation.
Decoder Outputs: Shifted Right
 In the architectural diagram of the Transformer shown below, the output embedding that is “shifted right”. This shifting is done during training, where the decoder is given the correct output at that step (e.g., the translation of a sentence in the original Transformer decoder) as input but shifted one position to the right. This means that the token at each position in the input is the token that should have been predicted at the previous step.
 This shiftright ensures that the prediction for a particular position (say position \(i\)) is only dependent on the known outputs at positions less than \(i\). Essentially, it prevents the model from “cheating” by seeing the correct output for position \(i\) when predicting position \(i\).
Label Smoothing as a Regularizer
 During training, they employ label smoothing which penalizes the model if it gets overconfident about a particular choice. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
 They implement label smoothing using the KL div loss. Instead of using a onehot target distribution, we create a distribution that has a reasonably high confidence of the correct word and the rest of the smoothing mass distributed throughout the vocabulary.
Scaling Issues
 A key issue motivating the final Transformer architecture is that the features for words after the attention mechanism might be at different scales or magnitudes. This can be due to some words having very sharp or very distributed attention weights \(w_{i j}\) when summing over the features of the other words. Scaling the dotproduct attention by the squareroot of the feature dimension helps counteract this issue.
 Additionally, at the individual feature/vector entries level, concatenating across multiple attention headseach of which might output values at different scalescan lead to the entries of the final vector \(h_{i}^{\ell+1}\) having a wide range of values. Following conventional ML wisdom, it seems reasonable to add a normalization layer into the pipeline. As such, Transformers overcome this issue with LayerNorm, which normalizes and learns an affine transformation at the feature level.
 Finally, the authors propose another ‘trick’ to control the scale issue: a positionwise 2layer MLP with a special structure. After the multihead attention, they project \(h_{i}^{\ell+1}\) to a (absurdly) higher dimension by a learnable weight, where it undergoes the ReLU nonlinearity, and is then projected back to its original dimension followed by another normalization:
 Since LayerNorm and scaled dotproducts (supposedly) didn’t completely solve the highlighted scaling issues, the overparameterized feedforward sublayer was utilized. In other words, the big MLP is a sort of hack to rescale the feature vectors independently of each other. According to Jannes Muenchmeyer, the feedforward sublayer ensures that the Transformer is a universal approximator. Thus, projecting to a very high dimensional space, applying a nonlinearity, and reprojecting to the original dimension allows the model to represent more functions than maintaining the same dimension across the hidden layer would. The final picture of a Transformer layer looks like this:
 The Transformer architecture is also extremely amenable to very deep networks, enabling the NLP community to scale up in terms of both model parameters and, by extension, data. Residual connections between the inputs and outputs of each multihead attention sublayer and the feedforward sublayer are key for stacking Transformer layers (but omitted from the diagram for clarity).
The relation between transformers and Graph Neural Networks
GNNs build representations of graphs

Let’s take a step away from NLP for a moment.

Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) build representations of nodes and edges in graph data. They do so through neighbourhood aggregation (or message passing), where each node gathers features from its neighbours to update its representation of the local graph structure around it. Stacking several GNN layers enables the model to propagate each node’s features over the entire graph—from its neighbours to the neighbours’ neighbours, and so on.

Take the example of this emoji social network below (source): The node features produced by the GNN can be used for predictive tasks such as identifying the most influential members or proposing potential connections.

In their most basic form, GNNs update the hidden features \(h\) of node \(i\) (for example, 😆) at layer \(\ell\) via a nonlinear transformation of the node’s own features \(h_{i}^{\ell}\) added to the aggregation of features \(h_{j}^{\ell}\) from each neighbouring node \(j \in \mathcal{N}(i)\):
\[h_{i}^{\ell+1}=\sigma\left(U^{\ell} h_{i}^{\ell}+\sum_{j \in \mathcal{N}(i)}\left(V^{\ell} h_{j}^{\ell}\right)\right)\] where \(U^{\ell}, V^{\ell}\) are learnable weight matrices of the GNN layer and \(\sigma\) is a nonlinear function such as ReLU. In the example, (😆) {😘, 😎, 😜, 🤩}.

The summation over the neighbourhood nodes \(j \in \mathcal{N}(i)\) can be replaced by other input sizeinvariant aggregation functions such as simple mean/max or something more powerful, such as a weighted sum via an attention mechanism.

Does that sound familiar? Maybe a pipeline will help make the connection (figure source):
 If we were to do multiple parallel heads of neighbourhood aggregation and replace summation over the neighbours \(j\) with the attention mechanism, i.e., a weighted sum, we’d get the Graph Attention Network (GAT). Add normalization and the feedforward MLP, and voila, we have a Graph Transformer! Transformers are thus a special case of GNNs – they are just GNNs with multihead attention.
Sentences are fullyconnected word graphs
 To make the connection more explicit, consider a sentence as a fullyconnected graph, where each word is connected to every other word. Now, we can use a GNN to build features for each node (word) in the graph (sentence), which we can then perform NLP tasks with as shown in the figure (source) below.

Broadly, this is what Transformers are doing: they are GNNs with multihead attention as the neighbourhood aggregation function. Whereas standard GNNs aggregate features from their local neighbourhood nodes \(j \in \mathcal{N}(i)\), Transformers for NLP treat the entire sentence \(\mathcal{S}\) as the local neighbourhood, aggregating features from each word \(j \in \mathcal{S}\) at each layer.

Importantly, various problemspecific tricks—such as position encodings, causal/masked aggregation, learning rate schedules and extensive pretraining—are essential for the success of Transformers but seldom seem in the GNN community. At the same time, looking at Transformers from a GNN perspective could inspire us to get rid of a lot of the bells and whistles in the architecture.
Inductive biases of transformers
 Based on the above discussion, we’ve established that transformers are indeed a special case of Graph Neural Networks (GNNs) owing to their architecture level commonalities. Relational inductive biases, deep learning, and graph networks by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in the table below from the paper. Each neural net architecture exhibits varying degrees of relational inductive biases. Transformers fall somewhere between RNNs and GNNs in the table below (source).
 YouTube Video from UofT CSC2547: Relational inductive biases, deep learning, and graph networks; Slides by KAIST on inductive biases, graph neural networks, attention and relational inference
Time complexity: RNNs vs. Transformers
 RNNs and Transformers have different time complexities, which significantly impact their runtime performance, especially on long sequences. This section offers a detailed explanation of the time complexities of RNNs and Transformers, including the reasoning behind each term in the complexities.
RNNs
 Time Complexity: \(O(n \cdot d^2)\)
 Explanation:
 \(n\): This represents the length of the input sequence. RNNs process sequences one step at a time, so they need to iterate through all \(n\) time steps.
 \(d\): This represents the dimensionality of the hidden state.
 \(d^2\): This term arises because at each time step, an RNN performs operations that involve the hidden state. Specifically, each step involves matrix multiplications that have a computational cost of \(O(d^2)\). The key operations are:
 Hidden State Update: For a simple RNN, the hidden state update is computed as \(h_t = \tanh(W_h h_{t1} + W_x x_t)\). Here, \(W_h\) and \(W_x\) are weight matrices of size \(d \times d\) and \(d \times \text{input\_dim}\), respectively.
 The matrix multiplication \(W_h h_{t1}\) dominates the computation and contributes \(O(d^2)\) to the complexity because multiplying a \(d \times d\) matrix with a \(d\)dimensional vector requires \(d^2\) operations.
 Therefore, for each of the \(n\) time steps, the \(d^2\) operations need to be performed, leading to the overall time complexity of \(O(n \cdot d^2)\).
 Explanation:
Transformers
 Time Complexity: \(O(n^2 \cdot d)\)
 Explanation:
 \(n\): This represents the length of the input sequence.
 \(n^2\): This term arises from the selfattention mechanism used in Transformers. In selfattention, each token in the sequence attends to every other token, requiring the computation of attention scores for all pairs of tokens. This results in \(O(n^2)\) pairwise comparisons.
 \(d\): This represents the dimensionality of the model. The attention mechanism involves projecting the input into query, key, and value vectors of size \(d\), and computing dot products between queries and keys, which are then scaled and used to weight the values. The operations involved are:
 Projection: Each input token is projected into three different \(d\)dimensional spaces (query, key, value), resulting in a complexity of \(O(nd)\) for this step.
 Dot Products: Computing the dot product between each pair of query and key vectors results in \(O(n^2 d)\) operations.
 Weighting and Summing: Applying the attention weights to the value vectors and summing them up also involves \(O(n^2 d)\) operations.
 Therefore, the overall time complexity for the selfattention mechanism in Transformers is \(O(n^2 \cdot d)\).
 Explanation:
Comparative Analysis
 RNNs: The linear time complexity with respect to the sequence length makes RNNs potentially faster for shorter sequences. However, their sequential nature can make parallelization challenging, leading to slower processing times for long sequences on modern hardware optimized for parallel computations. Put simply, the dependency on previous time steps means that RNNs cannot fully leverage parallel processing, which is a significant drawback on modern hardware optimized for parallel computations.
 Transformers: The quadratic time complexity with respect to the sequence length means that Transformers can be slower for very long sequences. However, their highly parallelizable architecture often results in faster training and inference times on modern GPUs, especially for tasks involving long sequences or large datasets. This parallelism makes them more efficient in practice, especially for tasks involving long sequences or large datasets. The ability to handle dependencies across long sequences without being constrained by the sequential nature of RNNs gives Transformers a significant advantage in many applications.
Practical Implications
 For tasks involving short to moderately long sequences, RNNs can be efficient and effective.
 For tasks involving long sequences, Transformers are generally preferred due to their parallel processing capabilities, despite their higher theoretical time complexity.
Summary
 RNNs: \(O(n \cdot d^2)\) – Efficient for shorter sequences, but limited by sequential processing.
 Transformers: \(O(n^2 \cdot d)\) – Better suited for long sequences due to parallel processing capabilities, despite higher theoretical complexity.
Lessons Learned
Transformers: merging the worlds of linguistic theory and statistical NLP using fully connected graphs

Now that we’ve established a connection between Transformers and GNNs, let’s throw some ideas around. For one, are fullyconnected graphs the best input format for NLP?

Before statistical NLP and ML, linguists like Noam Chomsky focused on developing formal theories of linguistic structure, such as syntax trees/graphs. Tree LSTMs already tried this, but maybe Transformers/GNNs are better architectures for bringing together the two worlds of linguistic theory and statistical NLP? For example, a very recent work from MILA and Stanford explores augmenting pretrained Transformers such as BERT with syntax trees [Sachan et al., 2020. The figure below from Wikipedia: Syntactic Structures shows a tree diagram of the sentence “Colorless green ideas sleep furiously”:
Long term dependencies

Another issue with fullyconnected graphs is that they make learning very longterm dependencies between words difficult. This is simply due to how the number of edges in the graph scales quadratically with the number of nodes, i.e., in an \(n\) word sentence, a Transformer/GNN would be doing computations over \(n^{2}\) pairs of words. Things get out of hand for very large \(n\).

The NLP community’s perspective on the long sequences and dependencies problem is interesting: making the attention mechanism sparse or adaptive in terms of input size, adding recurrence or compression into each layer, and using Locality Sensitive Hashing for efficient attention are all promising new ideas for better transformers. See Maddison May’s excellent survey on longterm context in Transformers for more details.

It would be interesting to see ideas from the GNN community thrown into the mix, e.g., Binary Partitioning for sentence graph sparsification seems like another exciting approach. BPTransformers recursively subdivide sentences into two until they can construct a hierarchical binary tree from the sentence tokens. This structural inductive bias helps the model process longer text sequences in a memoryefficient manner. The following figure from Ye et al. (2019) shows binary partitioning for sentence graph sparsification.
Are Transformers learning neural syntax?
 There have been several interesting papers from the NLP community on what Transformers might be learning. The basic premise is that performing attention on all word pairs in a sentence – with the purpose of identifying which pairs are the most interesting – enables Transformers to learn something like a taskspecific syntax.
 Different heads in the multihead attention might also be ‘looking’ at different syntactic properties, as shown in the figure (source) below.
Why multiple heads of attention? Why attention?
 The optimization view of multiple attention heads is that they improve learning and help overcome bad random initializations. For instance, Analyzing MultiHead SelfAttention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned and it’s accompanying post by Viota (2019) and Are Sixteen Heads Really Better than One? by Michel et al. showed that Transformer heads can be ‘pruned’ or removed after training without significant performance impact.
Benefits of Transformers compared to RNNs/GRUs/LSTMs
 The Transformer can learn longerrange dependencies than RNNs and its variants such as GRUs and LSTMs.
 The biggest benefit, however, comes from how the Transformer lends itself to parallelization. Unlike an RNN which processes a word at each time step, a key property of the Transformer is that the word at each position flows through its own path in the encoder. There are dependencies between these paths in the selfattention layer (since the selfattention layer computes how important each other word in the input sequence is to this word). However, once the selfattention output is generated, the feedforward layer does not have those dependencies, and thus the various paths can be executed in parallel while flowing through the feedforward layer. This is an especially useful trait in case of the Transformer encoder which can process each input word in parallel with other words after the selfattention layer. This feature, is however, not of great importance for the decoder since it generates one word at a time and thus does not utilize parallel word paths.
What would we like to fix about the transformer? / Drawbacks of Transformers
 The biggest drawback of the Transformer architecture is the quadratic computational complexity with respect to both the number of tokens (\(n\)) and the embedding size (\(d\)). This means that as sequences get longer, the time and computational resources needed for training increase significantly. A detailed discourse on this and a couple of secondary drawbacks are as below.
 Quadratic time and space complexity of the attention layer:
 Transformers use what’s known as selfattention, where each token in a sequence attends to all other tokens (including itself). This implies that the runtime of the Transformer architecture is quadratic in the length of the input sequence, which means it can be slow when processing long documents or taking characters as inputs. If you have a sequence of $$n $$ tokens, you’ll essentially have to compute attention scores for each pair of tokens, resulting in $$n^2 $$ (quadratic) computations. In other words, computing all pairs of interactions (i.e., attention over all wordpairs) during selfattention means our computation grows quadratically with the sequence length, i.e., \(O(T^2 d)\), where \(T\) is the sequence length, and \(d\) is the dimensionality.
 In a graph context, selfattention mandates that the number of edges in the graph to scale quadratically with the number of nodes, i.e., in an \(n\) word sentence, a Transformer would be doing computations over \(n^{2}\) pairs of words. Note that for recurrent models, it only grew linearly.
 This implies a large parameter count (implying high memory footprint) and thus, high computational complexity.
 Say, \(d = 1000\). So, for a single (shortish) sentence, \(T \leq 30 \Rightarrow T^{2} \leq 900 \Rightarrow T^2 d \approx 900K\). Note that in practice, we set a bound such as \(T = 512\). Imagine working on long documents with \(T \geq 10,000\)?!
 High compute requirements has a negative impact on power and battery life requirements, especially for portable device targets.
 Similarly, for storing these attention scores, you’d need space that scales with $$n^2 $$, leading to a quadratic space complexity.
 This becomes problematic for very long sequences as both the computation time and memory usage grow quickly, limiting the practical use of standard transformers for lengthy inputs.
 Overall, a transformer requires higher computational power (and thus, lower battery life) and memory footprint compared to its conventional counterparts.
 Wouldn’t it be nice for Transformers if we didn’t have to compute pairwise interactions between each word pair in the sentence? Recent studies such as the following show that decent performance levels can be achieved without computing interactions between all wordpairs (such as by approximating pairwise attention).
 Quadratic time complexity of linear layers w.r.t. embedding size $$d $$:
 In Transformers, after calculating the attention scores, the result is passed through linear layers, which have weights that scale with the dimension of the embeddings. If your token is represented by an embedding of size $$d $$, and if $$d $$ is greater than $$n $$ (the number of tokens), then the computation associated with these linear layers can also be demanding.
 The complexity arises because for each token, you’re doing operations in a $$d $$dimensional space. For densely connected layers, if $$d $$ grows, the number of parameters and hence computations grows quadratically.
 Positional Sinusoidal Embedding:
 Transformers, in their original design, do not inherently understand the order of tokens (i.e., they don’t recognize sequences). To address this, positional information is added to the token embeddings.
 The original Transformer model (by Vaswani et al.) proposed using sinusoidal functions to generate these positional embeddings. This method allows models to theoretically handle sequences of any length (since sinusoids are periodic and continuous), but it might not be the most efficient or effective way to capture positional information, especially for very long sequences or specialized tasks. Hence, it’s often considered a limitation or area of improvement, leading to newer positional encoding methods like Rotary Positional Embeddings (RoPE).
 Data appetite of Transformers vs. sampleefficient architectures:
 Furthermore, compared to CNNs, the sample complexity (i.e., data appetite) of transformers is obscenely high. CNNs are still sample efficient, which makes them great candidates for lowresource tasks. This is especially true for image/video generation tasks where an exceptionally large amount of data is needed, even for CNN architectures (and thus implies that Transformer architectures would have a ridiculously high data requirement). For example, the recent CLIP architecture by Radford et al. was trained with CNNbased ResNets as vision backbones (and not a ViTlike transformer architecture).
 Put simply, while Transformers do offer accuracy lifts once their data requirement is satisfied, CNNs offer a way to deliver reasonable performance in tasks where the amount of data available is not exceptionally high. Both architectures thus have their usecases.
Why is training Transformers so hard?
 Reading new Transformer papers makes me feel that training these models requires something akin to black magic when determining the best learning rate schedule, warmup strategy and decay settings. This could simply be because the models are so huge and the NLP tasks studied are so challenging.
 But recent results suggest that it could also be due to the specific permutation of normalization and residual connections within the architecture.
Transformers: Extrapolation engines in highdimensional space
 The fluency of Transformers can be tracked back to extrapolation in a high dimensional space. That is what they do: capturing of high abstractions of semantic structures while learning, matching and merging those patterns on output. So any inference must be converted into a retrieval task (which then is called many names like Prompt Engineering, Chain/Tree/Graph/* of Thought, RAG, etc.), while any Transformer model is by design a giant stochastic approximation of whatever its training data it was fed.
The road ahead for Transformers
 In the field of NLP, Transformers have already established themselves as the numero uno architectural choice or the de facto standard for a plethora of NLP tasks.
 Likewise, in the field of vision, an updated version of ViT was second only to a newer approach that combines CNNs with transformers on the ImageNet image classification task at the start of 2022. CNNs without transformers, the longtime champs, barely reached the top 10!
 It is quite likely that transformers or hybrid derivatives thereof (combining concepts of selfattention with say convolutions) will be the leading architectures of choice in the near future, especially if functional metrics (such as accuracy) are the sole optimization metrics. However, along other axes such as data, computational complexity, power/battery life, and memory footprint, transformers are currently not the best choice – which the above section on What Would We Like to Fix about the Transformer? / Drawbacks of Transformers expands on.
 Could Transformers benefit from ditching attention, altogether? Yann Dauphin and collaborators’ recent work suggests an alternative ConvNet architecture. Transformers, too, might ultimately be doing something similar to ConvNets!
Choosing the right language model for your NLP usecase: key takeaways
 Some key takeaways for LLM selection and deployment:
 When evaluating potential models, be clear about where you are in your AI journey:
 In the beginning, it might be a good idea to experiment with LLMs deployed via cloud APIs.
 Once you have found productmarket fit, consider hosting and maintaining your model on your side to have more control and further sharpen model performance to your application.
 To align with your downstream task, your AI team should create a short list of models based on the following criteria:
 Benchmarking results in the academic literature, with a focus on your downstream task.
 Alignment between the pretraining objective and downstream task: consider autoencoding for NLU and autoregression for NLG. The figure below shows the best LLMs depending on the NLP usecase (image source):
 The shortlisted models should be then tested against your realworld task and dataset to get a first feeling for the performance.
 In most cases, you are likely to achieve better quality with dedicated finetuning. However, consider few/zeroshot learning if you don’t have the internal tech skills or budget for finetuning, or if you need to cover a large number of tasks.
 LLM innovations and trends are shortlived. When using language models, keep an eye on their lifecycle and the overall activity in the LLM landscape and watch out for opportunities to step up your game.
 When evaluating potential models, be clear about where you are in your AI journey:
Transformers Learning Recipe
 Transformers have accelerated the development of new techniques and models for natural language processing (NLP) tasks. While it has mostly been used for NLP tasks, it is now seeing heavy adoption in other areas such as computer vision and reinforcement learning. That makes it one of the most important modern concepts to understand and be able to apply.
 A lot of machine learning and NLP students and practitioners are keen on learning about transformers. Therefore, this recipe of resources and study materials should be helpful to help guide students interested in learning about the world of Transformers.
 To dive deep into the Transformer architecture from an NLP perspective, here’s a few links to better understand and implement transformer models from scratch.
HuggingFace EncoderDecoder Models
 With the HuggingFace EncoderDecoder class, you no longer need to stick to prebuilt encoderdecoder models like BART or T5, but can instead build your own EncoderDecoder architecture by doing a mixandmatch with the encoder and decoder model of your choice (similar to stacking legos!), say BERTGPT2. This is called “warmstarting” encoderdecoder models. Read more here: HuggingFace: Leveraging Pretrained Language Model Checkpoints for EncoderDecoder Models.
 You could build your own multimodal encoderdecoder architectures by mixing and matching encoders and decoders. For example:
 Image captioning: ViT/DEiT/BEiT + GPTx
 OCR: ViT/DEiT/BEiT + xBERT
 ImagetoText (CLIP): ViT/DEiT/BEiT + xBERT
 SpeechtoText: Wav2Vec2 Encoder + GPTx
 TexttoImage (DALLE): xBERT + DALLE
 TexttoSpeech: xBERT + speech decoder
 TexttoImage: xBERT + image decoder
 As an example, refer TrOCR: Transformerbased Optical Character Recognition with Pretrained Models and Leveraging Pretrained Checkpoints for Sequence Generation Tasks.
Highlevel Introduction

First, try to get a very highlevel introduction about transformers. Some references worth looking at:
 Transformers From Scratch (by Brandon Rohrer)
 How Transformers work in deep learning and NLP: an intuitive introduction (by AI Summer)
 Deep Learning for Language Understanding (by DeepMind)
The Illustrated Transformer
 Jay Alammar’s illustrated explanations are exceptional. Once you get that highlevel understanding of transformers, going through The Illustrated Transformer is recommend for its detailed and illustrated explanation of transformers:
Technical Summary
 At this point, you may be looking for a technical summary and overview of transformers. Lilian Weng’s The Transformer Family is a gem and provides concise technical explanations/summaries:
Implementation
 Once you’ve absorbed the theory, implementing algorithms from scratch is a great way to test your knowledge and understanding of the subject matter.
 For implementing transformers in PyTorch, The Annotated Transformer offers a great tutorial. Mina Ghashami’s Transformer: Concept and Code from Scratch is also a great resource.
 For implementing transformers in TensorFlow, Transformer model for language understanding offers a great tutorial.
 Google Colab; GitHub
Attention Is All You Need
 This paper by Vaswani et al. introduced the Transformer architecture. Read it after you have a highlevel understanding and want to get into the details. Pay attention to other references in the paper for diving deep.
Applying Transformers
 After some time studying and understanding the theory behind transformers, you may be interested in applying them to different NLP projects or research. At this time, your best bet is the Transformers library by HuggingFace.
 The Hugging Face Team has also published a new book on NLP with Transformers, so you might want to check that out as well.
Inference Arithmetic
 This blog by Kipply presents detailed fewprinciples reasoning about large language model inference performance, with no experiments or difficult math. The amount of understanding that can be acquired this way is really impressive and practical! A very simple model of latency for inference turns out to be a good fit for emprical results. It can enable better predictions and form better explanations about transformer inference.
Transformer Taxonomy
 This blog by Kipply is a comprehensive literature review of AI, specifically focusing on transformers. It covers 22 models, 11 architectural changes, 7 postpretraining techniques, and 3 training techniques. The review is curated based on the author’s knowledge and includes links to the original papers for further reading. The content is presented in a loosely ordered manner based on importance and uniqueness.
GPT in 60 Lines of NumPy
 The blog post implements picoGPT and flexes some of the benefits of JAX: (i) trivial to port Numpy using
jax.numpy
, (ii) get gradients, and (iii) batch withjax.vmap
. It also inferences GPT2 checkpoints.
xtransformers
 This Github repo offers a concise but fullyfeatured transformer, complete with a set of promising experimental features from various papers.
Speeding up the GPT  KV cache
 The blog post discusses an optimization technique for speeding up transformer model inference using KeyValue (KV) caching, highlighting its implementation in GPT models to reduce computational complexity from quadratic to linear by caching inputs for the attention block, thereby enhancing prediction speed without compromising output quality.
FAQs
Did the original Transformer use absolute or relative positional encoding?
 The original Transformer model, as introduced by Vaswani et al. in their 2017 paper “Attention Is All You Need”, used absolute positional encoding. This design was a key feature to incorporate the notion of sequence order into the model’s architecture.
 Absolute Positional Encoding in the Original Transformer
 Mechanism:
 The Transformer model does not inherently capture the sequential order of the input data in its selfattention mechanism. To address this, the authors introduced absolute positional encoding.
 Each position in the sequence was assigned a unique positional encoding vector, which was added to the input embeddings before they were fed into the attention layers.
 Implementation: The positional encodings used were fixed (not learned) and were based on sine and cosine functions of different frequencies. This choice was intended to allow the model to easily learn to attend by relative positions since for any fixed offset \(k, PE_{pos + k}\) could be represented as a linear function of \(PE_{pos}\).
 Mechanism:
 Importance: This approach to positional encoding was crucial for enabling the model to understand the order of tokens in a sequence, a fundamental aspect of processing sequential data like text.
 Relative and Rotary Positional Encoding in Later Models
 After the introduction of the original Transformer, subsequent research explored alternative ways to incorporate positional information. One such development was the use of relative positional encoding, which, instead of assigning a unique encoding to each absolute position, encodes the relative positions of tokens with respect to each other. This method has been found to be effective in certain contexts and has been adopted in various Transformerbased models developed after the original Transformer. Rotary positional encoding methods (such as RoPE) were also presented after relative positional encoding methods.
 Conclusion: In summary, the original Transformer model utilized absolute positional encoding to integrate sequence order into its architecture. This approach was foundational in the development of Transformer models, while later variations and improvements, including relative positional encoding, have been explored in subsequent research to further enhance the model’s capabilities.
How does the choice of positional encoding method can influence the number of parameters added to the model? Consinder absolute, relative, and rotary positional encoding mechanisms.
 In Large Language Models (LLMs), the choice of positional encoding method can influence the number of parameters added to the model. Let’s compare absolute, relative, and rotary (RoPE) positional encoding in this context:
 Absolute Positional Encoding
 Parameter Addition:
 Absolute positional encodings typically add a fixed number of parameters to the model, depending on the maximum sequence length the model can handle.
 Each position in the sequence has a unique positional encoding vector. If the maximum sequence length is \(N\) and the model dimension is \(D\), the total number of added parameters for absolute positional encoding is \(N \times D\).
 Fixed and NonLearnable: In many implementations (like the original Transformer), these positional encodings are fixed (based on sine and cosine functions) and not learnable, meaning they don’t add to the total count of trainable parameters.
 Parameter Addition:
 Relative Positional Encoding
 Parameter Addition:
 Relative positional encoding often adds fewer parameters than absolute encoding, as it typically uses a set of parameters that represent relative positions rather than unique encodings for each absolute position.
 The exact number of added parameters can vary based on the implementation but is generally smaller than the \(N \times D\) parameters required for absolute encoding.
 Learnable or Fixed: Depending on the model, relative positional encodings can be either learnable or fixed, which would affect whether they contribute to the model’s total trainable parameters.
 Parameter Addition:
 Rotary Positional Encoding (RoPE)
 Parameter Addition:
 RoPE does not add any additional learnable parameters to the model. It integrates positional information through a rotation operation applied to the query and key vectors in the selfattention mechanism.
 The rotation is based on the position but is calculated using fixed, nonlearnable trigonometric functions, similar to absolute positional encoding.
 Efficiency: The major advantage of RoPE is its efficiency in terms of parameter count. It enables the model to capture relative positional information without increasing the number of trainable parameters.
 Parameter Addition:
 Summary:
 Absolute Positional Encoding: Adds \(N \times D\) parameters, usually fixed and nonlearnable.
 Relative Positional Encoding: Adds fewer parameters than absolute encoding, can be learnable, but the exact count varies with implementation.
 Rotary Positional Encoding (RoPE): Adds no additional learnable parameters, efficiently integrating positional information.
 In terms of parameter efficiency, RoPE stands out as it enriches the model with positional awareness without increasing the trainable parameter count, a significant advantage in the context of LLMs where managing the scale of parameters is crucial.
In Transformerbased models, why is RoPE required for context length extension?
 RoPE, or Rotary Positional Embedding, is a technique used in some language models, particularly Transformers, for handling positional information. The need for RoPE or similar techniques becomes apparent when dealing with long context lengths in Large Language Models (LLMs).
 Context Length Extension in LLMs
 Positional Encoding in Transformers:
 Traditional Transformer models use positional encodings to add information about the position of tokens in a sequence. This is crucial because the selfattention mechanism is, by default, permutationinvariant (i.e., it doesn’t consider the order of tokens).
 In standard implementations like the original Transformer, positional encodings are added to the token embeddings and are typically fixed (not learned) and based on sine and cosine functions of different frequencies.
 Challenges with Long Sequences: As the context length (number of tokens in a sequence) increases, maintaining effective positional information becomes challenging. This is especially true for fixed positional encodings, which may not scale well or capture relative positions effectively in very long sequences.
 Role and Advantages of RoPE
 Rotary Positional Embedding: RoPE is designed to provide rotational equivariance to selfattention. It essentially encodes the absolute position and then rotates the positional encoding of keys and queries differently based on their position. This allows the model to implicitly capture relative positional information through the selfattention mechanism.
 Effectiveness in Long Contexts: RoPE scales effectively with sequence length, making it suitable for LLMs that need to handle long contexts or documents. This is particularly important in tasks like document summarization or questionanswering over long passages.
 Preserving Relative Positional Information: RoPE allows the model to understand the relative positioning of tokens effectively, which is crucial in understanding the structure and meaning of sentences, especially in languages with less rigid syntax.
 Computational Efficiency: Compared to other methods of handling positional information in long sequences, RoPE can be more computationally efficient, as it doesn’t significantly increase the model’s complexity or the number of parameters.
 Conclusion: In summary, RoPE is required for effectively extending the context length in LLMs due to its ability to handle long sequences while preserving crucial relative positional information. It offers a scalable and computationally efficient solution to one of the challenges posed by the selfattention mechanism in Transformers, particularly in scenarios where understanding the order and relationship of tokens in long sequences is essential.
Why is the Transformer Architecture not as susceptible to vanishing gradients compared to RNNs?
 The Transformer architecture is less susceptible to vanishing gradients compared to Recurrent Neural Networks (RNNs) due to several key differences in their design and operation:
 SelfAttention Mechanism and Parallel Processing:
 Transformers: Transformers use selfattention mechanisms which allow them to directly access any position in the input sequence without the need for sequential processing. This means that the gradients can flow more easily across the entire network since there are direct connections between all input and output positions. Additionally, the selfattention mechanism and feedforward layers in Transformers allow for parallel processing of the entire sequence, facilitating better gradient flow and more efficient training. To handle the sequential nature of data, Transformers use positional encodings added to the input embeddings, enabling them to maintain the order of the sequence while still allowing parallel processing.
 RNNs: RNNs process input sequences sequentially, step by step. This sequential processing can cause gradients to either vanish or explode as they are propagated back through many time steps during training, especially in long sequences. RNNs are typically trained using Backpropagation Through Time (BPTT), a method that unrolls the network through time and applies backpropagation. BPTT can suffer from vanishing and exploding gradients because the gradients must be propagated back through many time steps, leading to instability and difficulty in training long sequences.
 Residual Connections:
 Transformers: Each layer in a Transformer includes residual (skip) connections, which add the input of a layer to its output. These connections help gradients flow through the network more directly, mitigating the vanishing gradient problem.
 RNNs: Although some RNN architectures can incorporate residual connections, it is less common and less effective due to the inherently sequential nature of RNNs.
 Layer Normalization:
 Transformers: Transformers use layer normalization, which helps stabilize the training process and maintain gradient magnitudes.
 RNNs: While batch normalization and layer normalization can be applied to RNNs, it is more challenging and less common compared to the straightforward application in Transformers.
 SelfAttention Mechanism and Parallel Processing:
 In summary, the Transformer architecture’s reliance on parallel processing nature of selfattention (and thus the avoidance of BPTT that RNNs depend on), residual connections, and layer normalization contributes to its robustness against vanishing gradients, making it more efficient and effective for handling long sequences compared to RNNs.
What is the fraction of attention weights relative to feedforward weights in common LLMs?
GPT
 In GPT1 and similar transformerbased models, the distribution of parameters between attention mechanisms and feedforward networks (FFNs) is key to understanding their architecture and design. Let’s delve into the parameter allocation in GPT1:
Model Configuration

GPT1, like many models in the GPT series, follows the transformer architecture described in the original “Attention is All You Need” paper. Here’s a breakdown:
 Model Dimension ((d_{\text{model}})): For GPT1, (d_{\text{model}}) is typically smaller compared to later models like GPT3. The size used in GPT1 is 768.
 FeedForward Dimension ((d_{\text{ff}})): The dimension of the feedforward layers in GPT1 is typically about 4 times the model dimension, similar to other transformers. This results in (d_{\text{ff}} = 3072) for GPT1.
Attention and FeedForward Weights Calculation

Let’s calculate the typical number of parameters for each component:
 Attention Parameters:
 Query, Key, Value (QKV) Weights: Each transformer layer in GPT1 includes multihead selfattention with separate weights for queries, keys, and values. Each of these matrices is of size (d_{\text{model}} \times \frac{d_{\text{model}}}{h}), and for simplicity, the total size for Q, K, and V combined for all heads is (d_{\text{model}} \times d_{\text{model}}).
 Output Projection: This is another matrix of size (d_{\text{model}} \times d_{\text{model}}).
 FeedForward Network (FFN) Parameters:
 Layer Projections: Consisting of two linear transformations:
 First layer projects from (d_{\text{model}}) to (d_{\text{ff}}),
 Second layer projects back from (d_{\text{ff}}) to (d_{\text{model}}).
 Layer Projections: Consisting of two linear transformations:
Example Calculation with GPT1 Values
 Total Attention Weights Per Layer:
 Total for Q, K, and V combined: (768 \times 768 \times 3 = 1769472).
 Output projection: (768 \times 768 = 589824).
 Total attention weights: (1769472 + 589824 = 2359296) parameters.
 Total FeedForward Weights Per Layer:
 Upprojection: (768 \times 3072 = 2359296),
 Downprojection: (3072 \times 768 = 2359296),
 Total FFN weights: (2359296 + 2359296 = 4718592) parameters.
Fraction of Attention to FFN Weights
 The fraction of attention weights relative to FFN weights can be calculated as:
Conclusion
 In GPT1, the feedforward networks hold about twice as many parameters as the attention mechanisms, a typical distribution for transformer models. This emphasizes the substantial role of the FFNs in enhancing the model’s ability to process and transform information, complementing the capabilities provided by the attention mechanisms. This balance is crucial for the overall performance and flexibility of the model in handling various language processing tasks.
GPT2
 In common large language models like GPT2, the fraction of attention weights relative to feedforward (MLP) weights generally follows a consistent pattern due to the architecture of the transformer layers used in these models. Typically, the MultiLayer Perceptron (MLP) blocks contain significantly more parameters than the attention blocks.
 Here’s a breakdown for better understanding:
Transformer Layer Composition
 Attention Mechanism: Each layer in a transformerbased model like GPT2 includes multihead selfattention mechanisms. The parameters in these mechanisms consist of query, key, value, and output projection matrices.
 FeedForward Network (MLP): Following the attention mechanism, each layer includes an MLP block, typically consisting of two linear transformations with a ReLU activation in between.
Parameter Distribution
 Attention Weights: For each attention head, the parameters are distributed across the matrices for queries, keys, values, and the output projection. If the model dimension is \(d_{\text{model}}\) and there are \(h\) heads, each head might use matrices of size \(\frac{d_{\text{model}}}{h} \times d_{\text{model}}\) for each of the query, key, and value, and \(d_{\text{model}} \times d_{\text{model}}\) for the output projection.
 MLP Weights: The MLP usually consists of two layers. The first layer projects from \(d_{\text{model}}\) to \(d_{\text{ff}}\) (where \(d_{\text{ff}}\) is typically 4 times \(d_{\text{model}}\)), and the second layer projects back from \(d_{\text{ff}}\) to \(d_{\text{model}}\). Thus, the MLP contains weights of size \(d_{\text{model}} \times d_{\text{ff}}\) and \(d_{\text{ff}} \times d_{\text{model}}\).
Example Calculation
 For GPT2, if we assume \(d_{\text{model}} = 768\) and \(d_{\text{ff}} = 3072\) (which is common in models like GPT2), and the number of heads \(h = 12\):
 Attention Parameters per Layer: Each set of Q/K/V matrices is \(\frac{768}{12} \times 768 = 49152\) parameters, and there are 3 sets per head, plus another \(768 \times 768\) for the output projection, totaling \(3 \times 49152 + 589824 = 737280\) parameters for all attention heads combined per layer.
 MLP Parameters per Layer: \(768 \times 3072 + 3072 \times 768 = 4718592\) parameters.
Fraction of Attention to MLP Weights

Fraction: Given these typical values, the attention parameters are about 737280, and the MLP parameters are about 4718592 per layer. This gives a fraction of attention to MLP weights of roughly \(\frac{737280}{4718592} \approx 0.156\), or about 15.6%.

This fraction indicates that the feedforward layers in models like GPT2 hold a substantially larger portion of the parameters compared to the attention mechanisms, emphasizing the role of the MLP in transforming representations within the network. This distribution has implications for deciding which components to adapt or optimize during tasks like finetuning, as the MLP layers may offer a larger scope for modification due to their greater parameter count.
BERT
 In the architecture of BERT (Bidirectional Encoder Representations from Transformers), which utilizes the transformer model structure similar to models in the GPT series, the distribution of parameters between attention mechanisms and feedforward networks (FFNs) reflects a balance that is integral to the model’s ability to perform its intended tasks. Here’s an overview of how these weights are typically distributed in BERT and similar models:
Model Configuration
 Model Dimension (\(d_{\text{model}}\)): This is the size of the hidden layers throughout the model. For example, BERTBase uses \(d_{\text{model}} = 768\).
 FeedForward Dimension (\(d_{\text{ff}}\)): The dimension of the feedforward layer is usually set to about 4 times \(d_{\text{model}}\). For BERTBase, \(d_{\text{ff}} = 3072\).
Attention and FeedForward Weights Calculation
 Attention Parameters:
 Query, Key, Value (QKV) Weights: Each transformer layer in BERT has multihead selfattention with separate weights for queries, keys, and values. For each head:
 Size of each matrix (Q, K, V): \(d_{\text{model}} \times \frac{d_{\text{model}}}{h}\), where \(h\) is the number of heads. The total size per matrix type for all heads combined is \(d_{\text{model}} \times d_{\text{model}}\).
 Output Projection Weights: Another matrix of size \(d_{\text{model}} \times d_{\text{model}}\).
 Query, Key, Value (QKV) Weights: Each transformer layer in BERT has multihead selfattention with separate weights for queries, keys, and values. For each head:
 FeedForward Network (FFN) Parameters:
 Layer Projections: There are two linear transformations in the FFN block:
 The first layer projects from \(d_{\text{model}}\) to \(d_{\text{ff}}\),
 The second layer projects back from \(d_{\text{ff}}\) to \(d_{\text{model}}\).
 Layer Projections: There are two linear transformations in the FFN block:
Example Calculation with Typical Values
 Attention Weights Per Layer:
 For Q, K, and V: \(768 \times 768 \times 3 = 1769472\) (each type has size \(768 \times 768\)).
 Output projection: \(768 \times 768 = 589824\).
 Total Attention Weights: \(1769472 + 589824 = 2359296\) parameters.
 FeedForward Weights Per Layer:
 Upprojection: \(768 \times 3072 = 2359296\),
 Downprojection: \(3072 \times 768 = 2359296\),
 Total FFN Weights: \(2359296 + 2359296 = 4718592\) parameters.
Fraction of Attention to FFN Weights
 The fraction of attention weights relative to FFN weights can be calculated as:
Conclusion
 In BERT, like in many transformer models, the feedforward networks hold about twice as many parameters as the attention mechanisms. This indicates a strong emphasis on the transformation capabilities of the FFNs, crucial for enabling BERT to generate contextrich embeddings for various NLP tasks. The FFN layers in BERT and similar models play a pivotal role in enhancing the model’s representational power, ensuring it can handle complex dependencies and nuances in language understanding and generation tasks.
Further Reading
 The Illustrated Transformer
 The Annotated Transformer
 Transformer models: an introduction and catalog
 Deep Learning for NLP Best Practices
 What is Teacher Forcing for Recurrent Neural Networks?
 What is Teacher Forcing?
 The Transformer Family
 Transformer: Concept and Code from Scratch
 Transformer Inference Arithmetic
 Transformer Taxonomy
References
 Transformers are Graph Neural Networks
 Transformers from Scratch by Brandon Rohrer
 Transformers from Scratch by Peter Bloem
 Positional encoding tutorial by Amirhossein Kazemnejad
 What is One Hot Encoding? Why and When Do You Have to Use it?
 Wikipedia: Dot product
 Wikipedia: Byte pair encoding
 Will Transformers Take Over Artificial Intelligence?
 Transformer Recipe
 The decoder part in a transformer model
 Why encoder input doesn’t have a start token?
 What is the cost function of a transformer?
 Transformer models: an introduction and catalog
 Why does a transformer not use an activation function following the multihead attention layer?
 CNN Explainer: Learn Convolutional Neural Network (CNN) in your browser!
 Where is dropout placed in the original transformer?
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledTransformers,
title = {Transformers},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}