Primers • Attention
 Overview
 Introduction
 The Classic Sequence-to-Sequence Model
 Sequence-to-Sequence Model with Attention
 Context Vector
 Recap
 Extensions to the Classic Attention Mechanism
 Summary
 References
 Citation
Overview
 The attention mechanism is now an established technique in many NLP tasks.
 This article provides an introduction focused on the idea behind the attention mechanism, as applied to the task of neural machine translation.
Introduction

In the context of NLP, the attention mechanism was first introduced in “Neural Machine Translation by Jointly Learning to Align and Translate” at ICLR 2015 by Bahdanau et al. (2015).

This was proposed in the context of machine translation, where given a sentence in one language, the model has to produce a translation for that sentence in another language.

In the paper, the authors tackle the problem of the fixed-length context vector in the original seq2seq model for machine translation, introduced in “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation” by Cho et al. (2014).
The Classic Sequence-to-Sequence Model
 The seq2seq model is composed of two main components, an encoder and a decoder, as shown in the figure (taken from towardsdatascience) below:

The encoder reads the input sentence, a sequence of vectors \(x = (x_{1}, \dots , x_{T})\), into a fixed-length vector \(c\). The encoder is a recurrent neural network, typically a GRU or an LSTM, such that:
\[h_{t} = f(x_{t}, h_{t-1})\] \[c = q(h_{1}, \dotsc, h_{T})\] where \(h_{t}\) is the hidden state at time \(t\), \(c\) is a vector generated from the sequence of hidden states, and \(f\) and \(q\) are some nonlinear functions.

At every timestep \(t\) the encoder produces a hidden state \(h_{t}\), and the generated context vector is modelled according to all hidden states.
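The encoder recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses a plain tanh RNN cell as \(f\) (a real encoder would use a GRU or LSTM), and all dimensions and weight names are made up for the example.

```python
import numpy as np

# Toy sketch of the encoder recurrence h_t = f(x_t, h_{t-1}) with a tanh RNN
# cell standing in for f; dimensions and weights are illustrative only.
rng = np.random.default_rng(0)
d_x, d_h, T = 4, 8, 5                         # input dim, hidden dim, seq length

W_xh = rng.normal(scale=0.1, size=(d_h, d_x)) # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_h, d_h)) # hidden-to-hidden weights

x = rng.normal(size=(T, d_x))                 # input sequence x_1, ..., x_T
h = np.zeros(d_h)                             # initial hidden state h_0
hidden_states = []
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h)       # h_t = f(x_t, h_{t-1})
    hidden_states.append(h)

# In the classic seq2seq model, q simply picks the last hidden state:
c = hidden_states[-1]                         # fixed-length context vector
```

Note how \(c\) has a fixed size \(d_h\) regardless of how long the input sequence is; this is exactly the bottleneck discussed below.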

The decoder is trained to predict the next word \(y_{t}\) given the context vector \(c\) and all the previously predicted words \(\{y_{1}, \dots , y_{t-1}\}\). It defines a probability over the translation \({\bf y}\) by decomposing the joint probability:
\[p({\bf y}) = \prod\limits_{t=1}^{T} p(y_{t} \mid \{y_{1}, \dots , y_{t-1}\}, c)\] where \(\bf y = \{y_{1}, \dots , y_{T}\}\). In other words, the probability of a translation sequence is calculated by computing the conditional probability of each word given the previous words. With an LSTM/GRU, each conditional probability is computed as:
\[p(y_{t} \mid \{y_{1}, \dots , y_{t-1}\}, c) = g(y_{t-1}, s_{t}, c)\] where \(g\) is a nonlinear function that outputs the probability of \(y_{t}\), \(s_{t}\) is the hidden state at the current position, and \(c\) is the context vector.

In a simple seq2seq model, the last output of the LSTM/GRU is the context vector, encoding context from the entire sequence. This context vector is then used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and the previous hidden state. The initial input token is the start-of-string token, and the first hidden state is the context vector (the encoder’s last hidden state).
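The decoding loop just described can be sketched as follows. This is a hedged toy example, not the paper's architecture: \(g\) is approximated by a linear layer plus softmax, embeddings are one-hot, decoding is greedy, and every name and dimension is invented for illustration.

```python
import numpy as np

# Toy sketch of the classic seq2seq decoder: each step maps the previous
# word, the current state s_t, and the fixed context c to a distribution
# over the output vocabulary. All names/dimensions are illustrative.
rng = np.random.default_rng(1)
d_h, vocab = 8, 10
c = rng.normal(size=d_h)                   # encoder's last hidden state

W_sh = rng.normal(scale=0.1, size=(d_h, d_h))
W_eh = rng.normal(scale=0.1, size=(d_h, vocab))
W_out = rng.normal(scale=0.1, size=(vocab, d_h))
embed = np.eye(vocab)                      # one-hot "embeddings" for simplicity

def softmax(z):
    z = z - z.max()                        # numerical stability
    return np.exp(z) / np.exp(z).sum()

s = c                                      # initial decoder state = context vector
y_prev = 0                                 # start-of-string token id
outputs = []
for _ in range(4):
    # s_t depends on the previous state and the previously predicted word
    s = np.tanh(W_sh @ s + W_eh @ embed[y_prev])
    p = softmax(W_out @ s)                 # p(y_t | y_1..y_{t-1}, c) = g(...)
    y_prev = int(p.argmax())               # greedy choice of the next word
    outputs.append(y_prev)
```

The key point is that `c` is computed once and reused unchanged at every decoding step.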
So, the fixed-size context vector needs to contain a good summary of the meaning of the whole source sentence, which makes it a big bottleneck, especially for long sentences. The figure below (taken from Bahdanau et al. (2015)) shows how the performance of the seq2seq model varies with sentence length:
Sequence-to-Sequence Model with Attention
 The fixed-size context vector bottleneck was one of the main motivations for Bahdanau et al. (2015), who proposed a similar architecture with a crucial improvement:
“The new architecture consists of a bidirectional RNN as an encoder and a decoder that emulates searching through a source sentence during decoding a translation”

The encoder is now a bidirectional recurrent network with forward and backward hidden states. A simple concatenation of the two hidden states represents the encoder state at any given position in the sentence. The motivation is to include both the preceding and following words in the representation/annotation of an input word.
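The concatenation of forward and backward states can be sketched like this, a minimal illustration assuming toy tanh RNN cells and invented dimensions (the paper uses gated units):

```python
import numpy as np

# Sketch of the bidirectional encoder: each annotation is the concatenation
# of forward and backward hidden states at that position. Toy tanh cells;
# all dimensions are illustrative.
rng = np.random.default_rng(2)
d_x, d_h, T = 4, 8, 5
x = rng.normal(size=(T, d_x))

def run_rnn(seq, W_xh, W_hh):
    h, states = np.zeros(d_h), []
    for x_t in seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return states

new_W = lambda: rng.normal(scale=0.1, size=(d_h, d_x))
new_U = lambda: rng.normal(scale=0.1, size=(d_h, d_h))
fwd = run_rnn(x, new_W(), new_U())              # reads x_1 ... x_T
bwd = run_rnn(x[::-1], new_W(), new_U())[::-1]  # reads x_T ... x_1, re-aligned

# Annotation h_j = [fwd_j ; bwd_j] covers both preceding and following words
annotations = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each annotation thus has twice the hidden dimension and summarizes the sentence around its position.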

The other key element, and the most important one, is that the decoder is now equipped with some sort of search, allowing it to look at the whole source sentence when it needs to produce an output word: the attention mechanism. The figure below (taken from Bahdanau et al. (2015)) illustrates the attention mechanism in a seq2seq model.

The figure above gives a good overview of this new mechanism. To produce the output word \(y_{t}\) at time \(t\), the decoder uses its last hidden state (one can think of this as a representation of the already produced words) together with a dynamically computed context vector based on the input sequence.

The authors proposed to replace the fixed-length context vector with a context vector \(c_{i}\), which is a sum of the hidden states of the input sequence, weighted by alignment scores.

Note that now the probability of each output word is conditioned on a distinct context vector \(c_{i}\) for each target word \(y_{i}\).

The new decoder is then defined as:
\[p(y_{i} \mid \{y_{1}, \dots , y_{i-1}\}, x) = g(y_{i-1}, s_{i}, c_{i})\] where \(s_{i}\) is the decoder hidden state for time \(i\), computed by:
\[s_{i} = f(s_{i-1}, y_{i-1}, c_{i})\] that is, the new hidden state for position \(i\) depends on the previous hidden state, the representation of the word generated by the previous state, and the context vector for position \(i\). The remaining question now is: how to compute the context vector \(c_{i}\)?
Context Vector

The context vector \(c_{i}\) is a sum of the hidden states of the input sequence, weighted by alignment scores. Each word in the input sequence is represented by a concatenation of the two (i.e., forward and backward) RNN hidden states; let’s call them annotations.

Each annotation contains information about the whole input sequence, with a strong focus on the parts surrounding the \(j^{th}\) word of the input sequence.

The context vector \(c_{i}\) is computed as a weighted sum of these annotations \(h_{j}\):
\[c_{i} = \sum\limits_{j=1}^{T_{x}} \alpha_{ij}\ h_{j}\]

The weight \(\alpha_{ij}\) of each annotation \(h_{j}\) is computed by a softmax over the alignment scores \(e_{ij}\):
\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_{x}} \exp(e_{ik})}, \quad e_{ij} = a(s_{i-1}, h_{j})\] where:

\(a\) is an alignment model which scores how well the inputs around position \(j\) and the output at position \(i\) match. The score is based on the RNN hidden state \(s_{i-1}\) (just before emitting \(y_{i}\)) and the \(j^{th}\) annotation \(h_{j}\) of the input sentence:
\[a(s_{i-1},h_{j}) = \mathbf{v}_a^\top \tanh(\mathbf{W}_{a}\ s_{i-1} + \mathbf{U}_{a}\ {h}_j)\] where \(\mathbf{v}_a\) is a weight vector and \(\mathbf{W}_a\) and \(\mathbf{U}_a\) are weight matrices, all learned in the alignment model.

The alignment model in the paper is described as a feed-forward neural network whose parameters \(\mathbf{v}_a\), \(\mathbf{W}_a\), and \(\mathbf{U}_a\) are learned jointly with the rest of the network.
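Putting the pieces together, one decoding step of the alignment model and context vector computation can be sketched as follows, using made-up toy dimensions and randomly initialized (untrained) weights:

```python
import numpy as np

# Sketch of additive (Bahdanau-style) attention for one decoder step:
#   e_ij  = v_a^T tanh(W_a s_{i-1} + U_a h_j)
#   alpha = softmax(e)
#   c_i   = sum_j alpha_ij h_j
# Dimensions and weights are illustrative, not trained values.
rng = np.random.default_rng(3)
d_s, d_ann, d_a, T = 8, 16, 10, 5

s_prev = rng.normal(size=d_s)              # previous decoder state s_{i-1}
H = rng.normal(size=(T, d_ann))            # annotations h_1 ... h_T

v_a = rng.normal(scale=0.1, size=d_a)
W_a = rng.normal(scale=0.1, size=(d_a, d_s))
U_a = rng.normal(scale=0.1, size=(d_a, d_ann))

# Alignment score e_ij for every source position j
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()  # softmax over j

c_i = alpha @ H                            # weighted sum of annotations
```

Because `alpha` is recomputed from \(s_{i-1}\) at every step, each target word gets its own context vector \(c_i\), unlike the fixed vector of the classic model.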

The authors note:
“The probability \(\alpha_{ij}\), or its associated energy \(e_{ij}\), reflects the importance of the annotation \(h_{j}\) with respect to the previous hidden state \(s_{i-1}\) in deciding the next state \(s_{i}\) and generating \(y_{i}\). Intuitively, this implements a mechanism of attention in the decoder.”
Recap
 It’s now useful to visually review the attention mechanism again and compare it against the fixed-length context vector. The pictures below were made by Nelson Zhao and hopefully help clarify the difference between the two encoder-decoder approaches. The figure below illustrates the encoder-decoder architecture with a fixed context vector.
 On the other hand, the figure below illustrates the encoder-decoder architecture with the attention mechanism.
Extensions to the Classic Attention Mechanism
 Luong et al. (2015) proposed and compared other attention mechanisms, more specifically, alternative functions to compute the alignment score:
 NOTE: the concat score function is the same as in Bahdanau et al. (2015). But, most importantly, instead of a weighted average over all the source hidden states, they proposed a mechanism of local attention, which focuses only on a small subset of the source positions per target word instead of attending to all source words.
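The three global alignment score functions compared by Luong et al. (2015) can be sketched side by side. Shapes and weights below are illustrative; here \(s\) is a decoder state and \(h\) a source hidden state of the same dimension:

```python
import numpy as np

# Sketch of the three alignment score functions from Luong et al. (2015):
# dot, general, and concat. All weights are illustrative, untrained values.
rng = np.random.default_rng(4)
d = 8
s = rng.normal(size=d)                       # decoder hidden state
h = rng.normal(size=d)                       # source hidden state
W = rng.normal(scale=0.1, size=(d, d))       # weights for "general"
W_c = rng.normal(scale=0.1, size=(d, 2 * d)) # weights for "concat"
v = rng.normal(scale=0.1, size=d)

score_dot = s @ h                                          # dot
score_general = s @ (W @ h)                                # general
score_concat = v @ np.tanh(W_c @ np.concatenate([s, h]))   # concat
```

The *dot* variant is parameter-free, *general* adds one learned matrix, and *concat* is the Bahdanau-style additive form.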
Summary
 This was a short introduction to the first “classic” attention mechanism, on which subsequent techniques such as self-attention or query-key-value attention are based.
 After transforming the field of neural machine translation, the attention mechanism was applied to other natural language processing tasks, such as document-level classification or sequence labelling.
References
 An Introductory Survey on Attention Mechanisms in NLP Problems (arXiv.org version)
 Neural Machine Translation by Jointly Learning to Align and Translate (slides)
 Effective Approaches to Attention-based Neural Machine Translation (slides)
 “Attention? Attention!” in Lil’Log
 Big Bird: Transformers for Longer Sequences
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledAttention,
title = {Attention},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}