• Large Language Models (LLMs) like GPT-3 or BERT are deep neural networks that utilize the Transformer architecture. LLMs are part of a class of models known as foundation models because these models can be transferred to a number of downstream tasks (via fine-tuning) since they have been trained on a huge amount of unsupervised and unstructured data.
  • The Transformer architecture has two parts: encoder and decoder. Both encoder and decoder are mostly identical (with a few differences); (more on this in the primer on the Transformer architecture). Also, for the pros and cons of the encoder and decoder stack, refer Autoregressive vs. Autoencoder Models.
  • Given the prevalence of decoder-based models in the area of generative AI, the article focuses on decoder models (such as GPT-x) rather than encoder models (such as BERT and its variants). Henceforth, the term LLMs is used interchangeably with “decoder-based models”.
  • “Given an input text “prompt”, at essence what these systems do is compute a probability distribution over a “vocabulary”—the list of all words (or actually parts of words, or tokens) that the system knows about. The vocabulary is given to the system by the human designers. Note that GPT-3, for example, has a vocabulary of about 50,000 tokens.” Source
  • It’s worthwhile to note that while LLMs still suffer from a myriad of limitations, such as hallucination and issues in chain of thought reasoning (there have been recent improvements), it’s important to keep in mind that LLMs were trained to perform statistical language modeling.

Specifically, language modeling is defined as the task of predicting the next token given some context.


  • In the context of Natural Language Processing (NLP), embeddings are dense vector representations of words or sentences that capture semantic and syntactic properties of words or sentences. These embeddings are usually obtained by training models, such as BERT and its variants, Word2Vec, GloVe, or FastText, on a large corpus of text, and they provide a way to convert textual information into a form that machine learning algorithms can process. Put simply, embeddings encapsulate the semantic meaning of words (which are internally represented as one or more tokens) or semantic and syntactic properties of sentences by representing them as dense, low-dimensional vectors.
  • Note that embeddings can be contextualized (where the embeddings of each token are a function of other tokens in the input; in particular, this enables polysemous words such as “bank” to have a unique embedding depending on whether the word occurs in a “finance” or “river” context) vs. non-contextualized embeddings (where each token has a static embedding irrespective of its context, which can be pre-trained and utilized for downstream applications). Word2Vec, GloVe, FastText, etc. are examples of models that offer non-contextualized embeddings, while BERT and its variants offer contextualized embeddings.
  • To obtain the embedding for the token, extract the learned weights from the trained model for each word. These weights form the word embeddings, where each word is represented by a dense vector.

Contextualized vs. Non-Contextualized Embeddings

  • Encoder models, like the Transformer-based BERT (Bidirectional Encoder Representations from Transformers), are designed to generate contextualized embeddings. Unlike traditional word embeddings that assign a static vector to each word (such as Word2Vec or GloVe), these models consider the context of a word (i.e., the words that surround it). This allows the model to capture a richer and more nuanced understanding of words since the same word can have different meanings based on the context it is used in.

Use-cases of Embeddings

  • With embeddings, you can perform various arithmetic operations to carry out specific tasks:

    1. Word similarity: You can compare the embeddings of two words to understand their similarity. This is often done using cosine similarity, a metric that measures the cosine of the angle between two vectors. A higher cosine similarity between two word vectors indicates that the words are more similar in terms of their usage or meaning.

    2. Word analogy: Vector arithmetic can be used to solve word analogy tasks. For example, given the analogy task “man is to woman as king is to what?”, we can find the answer (queen) by performing the following operation on the word embeddings: “king” - “man” + “woman”.

    3. Sentence similarity: If you want to measure the similarity between two sentences, you could use the embedding of the special [CLS] token produced by models like BERT, which is designed to capture the aggregate meaning of the sentence. Alternatively, you could average the embeddings of all tokens in each sentence and compare these average vectors. Note that when it comes to sentence-level tasks like sentence similarity, Sentence-BERT (SBERT), a modification of the BERT model, is often a better choice. SBERT is specifically trained to produce sentence embeddings that are directly comparable in semantic space, which generally leads to better performance on sentence-level tasks. In SBERT, both sentences are fed into the model simultaneously, allowing it to understand the context of each sentence in relation to the other, resulting in more accurate sentence embeddings.

How do LLMs work?

  • Like we discussed in the Overview section, LLMs are trained to predict the next token based on the set of previous tokens. They do this in an autoregressive manner (where the current generated token is fed back into the LLM as input to generate the next one), enabling generation capabilities.
  • The first step involves taking the prompt they receive, tokenining it, and converting it into embeddings, which are vector representations of the input text. Note that these embeddings are initialized randomly and learned during the course of model training, and represent a non-contextualized vector form of the input token.
  • Next, they do layer-by-layer attention and feed-forward computations, eventually assigning a number or logit to each word in its vocabulary (in the case of a decoder model like GPT-N, LLaMA, etc.) or outputs these features as contextualized embeddings (in the case of an encoder model like BERT and its variants such as RoBERTa, ELECTRA, etc.).
  • Finally, in the case of decoder models, the next step is converting each (unnormalized) logit into a (normalized) probability distribution (via the Softmax function), determining which word shall come next in the text.
  • Let’s break the steps down into finer detail:

    1. Tokenization:
      • Before processing, the raw input text is tokenized into smaller units, often subwords or words. This process breaks down the input into chunks that the model can recognize.
      • This step is crucial because the model has a fixed vocabulary, and tokenization ensures the input is in a format that matches this vocabulary.
    2. Embedding:
      • Each token is then mapped to a high-dimensional vector using an embedding matrix. This vector representation captures the semantic meaning of the token and serves as the input to the subsequent layers of the model.
      • Positional Encodings are added to these embeddings to give the model information about the order of tokens, especially important since models like transformers do not have inherent sequence awareness.
    3. Transformer Architecture:
      • The core of most modern LLMs is the transformer architecture.
      • It comprises multiple layers, and each layer has two primary components: a multi-head self-attention mechanism and a position-wise feed-forward network.
      • The self-attention mechanism allows tokens to weigh the importance of other tokens relative to themselves. In essence, it enables the model to “pay attention” to relevant parts of the input for a given token.
      • After attention, the result is passed through a feed-forward neural network independently at each position.
    4. Residual Connections:
      • Each sub-layer (like self-attention or feed-forward neural network) in the model has a residual connection around it, followed by layer normalization. This helps in stabilizing the activations and speeds up training.
    5. Output Layer:
      • After passing through all the transformer layers, the final representation of each token is transformed into a vector of logits, where each logit corresponds to a word in the model’s vocabulary.
      • These logits describe the likelihood of each word being the next word in the sequence.
    6. Probability Distribution:
      • To convert the logits into probabilities, the Softmax function is applied. It normalizes the logits such that they all lie between 0 and 1 and sum up to 1.
      • The word with the highest probability can be chosen as the next word in the sequence.
    7. Decoding:
      • Depending on the application, different decoding strategies like greedy decoding, beam search, or top-k sampling might be employed to generate coherent and contextually relevant sequences.
  • Through this multi-step process, LLMs can generate human-like text, understand context, and provide relevant responses or completions to prompts.

LLM Training Steps

  • At a top-level, here are steps involved in training LLMs:
    1. Corpus Preparation: Gather a large corpus of text data, such as news articles, social media posts, or web documents.
    2. Tokenization: Split the text into individual words or subword units, known as tokens.
    3. Embdedding Generation: Typically accomplished using a randomly initialized embedding table, via the nn.Embedding class in PyTorch. Pre-trained embeddings such Word2Vec, GloVe, FastText, etc. can also be used as a starting point for training. Note that these embeddings represent the non-contextualized vector form of the input token.
    4. Neural Network Training: Train a neural network model on the input tokens.
      • For encoder models such as BERT and its variants, the model learns to predict the context (surrounding words) of a given word which are masked. BERT is specifically trained on the Masked Language Modeling task (known as the Cloze task) and the Next Sentence Prediction objective; described in our BERT primer.
      • For decoder models such as GPT-N, LLaMA, etc., the model learns to predict the next token in the sequence, given the prior context of tokens.

Computing Similarity between Embeddings

  • For encoder models, contextualized embeddings are obtained at the output. Arithmetic operations can be performed on the embeddings to for various tasks such as understanding the similarity between two words, identifying word analogies, etc.
  • For the task of word similarity, the respective contextualized embeddings of the words can be used; while for sentence similarity, the output of the [CLS] token can be used or the word embeddings of all tokens can be averaged. For best performance on sentence similarity tasks, Sentence BERT variants of encoder models are preferred.
  • Word/sentence similarity is the measure of the degree to which two words/sentences are semantically equivalent in meaning.
  • Below are the two most common measures of word/sentence similarity (note that neither of them is a “distance metric”).

Dot Product Similarity

  • The dot product of two vectors \(u\) and \(v\) is defined as:
\[u \cdot v=|u||v| \cos \theta\]
  • It’s perhaps easiest to visualize its use as a similarity measure when \(\|v\|=1\), as in the diagram (source) below, where \(\cos \theta=\frac{u \cdot v}{\|u\|\|v\|} = \frac{u \cdot v}{\|u\|}\).

  • Here you can see that when \(\theta=0\) and \(\cos \theta=1\), i.e., the vectors are colinear, the dot product is the element-wise product of the vectors. When \(\theta\) is a right angle, and \(\cos \theta=0\), i.e. the vectors are orthogonal, the dot product is 0. In general, \(\cos \theta\) tells you the similarity in terms of the direction of the vectors (it is -1 when they point in opposite directions). This holds as the number of dimensions is increased, and \(\cos \theta\) thus has important uses as a similarity measure in multidimensional space, which is why it is arguably the most commonly used similarity metric.

Geometric intuition

  • The dot product between \(u, v\) can be interpreted as projecting \(u\) onto \(v\) (or vice-versa), and then taking product of projected length of \(u\) \((\|u\|)\) with length of \(v\) \((\|v\|)\).
  • When \(u\) is orthogonal to \(v\), projection of \(u\) onto \(v\) is a zero length vector, yielding a zero product. If you visualize all possible rotations of \(u\) while keeping \(v\) fixed, the dot product gives:
    • Zero value when \(u\) is orthogonal to \(v\) as the projection of \(u\) onto \(v\) yields a vector of zero length. This corresponds to the intuition of zero similarity.
    • Largest value of \(\|u\|\|v\|\) when \(u\) and \(v\) point in the same direction.
    • Lowest value of \(-\|u\|\|v\|\) when \(u\) and \(v\) point in opposite direction.
  • Dividing \(u \cdot v\) by the magnitude of \(u\) and \(v\), i.e., \(\|u\|\|v\|\), limits the range to \([-1,1]\) making it scale invariant, which is what brings us to cosine similarity.

Cosine Similarity

\[\text{cosine_similarity}(u,v) = \frac{u \cdot v}{\left\|u\right\|\left\|v\right\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \sqrt{\sum_{i=1}^{n} v_i^2}}\]
  • where,
    • \(u\) and \(v\) are the two vectors being compared.
    • \(\cdot\) represents the dot product.
    • \(\|u\|\) and \(\|v\|\) represent the magnitudes (or norms) of the vectors, and \(n\) is the number of dimensions in the vectors.
  • Note that as mentioned earlier, the length normalization part (i.e., dividing \(u \cdot v\) by the magnitude of \(u\) and \(v\), i.e., \(\|u\|\|v\|\)) limits the range to \([-1,1]\), making it scale invariant.

Cosine similarity vs. dot product similarity

  • Cosine similarity and dot product similarity are both techniques used to determine the similarity between vectors, which can represent things like text documents, user preferences, etc. The choice between the two depends on the specific use case and desired properties. Here’s a comparison of the advantages of cosine similarity over dot product similarity:
    • Magnitude Normalization: Cosine similarity considers only the angle between two vectors, ignoring their magnitudes. This is particularly useful when comparing documents of different lengths or vectors where magnitude isn’t representative of similarity. Dot product, on the other hand, can be affected by the magnitude of the vectors. A long document with many mentions of a particular term might have a high dot product with another document, even if the percentage of relevant content is low. Note that if you normalize your data to have the same magnitude, the two are indistinguishable. Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice, but if magnitude plays a role, dot product would be better as a similarity measure. In other words, cosine similarity is simply dot product, normalized by magnitude (hence is a value \(\in [0, 1]\)). Cosine similarity is preferable because it is scale invariant and thus lends itself naturally towards diverse data samples (with say, varying length). For instance, say we have two sets of documents and we computing similarity within each set. Within each set docs are identical, but set #1 documents are shorter, than set #2 ones. Dot product would produce different numbers if the embedding/feature size is different, while in both cases cosine similarity would yield comparable results (since it is length normalized). On the other hand, plain dot product is a little bit “cheaper” (in terms of complexity and implementation), since it involves lesser operations (no length normalization).
    • Bound Values: Cosine similarity returns values between -1 and 1 for vectors with non-negative dimensions, and specifically between 0 and 1 for all non-negative vectors (like in the case of TF-IDF representations of documents). This bounded nature can be easier to interpret. Dot product values can range from negative to positive infinity, which can make normalization or thresholding harder.
    • Robustness in High Dimensions: In high dimensional spaces, most pairs of vectors tend to be almost orthogonal, which means their dot products approach zero. However, their cosine similarities can still provide meaningful differentiation. Dot product can be highly sensitive to the magnitude of individual dimensions, especially in high-dimensional spaces.
    • Common Use Cases: Cosine similarity is extensively used in text analytics, information retrieval, and recommender systems because of its effectiveness in these domains. When representing text with models like TF-IDF, where vectors are non-negative and the magnitude might be influenced by the length of the text, cosine similarity is more appropriate. While dot product has its own strengths, it might not be as suitable for these use cases without additional normalization.
    • Intuitiveness: In many scenarios, thinking in terms of angles (cosine similarity) can be more intuitive than considering the raw projection (dot product). For instance, when two vectors point in the exact same direction (regardless of their magnitudes), their cosine similarity is 1, indicating perfect similarity.
    • Centroid Calculation: When trying to calculate the centroid (average) of multiple vectors (like in clustering), the centroid remains meaningful under cosine similarity. If you average the vectors and then compare another vector using cosine similarity, you get a measure of how similar the vector is to the “average” vector. This isn’t necessarily true with dot product. Despite these advantages, it’s worth noting that in some applications, especially in neural networks and deep learning, raw dot products (sometimes followed by a normalization step) are preferred due to their computational properties and the nature of learned embeddings. Always consider the specific application and the properties of the data when choosing between these measures.


  • Let’s delve into how reasoning works in LLMs; we will define reasoning as the “ability to make inferences using evidence and logic.” (source)
  • There are a multitude of varieties of reasoning, such as commonsense reasoning or mathematical reasoning.
  • Similarly, there are a variety of methods to elicit reasoning from the model, one of them being chain-of-thought prompting which can be found here.
  • It’s important to note that the extent of how much reasoning an LLM uses in order to give its final prediction is still unknown, since teasing apart the contribution of reasoning and factual information to derive the final output is not a straightforward task.

Retrieval/Knowledge-Augmented Generation or RAG (i.e., Providing LLMs External Knowledge)

  • In an industrial setting, cost-conscious, privacy-respecting, and reliable solutions are most desired. Companies, especially startups, are not looking to invest in talent or training models from scratch with no RoI on the horizon.
  • In most recent research and release of new chatbots, it’s been shown that they are capable of leveraging knowledge and information that is not necessarily in its weights. This paradigm is referred to as Retrieval Augmented Generation (RAG).
  • RAG enables in-context learning without costly fine-tuning, making the use of LLMs more cost-efficient. By leveraging RAG, companies can use the same model to process and generate responses based on new data, while being able to customize their solution and maintain relevance.
  • There are several ways we can accomplish this, first of those being leveraging another LM by iteratively calling it to extract information needed.
  • In the image below (source), we get a glimpse into how iteratively calling LM works:

  • Another method for LLM gaining external knowledge is through information retrieval via memory units such as an external database, say of recent facts. As such, there are two types of information retrievers, dense and sparse.
    • As the name suggests, sparse retrievers use sparse bag of words representation of documents and queries while dense (neural) retrievers use dense query and document vectors obtained from a neural network.
  • Yet another method is to leverage using agents which utilize APIs/tools to carry out a specializes task. The model chooses the most appropriate tool corresponding to a given input. With the help of tools like Google Search, Wikipedia and OpenAPI, LLMs can not only browse the web while responding, but also perform tasks like flight booking and weather reporting. LangChain offers a variety of different tools.
  • “Even though the idea of retrieving documents to perform question answering is not new, retrieval-augmented LMs have recently demonstrated strong performance in other knowledge-intensive tasks besides Q&A. These proposals close the performance gap compared to larger LMs that use significantly more parameters.” (source)

  • “With RAG, the external data used to augment your prompts can come from multiple data sources, such as a document repositories, databases, or APIs. The first step is to convert your documents and any user queries into a compatible format to perform relevancy search.
  • To make the formats compatible, a document collection, or knowledge library, and user-submitted queries are converted to numerical representations using embedding language models. Embedding is the process by which text is given numerical representation in a vector space.
  • RAG model architectures compare the embeddings of user queries within the vector of the knowledge library. The original user prompt is then appended with relevant context from similar documents within the knowledge library. This augmented prompt is then sent to the foundation model. You can update knowledge libraries and their relevant embeddings asynchronously.” source

  • RAG also plays a part in reducing hallucination


  • First step is to store the knowledge of your internal documents in a format that is suitable for querying. We do so by embedding all of your internally held knowledge using an embedding model:
    1. Split text corpus of the entire knowledge base into chunks – a chunk represents a single piece of context available to be queried. Keep in mind that the data of interest can be coming from multiple sources of different types, e.g. documentation in Confluence supplemented by PDF reports.
    2. Use the Embedding Model to transform each of the chunks into a vector embedding.
    3. Store all vector embeddings in a Vector Database.
    4. Save text that represents each of the embeddings separately together with the pointer to the embedding (we will need this later).
  • The following flowchart (source) illustrates the architecture of the system:

  • Next we can start constructing the answer to a question/query of interest:
    1. Embed a question/query you want to ask using the same Embedding Model that was used to embed the knowledge base itself.
    2. Use the resulting Vector Embedding to run a query against the index in Vector Database. Choose how many vectors you want to retrieve from the Vector Database - it will equal the amount of context you will be retrieving and eventually using for answering the query question.
    3. Vector DB performs an Approximate Nearest Neighbour (ANN) search for the provided vector embedding against the index and returns previously chosen amount of context vectors. The procedure returns vectors that are most similar in a given Embedding/Latent space.
    4. Map the returned Vector Embeddings to the text chunks that represent them.
    5. Pass a question together with the retrieved context text chunks to the LLM via prompt. Instruct the LLM to only use the provided context to answer the given question. This does not mean that no Prompt Engineering will be needed - you will want to ensure that the answers returned by LLM fall into expected boundaries, e.g. if there is no data in the retrieved context that could be used make sure that no made up answer is provided.
  • To make a real demo (for e.g., an interactive chatbot like ChatGPT), face the entire application with a Web UI that exposes a text input box to act as a chat interface. After running the provided question through steps 1. to 9. - return and display the generated answer. This is how most of the chatbots that are based on a single or multiple internal knowledge base sources are actually built nowadays.


  • RAG augments the knowledge base of an LM with relevant documents. Vector databases such as Pinecone, Chroma, Weaviate, etc. offer great solutions to augment LLMs. Open-soruce solutions such as Milvus and LlamaIndex are great options as well.
  • Here is RAG step-by-step:
    1. Chunk, embed, & index documents in a vector database (VDB).
    2. Match the query embedding of the claim advisor using (approximate) nearest neighbor techniques.
    3. Retrieve the relevant context from the VDB.
    4. Augment the LLM’s prompt with the retrieved content.
  • As a stack recommendation, you can build prototypes with LangChain or for more of an industrial flair, go with Google Vertex.
  • Another method recent LM’s have leveraged is the search engine itself such as WebGPT does. “WebGPT learns to interact with a web-browser, which allows it to further refine the initial query or perform additional actions based on its interactions with the tool. More specifically, WebGPT can search the internet, navigate webpages, follow links, and cite sources.” (source)

Context Length Scaling in Language Models (LLMs)

  • Language models like LLaMA have traditionally been bounded by a fixed context window size, which essentially means they can only consider a fixed number of previous tokens when predicting the next one. Positional embeddings are a core component in these models to provide a sense of sequence order. However, scaling these embeddings to accommodate longer sequences has its own set of challenges.
  • For example, GPT-4 has a context length of 32K input tokens, while Anthropic has 100K input tokens. To give an idea, The Great Gatsby is 72K tokens, 210 pages.
  • Let’s delve deeper into the context length issue found in Transformers:
    • Background:
      • Weight Matrix Shapes and Input Tokens:
        • In the Transformer model, the sizes (or shapes) of the learnable weight matrices do not depend on the number of input tokens (\(n\)). This means the architecture itself doesn’t change if you provide more tokens.
        • The implication of this is that a Transformer trained on shorter sequences (like 2K tokens) can technically accept much longer sequences (like 100K tokens) without any structural changes.
      • Training on Different Context Lengths:
        • While a Transformer model can accept longer sequences, it may not produce meaningful results if it hasn’t been trained on such long sequences. In the given example, if a model is trained only on 2K tokens, feeding it 100K tokens might result in outputs that are not accurate or coherent.
      • Computational Complexity and Costs: - The original Transformer has a quadratic computational complexity with respect to both the number of tokens (\(n\)) and the embedding size (\(d\)). This means that as sequences get longer, the time and computational resources needed for training increase significantly. - As a concrete example, the author points out that training a model called LLaMA on sequences of 2K tokens is already quite expensive (~$3M). Due to the quadratic scaling, if you tried to train LLaMA on sequences of 100K tokens, the cost would balloon to an estimated ~$150M.
    • Potential Solution:
      • Fine-tuning on Longer Contexts: - A potential solution to the problem might seem to be training a model on shorter sequences (like 2K tokens) and then fine-tuning it on longer sequences (like 65K tokens). This approach is often used in other contexts to adapt a model to new data or tasks. - However, this won’t work well with the original Transformer due to its Positional Sinusoidal Encoding. This encoding is designed to add positional information to the input tokens, but if the model is only familiar with shorter sequences, it might struggle to interpret the positional encodings for much longer sequences accurately.
  • We will look at other more viable solutions, such as Flash Attention, below.

Challenges with Context Scaling

  1. Fixed Maximum Length: Positional embeddings are configured for a predetermined sequence length. If a model is designed for 512 tokens, it possesses 512 distinct positional embedding vectors. Beyond this, there’s no positional information for extra tokens, making longer sequences problematic.
  2. Memory Overhead: Extending these embeddings for incredibly lengthy sequences demands more memory, especially if the embeddings are parameters that need to be learned during training.
  3. Computational Burden: The self-attention mechanism in Transformers grows in computational intensity with the length of the sequence. It’s quadratic in nature, meaning that even a modest increase in sequence length can substantially raise the computation time.

Solutions to Challenges

Positional Interpolation (PI)

  • Concept: Think of PI as a ‘resizing’ tool. Just as you might resize an image to fit a specific frame, PI adjusts position indices to fit within the existing context size. This is done using mathematical interpolation techniques.

  • Functionality: Suppose you trained a model for a 512-token context, but now want it to manage 1024 tokens. PI would transform positions [0,1,2,…,1023] to something like [0,0.5,1,…,511.5] to utilize the existing 512 embeddings.

  • Advantages: It’s like giving an old model a new pair of glasses. The model can now ‘see’ or process longer sequences without having to undergo rigorous training again from scratch.

  • Fine-tuning: After employing PI, models often need some brushing up. This is done through fine-tuning, where the model learns to adjust to its new sequence processing capability.

Rotary Positional Encoding (RoPE)

  • Concept: Rather than adding distinct positional information, RoPE rotates the existing embeddings based on their positions. By distributing positional data across all dimensions, the essence of sequence position is captured in a more fluid manner.

  • Functionality: RoPE employs mathematical operations to rotate the input embeddings. This allows the model to handle sequences that go beyond its original training without requiring explicit positional data for each new position.

  • Advantages: The rotation-based mechanism is more dynamic, meaning the model can work with sequences of any length without needing distinct embeddings for every position. This offers significant flexibility, especially when dealing with texts of unpredictable lengths.

  • Limitation: The continuous nature of RoPE’s rotation can cause some imprecision, especially when sequences become extremely lengthy.

ALiBi (Attention with Linear Biases)

  • Author: Ofer Press from FAIR et al.
  • While ALiBi does not directly increase context length, it enhances the Transformer’s adaptability to varied sequence lengths by introducing biases in the attention mechanism, optimizing its performance on extended contexts.
  • The original Transformer leveraged Positional Sinusoidal Encoding which did not have the ‘extrapolation’ ability, thus, it performed poorly during inference/ fine-tuning when the context length was increased.
    • For example, when you train a transformer model on sequences of a particular length (say 2K tokens) and later want to use it on longer sequences (like 65K tokens), this encoding does not effectively “extrapolate” to these longer sequences. This means that the model starts to perform poorly when dealing with sequences longer than what it was trained on.
  • AliBi is an alternative to Positional Sinusoidal Encoding and is a modification to the attention mechanism within the Transformer architecture. Instead of adding positional information at the start (or bottom) of the model, ALiBi integrates this information within the attention mechanism itself.
  • In the attention mechanism, attention scores are computed between query and key pairs. ALiBi introduces a bias to these attention scores based on the distance between tokens in the sequence. Specifically, the farther apart two tokens are in the sequence, the more penalty or bias is added to their attention score. This bias ensures that the model is aware of the token positions when calculating attention.
  • Benefits of ALiBi:
    • Adaptability: Unlike Positional Sinusoidal Encoding, ALiBi is more adaptable to different sequence lengths, making it more suitable for models that need to handle varying sequence lengths during training and inference.
    • Training Speed: Incorporating ALiBi can speed up the training process.
  • The image below depicts the constant bias added from the original ALiBi paper.

Sparse Attention

  • Sparse attention is another modification to self-attention and it exploits the reasoning that not all tokens within your content size are relevant to each other.
  • Thus, it considers only some tokens when calculating the attention score, and adds sparsity to make the computation linear not quadratic w.r.t. input token size.
  • There are many ways in which sparse attention can be implemented and we will look at a few below:
    • Sliding Window Attention or Local: implements a fixed-size window attention surrounding each token. Here, each token doesn’t look at all other tokens but only a fixed number around it, defined by a window size \(w\). If \(w\) is much smaller than \(n\), this significantly reduces computations. However, information can still flow across the entire sequence since each token can pass its information to its neighbors, who pass it to their neighbors, and so on.
    • BigBird Attention: Another approach, introduced in the BigBird model, combines different types of attention: some tokens attend globally (to all tokens), some attend locally (like the sliding window), and some attend to random tokens. This combination ensures efficient computation while maintaining a good flow of information across the entire sequence.
  • The image below, source depicts full attention and how it can be viewed as a graph.

  • In contrast, the image below, source depicts sparse attention specifically from the BigBird paper.

Flash Attention

  • FlashAttention optimizes the attention mechanism for GPUs by breaking computations into smaller blocks, reducing memory transfer overheads and enhancing processing speed. Let’s see how below:
  • Background context:
    • Remember from earlier, transformers utilize an attention mechanism that involves several computational steps to determine how much focus each word in a sentence should have on other words. These steps include matrix multiplications, as illustrated by the operations: S = Q * K, P = softmax(S), and O = P * V.
    • However, when processing these operations on a GPU, there are some inefficiencies that slow down the computation:
      1. GPU Memory Hierarchy: GPUs have different types of memory. SRAM (Static Random-Access Memory) is fast but has limited size, while HBM (High Bandwidth Memory) has a much larger size but is slower. For effective GPU operations, data needs to be loaded into the quick SRAM memory. But due to SRAM’s limited size, larger intermediate results (like the matrices P, S, and O) need to be stored back into the slower HBM memory, which adds overheads.
      2. Memory Access Overhead: Constantly moving these large intermediate results (P, S, and O) between the SRAM and HBM memories creates a bottleneck in performance.
  • Solution - FlashAttention:
  • FlashAttention was introduced to optimize these operations for GPUs. Instead of computing the attention for the entire matrices at once, FlashAttention breaks them down into smaller blocks or tiles:
  1. Tiling: The Q, K, and V matrices are divided into smaller blocks. These blocks are then loaded from the HBM to the faster SRAM for computation.
  2. Optimized Computation: Within each block, the attention output is computed without the need to constantly shift large intermediate results between the two types of memory. This means fewer transfers between the slow and fast memories, which leads to speed improvements.
  3. Optimized for GPU: While individual operations like matrix multiplication are already efficient on GPUs, FlashAttention makes the entire attention layer more GPU-friendly by minimizing memory transfers and fusing several operations.
    • The result of these optimizations is a significant speedup in both training and inference times. Moreover, this optimization is now integrated into popular frameworks like PyTorch 2.0, making it easily accessible for developers.
    • In essence, FlashAttention is a smart way to restructure and execute the attention mechanism’s computations on GPUs, minimizing the bottlenecks caused by the GPU’s memory architecture.
  • The image below, source, depicts Flash Attention from the original paper.

Multi-Query Attention

  • Background context:
    • Multi-Head Attention (MHA) in Transformers:
      • In the original Transformer architecture, the attention mechanism uses multiple heads. Each of these heads independently calculates its own attention scores by projecting the input into different “query”, “key”, and “value” spaces using separate weight matrices. The outputs from all heads are then concatenated and linearly transformed to produce the final result.
    • The Challenge with MHA:
      • While MHA allows the model to focus on different parts of the input simultaneously, it has its costs. One such cost is memory usage. During the inference stage, especially in decoders, previous tokens’ “keys” and “values” are cached so that the model doesn’t need to recompute them for each new token. However, as more tokens are processed, the cache grows, consuming more GPU memory.
  • Introducing Multi-Query Attention (MQA):
    • MQA is an optimization over the standard MHA. Instead of each head having separate weight matrices for projecting the “key” and “value”, MQA proposes that all heads share a common weight matrix for these projections.
  • Advantages of MQA:
    1. Memory Efficiency: By sharing weights for the “key” and “value” projections across heads, you significantly reduce the memory required for caching during inference. For instance, a model with 96 heads, like GPT-3, can reduce its memory consumption for the key/value cache by up to 96 times.
    2. Speed in Inference: Since you’re now working with shared projections and a reduced cache size, the calculation of attention scores during inference becomes faster. This is especially beneficial when generating longer sequences of text.
    3. Maintains Training Speed: Despite these changes, the training speed remains largely unaffected. This means you get the advantages of MQA without any significant downside in terms of training time.

Comparative Analysis

  • Stability: PI’s methodology, in certain settings, can be more consistent in performance than RoPE.

  • Approach: While PI essentially ‘squeezes’ or ‘stretches’ positional indices to align with existing embeddings, RoPE modifies the very nature of how embeddings encapsulate position information.

  • Training Dynamics: Post-PI models often crave some refinement to accommodate the interpolated positions, whereas RoPE’s intrinsic design means it’s already geared up for variable sequence lengths without additional training.

  • Flexibility: RoPE’s absence of dependency on fixed embeddings gives it an edge, as it can gracefully handle sequences of any conceivable length.

Dynamically Scaled RoPE

  • Static RoPE can sometimes force a compromise between catering to very long sequences and maintaining efficacy on shorter ones. Enter the dynamic variant of RoPE, which seeks to fluidly adjust scaling based on the sequence’s length, offering the best of both worlds.


  • Adaptivity: Instead of a one-size-fits-all scaling, this method tailors the scaling based on the present sequence length. It’s like adjusting the focus of a camera lens based on the subject’s distance.

  • Scale Dynamics: The model starts with precise position values for the initial context (up to the first 2k tokens). Beyond that, it recalibrates the scaling factor in real-time, relative to how the sequence length evolves.

Key Benefits

  • Performance Boost: Dynamic scaling typically exhibits enhanced efficacy compared to its static counterpart and other techniques like NTK-Aware.

  • Versatility: The real-time adaptability ensures that the model remains effective across a wide spectrum of sequence lengths.

NTK-Aware Method Perspective

  • Performance Metrics: NTK-Aware may falter a bit with shorter sequences, but it tends to flourish as sequences grow.

  • Parameter Dynamics: Both dynamic RoPE and NTK-Aware possess parameters influencing their performance over different sequence lengths. The distinction lies in dynamic RoPE’s ability to adjust these parameters on-the-fly based on the current sequence length, enhancing its responsiveness.


  • Dynamically Scaled RoPE, with its adaptive nature, represents a promising direction in the quest for more versatile and efficient language models. By dynamically adjusting to varying sequence lengths, it ensures that models like LLaMA maintain optimal performance across diverse contexts.
  • The below infographic (source) performs a comparative analysis between traditional databases and vector databases.

When not to use Vector DBs?

  • Credits for this section go to Prithivi Da.
  • Vector DBs are “leaky abstractions”.

A leaky abstraction is an abstraction that exposes details and limitations of its underlying implementation to its users that should ideally be hidden away

  • Encoders are usecase specific, so you need to know which encoder and hidden dimension size will yield the best representation. For some-cases / domains you have to train your own instead of using a pre-trained one.
  • Default similarity and distance functions for relevance may not be good for all usecase. Cosine is default for most VdBs (which tools like langchain blindly keep). For instance if the indexed data is noisy, Jaccard similarity is robust to noise, while cosine similarity is not.
  • LSH isn’t a good option for multi billion record scale hence Milvus skipped it keeping only IVF-PQ and HSNW; you should know when to use what.
  • If your case requires “read-your-own-writes” the latency of encoding and indexing cannot tolerate your needs.
  • Total Cost of Ownership (TCO) is higher compared to traditional and hybrid data stores.
  • Backfilling can be very slow if your historical dataset is huge.

Knowledge Graphs with LLMs: Best of Both Worlds

  • Credits to the following section go to Tony Seale.
  • The recent increasing significance on LLMs within organisations is not just a fleeting fad but part of a transformative shift that all forward-thinking organisations must come to terms with. However, for an organisation to succeed in this transition, effectively leveraging ontologies (of which knowledge graphs are a popular instantiation) is a crucial factor.
  • LLMs possess remarkable AI capabilities, allowing them to comprehend and generate human-like text by learning intricate patterns from vast volumes of training data. These powerful models are capable of crafting eloquent letters, analysing data, generating code, orchestrating workflows, and performing a myriad of other complex tasks. Their potential seems increasingly disruptive, with Microsoft even ‘betting the house’ on them.
  • However, when deploying LLMs within an enterprise context, reliability, trustworthiness, and understandability are vital concerns for those running and governing these systems. Hallucination is simply not an option.
  • Ontologies offer structured and formal representations of knowledge, defining relationships between concepts within specific domains. These structures enable computers to comprehend and reason in a logical, consistent, and comprehensible manner. Yet, designing and maintaining ontologies requires substantial effort. Before LLMs came along, they were the ‘top dog in town’ when it came to a semantic understanding, but now they seem relatively inflexible, incomplete and slow to change.
  • Enter the intriguing and powerful synergy created by the convergence of LLMs AND Ontologies. The ability of LLMs to generate and extend ontologies is a game-changer. Although you still need a ‘human-in-the-loop,’ the top LLMs demonstrate surprising effectiveness. Simultaneously, ontologies provide vital context to the prompts given to LLMs, enriching the accuracy and relevance of the LLM’s responses. Ontologies can also be used to validate the consistency of those responses.
  • LLMs can help discover new knowledge, and the ontologies compile that knowledge down for future use.
  • This collaborative partnership between LLMs and ontologies establishes a reinforcing feedback loop of continuous improvement. As LLMs help generate better ontologies faster and more dynamically, the ontologies, in turn, elevate the performance of LLMs by offering a more comprehensive context of the data and text they analyse. This positive feedback loop has the potential to catalyse an exponential leap in the capabilities of AI applications within organisations, streamlining processes, adding intelligence, and enhancing customer experiences like never before.

Continuous v/s Discrete Knowledge Representation

  • Credits to the following section go to Tony Seale.
  • We can think of information existing in a continuous stream or in discrete chunks. Large Language Models (LLMs) fall under the category of continuous knowledge representation, while Knowledge Graphs belong to the discrete realm. Each approach has its merits, and understanding the implications of their differences is essential.
  • LLM embeddings are dense, continuous real-valued vectors existing in a high-dimensional space. Think of them as coordinates on a map: just as longitude and latitude can pinpoint a location on a two-dimensional map, embeddings guide us to rough positions in a multi-dimensional ‘semantic space’ made up of the connections between the words on the internet. Since the embedding vectors are continuous, they allow for an infinite range of values within a given interval, making the embeddings’ coordinates ‘fuzzy’.
  • An LLM embedding for ‘Jennifer Aniston’ will be a several-thousand-dimensional continuous vector that leads to a location in a several-billion-parameter ‘word-space’. If we add the ‘TV series’ embedding to this vector then I will be pulled towards the position of the ‘Friends’ vector. Magic! But this magic comes with a price: you can never quite trust the answers. Hallucination and creativity are two sides of the same coin.
  • On the other hand, Knowledge Graphs embrace a discrete representation approach, where each entity is associated with a unique URL. For example, the Wikidata URL for Jennifer Aniston is https://www.wikidata.org/wiki/Q32522. This represents a discrete location in ‘DNS + IP space’. Humans have carefully structured data that is reliable, editable, and explainable. However, the discrete nature of Knowledge Graphs also comes with its own price. There is no magical internal animation here; just static facts.

The “Context Stuffing” Problem

  • Research shows that providing LLMs with large context windows – “context stuffing” – comes at a cost and performs worse than expected.
  • Less is More: Why Use Retrieval Instead of Larger Context Windows summarizes two studies showing that:

    1. LLMs tend to struggle in distinguishing valuable information when flooded with large amounts of unfiltered information. Put simply, answer quality decreases, and the risk of hallucination increases, with larger context windows.
    2. Using a retrieval system to find and provide narrow, relevant information boosts the models’ efficiency per token, which results in lower resource consumption and improved accuracy.
    3. The above holds true even when a single large document is put into the context, rather than many documents.
    4. Costs increase linearly with larger contexts since processing larger contexts requires more computation. LLM providers charge per token which means a longer context (i.e, more tokens) makes each query more expensive.
  • LLMs seem to provide better results when given fewer, more relevant documents in the context, rather than large numbers of unfiltered documents.

RAG for limiting hallucination

  • Hallucination is typically caused due to imperfections in training data, lack of access to external, real-world knowledge, and limited contextual understanding from prompts.
  • RAG (using either an agent or an external data-source such as a Vector DB) can serve as a means to alleviate model hallucination and improve accuracy.
  • Furthermore, augmenting the prompt using examples is another effective strategy to reduce hallucination.
  • Another approach which has recently gained traction is plan-and-execute where the model is asked to first plan and then solve the problem step-by-step while paying attention to calculations.
  • Lastly, as contaminated training data can cause hallucinations, cleaning up the data and fine-tuning your model can also help reduce hallucinations. However, as most models are large to train or even fine-tune, this approach should be used while taking the cost-vs-accuracy tradeoff into consideration.

LLM Knobs

  • When working with prompts, you interact with the LLM via an API or directly. You can configure a few parameters to get different results for your prompts.

  • Temperature: In short, the lower the temperature, the more deterministic the results in the sense that the highest probable next token is always picked. Increasing temperature could lead to more randomness, which encourages more diverse or creative outputs. You are essentially increasing the weights of the other possible tokens. In terms of application, you might want to use a lower temperature value for tasks like fact-based QA to encourage more factual and concise responses. For poem generation or other creative tasks, it might be beneficial to increase the temperature value.

  • Top_p: Similarly, with top_p, a sampling technique with temperature called nucleus sampling, you can control how deterministic the model is at generating a response. If you are looking for exact and factual answers keep this low. If you are looking for more diverse responses, increase to a higher value.

  • The general recommendation is to alter one, not both.

  • Before starting with some basic examples, keep in mind that your results may vary depending on the version of LLM you use.

Token Sampling

Prompt Engineering

Token healing

  • This section is leveraged from Guidance by Microsoft.
  • The standard greedy tokenizations used by most LLMs introduce a subtle and powerful bias that can have all kinds of unintended consequences for your prompts. “Token healing” automatically removes these surprising biases, freeing you to focus on designing the prompts you want without worrying about tokenization artifacts.

  • Consider the following example, where we are trying to generate an HTTP URL string:
# we use StableLM as an open example, but these issues impact all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('''The link is <a href="http:''')

  • Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). It instead creates an invalid URL string with a space in the middle. Why? Because the string “://” is its own token (1358), and so once the model sees a colon by itself (token 27), it assumes that the next characters cannot be “//”; otherwise, the tokenizer would not have used 27 and instead would have used 1358 (the token for “://”).

  • This bias is not just limited to the colon character – it happens everywhere. Over 70% of the 10k most common tokens for the StableLM model used above are prefixes of longer possible tokens, and so cause token boundary bias when they are the last token in a prompt. For example the “:” token 27 has 34 possible extensions, the “ the” token 1735 has 51 extensions, and the “ “ (space) token 209 has 28,802 extensions). guidance eliminates these biases by backing up the model by one token then allowing the model to step forward while constraining it to only generate tokens whose prefix matches the last token. This “token healing” process eliminates token boundary biases and allows any prompt to be completed naturally:

guidance('The link is <a href="http:')()

Evaluation Metrics

  • Evaluating Large Language Models (LLMs) often requires a combination of traditional and more recent metrics to gain a comprehensive understanding of their performance. Here’s a deep dive into some key metrics and how they’re applied to LLMs:
  1. Perplexity (PPL):
    • Definition: Perplexity measures how well a probability model predicts a sample. In the context of language models, it indicates the model’s uncertainty when predicting the next token in a sequence. A lower perplexity score implies the model is more confident in its predictions.
    • Calculation: Given a probability distribution ( p ) and a sequence of ( N ) tokens, [ \text{PPL}(p) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i)\right) ]
    • Usage: It’s commonly used in the initial stages of LLM development as a sanity check and for model selection during hyperparameter tuning. However, while perplexity provides an overall measure of how well the model fits the data, it doesn’t necessarily correlate directly with performance on specific tasks.
  2. BLEU (Bilingual Evaluation Understudy) Score:
    • Definition: Originally designed for machine translation, BLEU scores measure how many n-grams (phrases of n words) in the model’s output match the n-grams in a reference output.
    • Usage: While primarily used for translation tasks, BLEU scores have been adapted for other generative tasks as a measure of content quality and relevance.
  3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    • Definition: Used mainly for summarization tasks, ROUGE compares the overlap between the n-grams in the generated text and a reference text.
    • Usage: ROUGE can capture various dimensions like precision, recall, or F1 score based on overlapping content.
  4. METEOR (Metric for Evaluation of Translation with Explicit ORdering):
    • Definition: Another metric for translation quality, METEOR considers precision and recall, synonymy, stemming, and word order.
    • Usage: METEOR gives a more holistic evaluation of translation outputs compared to BLEU.
  5. Fidelity and Faithfulness:
    • Definition: These metrics measure whether the generated content retains the meaning of the source without introducing any false information.
    • Usage: Especially important in tasks like summarization or paraphrasing where the content’s meaning must be preserved.
  6. Diversity Metrics:
    • Definition: Evaluate how varied the outputs of a model are, especially in generative tasks. Measures might include distinct n-grams or entropy.
    • Usage: Ensure that the model doesn’t overfit to particular phrases or produce monotonous outputs.
  7. Entity Overlap:
    • Definition: Measures the overlap of named entities between generated content and reference text.
    • Usage: Can be particularly relevant in tasks like question-answering or summarization where retaining key entities is crucial.
  8. Completion Metrics:
    • Definition: Used for tasks where a model must complete a given prompt, these metrics assess how relevant and coherent the completions are.
    • Usage: Relevant in chatbot interactions or any prompt-response generation.

Methods to Knowledge-Augment LLMs

  • Let’s look at a few methodologies to knowledge-augment LLMs:
    • Few-shot prompting: it requires no weight updates and the reasoning and acting abilities of the LM are tied to the provided prompt, which makes it very powerful as a method in teaching the LM what the desired outputs are.
    • Fine-tuning: Complementary to few-shot prompting, via supervised learning we can always fine-tune and update the weights of the parameters.
    • Prompt pre-training: “A potential risk of finetuning after the pre-training phase is that the LM might deviate far from the original distribution and overfit the distribution of the examples provided during fine-tuning. To alleviate this issue, Taylor et al. (2022) propose to mix pre-training data with labeled demonstrations of reasoning, similar to how earlier work mixes pre-training data with examples from various downstream tasks (Raffel et al. 2020); however, the exact gains from this mixing, compared to having a separate fine-tuning stage, have not yet been empirically studied. With a similar goal in mind, Ouyang et al. (2022) and Iyer et al. (2022) include examples from pre-training during the fine-tuning stage.” (source)
    • Bootstrapping: “This typically works by prompting a LM to reason or act in a few-shot setup followed by a final prediction; examples for which the actions or reasoning steps performed did not lead to a correct final prediction are then discarded.” (source)
    • Reinforcement Learning: “Supervised learning from human-created prompts is effective to teach models to reason and act” (source).

Fine-tuning vs. Prompting

  • You can think of fine-tuning as a more-powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Finetuning can work with as few as 50 examples but offers optimal performance with thousands to tens of thousands if possible.
  • Prompting has some big advantages over fine-tuning, as follows:
    • It’s way easier/faster to iterate on your instructions than label data and re-train a model.
    • Operationally, it’s easier to deploy one big model and just adjust its behavior as necessary vs deploying many small fine-tuned models that will likely each get lower utilization.
  • On the other hand, the benefits of fine-tuning are as follows:
    • The biggest advantage is that it it is far more effective at guiding a model’s behavior than prompting (leading to better performance), so you can often get away with a much smaller model. This enables faster responses and lower inference costs. For e.g., a fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!
    • Fine-tuning enables check-pointing of your model with relevant data; while prompting requires stuffing up your prompt every single time with relevant data (which is an exercise that needs to be repeated per inference run).
    • Fine-tuning costs can be categorized as NRE costs; they’re one-off while with prompt tuning, per-token costs with a stuffed prompt can accumulate with every run (so the apt choice of technique for a particular use-case depends on the amount of inference runs are planned).
    • With LoRA-based schemes (such as QLoRA), you can fine-tune a minimal (sub-1%) fraction of parameters and still be able to render great performance levels compared to prompting.

Augmenting LLMs with Knowledge Graphs


  • Per Tony Seale,
    • The butcher-on-the-bus is a rhetorical device that sheds light on human memory processes. Imagine recognising someone on a bus but struggling to place their identity. Without a doubt, you know them, but it takes a moment of reflection before it hits you … a-ha! They’re the local butcher!
    • This scenario illustrates how our memory seemingly comprises two types: one that is flexible, fuzzy, generalisable, and gradually learned, and another that is specific, precise, and acquired in a single shot.
    • Could this dualistic model enhance AI systems? LLMs learn statistical approximations from text corpora, granting them generalisation, flexibility, and creativity. However, they also suffer from hallucinations, unreliability, and staleness. On the other hand, databases offer accuracy, speed, and reliability but lack adaptability and intelligence.
    • Perhaps the key lies in bridging these two worlds, and that’s where graphs come into play. By integrating LLMs with internal data through Knowledge Graphs (KGs), we can create a Working Memory Graph (WMG) that combines the strengths of both approaches in order to achieve a given task.
    • To build a WMG, the LLM processes a question and returns a graph of nodes using URLs as identifiers, these URLs link to ground truths stored in the organisation’s Knowledge Graph. The WMG can also incorporate nodes representing conceptual understanding, establishing connections between the LLM’s numerical vectors and the KG’s ontological classes.
  • Thus, combining the best of both worlds (LLMs with their reasoning capabilities along with KGs with their structured, static ontology) can yield a system with the structured knowledge capabilities of knowledge graphs as well as the reasoning capabilities of LLMs. This will enable unleashing the true potential of your organisation’s knowledge assets. Combining the power of Large Language Models (LLMs) with the reliability of knowledge graphs can be a game-changer. However, bridging the gap between these two representations has been an ongoing challenge. More on this in the next section.


  • Per Tony Seale, a simple and pragmatic technique to connect your knowledge graph to LLMs effectively is as follows:
    • Extract Relevant Nodes: Begin by pulling all the nodes that you wish to index from your Knowledge Graph, including their descriptions:
        rows = rdflib_graph.query(‘SELECT * WHERE {?uri dc:description ?desc}’)
    • Generate Embedding Vectors: Employ your large language model to create an embedding vector for the description of each node:
        node_embedding = openai.Embedding.create(input = row.desc, model=model) ['data'][0]['embedding']
    • Build a Vector Store: Store the generated embedding vectors in a dedicated vector store:
        index = faiss.IndexFlatL2(len(embedding))
    • Query with Natural Language: When a user poses a question in natural language, convert the query into an embedding vector using the same language model. Then, leverage the vector store to find the nodes with the lowest cosine similarity to the query vector:
        question_embedding = openai.Embedding.create(input = question, model=model) ['data'][0]['embedding']
        d, i = index.search(question_embedding, 100)
    • Semantic Post-processing: To further enhance the user experience, apply post-processing techniques to the retrieved related nodes. This step refines the results and presents information in a way that best provides users with actionable insights.

Summary of LLMs

  • The following table (source) offers a summary of large language models, including original release date, largest model size, and whether the weights are fully open source to the public:

🤗 Open LLM Leaderboard

  • With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art. The 🤗 Open LLM Leaderboard aims to track, rank and evaluate LLMs and chatbots as they are released.

🤗 Massive Text Embedding Benchmark (MTEB) Leaderboard

  • Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation.
  • To solve this problem, Muennighoff et al. from HuggingFace and cohere.ai introduce the Massive Text Embedding Benchmark (MTEB) Leaderboard. MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
  • Through the benchmarking of 33 models on MTEB, they establish the most comprehensive benchmark of text embeddings to date. The following figure from the paper shows an overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade

  • They find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.
  • Paper.

Extending prompt context


  • As LLMs become ubiquitous, their applications to long sequences have been a key focus, especially for applications like summarizing text (potentially interleaved with other data sources like tables and images), writing code, and predicting protein sequences, which require the model to effectively consider long distance structural dependencies. A large context allows a pre-trained LLM to look at customer data (e.g., documents the LLM did not use in training) and responds to useful information seeking queries.
  • Yet, most open-source LLMs (e.g., LLaMA, MPT, Falcon) have been trained with a maximum of 2K token sequence length, which is a key limitation in modeling long sequences. Inference time solutions such as ALiBi have yet to be evaluated for larger models (e.g. MPT-7b-StoryWriter-65k+). Recent work on model scaling has shown that for a given compute budget, the best performances are not necessarily achieved by the largest models, but by smaller models trained on more data (measured by number of tokens). A smaller model is also generally preferred for inference efficiency during serving including on-device serving.
  • In light of this, LLMs with extended prompt contexts offer various advantages especially for use-cases centered around extracting long-range dependencies such as long-document question answering and summarization.

Status quo

  • Open-source LLMs are behind commercial models when it comes to context length. For example, OpenAI’s GPT-3.5 now has a context length of 16k, GPT-4 of 32k, and Anthropic’s Claude up 100k, while Meta’s LLaMA or Technology Innovation Institute’s Falcon only have a context length of 2k.
  • But it is possible to extend the context length of open-source models like LLaMa either post-pre-training or during pre-training; here are two amazing blog posts:
  • The problem we’re looking to address is the quadratic time and space complexity of attention layer computations w.r.t. the number of input tokens \(n\).
  • The second problem occurs when the embedding size \(d > n\), in which case the linear layers exhibit a quadratic time complexity w.r.t. embedding size \(d\).
  • The third problem is Absolute/Sinusoidal Positional Embedding used in the original architecture.
  • In Transformer architecture, the shapes of learnable matrix weights are agnostic to the number of input tokens \(n\).
  • Thus, a Transformer trained with a 2K context length can consume tokens of any length, even 100K. But the model will not produce meaningful results on 100K tokens during inference if it isn’t trained with a 100K context length.
  • “Training the vanilla Transformer on a giant corpus and only on a large context length is unfeasibly expensive due to the quadratic complexity w.r.t. to \(n\) and \(d\). LLaMA on 2K context length was estimated to be trained for ~$3M. Thus, LLaMA on 100K would cost ~$150M.” (source)

Extending Context Length via RoPE scaling

  • The technique was originally proposed by u/emozilla on Reddit as “Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning” and allows us to scale out the context length of models without fine-tuning by dynamically interpolating RoPE to represent longer sequences while preserving performance.
  • While it works well out of the box, performance can be further improved by additional fine-tuning. With RoPE scaling, companies can now easily extend open-source LLMs to the context lengths which work for their given use case.
  • From the Reddit post:
    • “When u/kaiokendev first posted about linearly interpolating RoPE for longer sequences, I (and a few others) had wondered if it was possible to pick the correct scale parameter dynamically based on the sequence length rather than having to settle for the fixed tradeoff of maximum sequence length vs. performance on shorter sequences. My idea was to use the exact position values for the first 2k context (after all, why mess with a good thing?) and then re-calculate the position vector for every new sequence length as the model generates token by token. Essentially, set scale to original model context length / current sequence length. This has the effect of slowly increasing scale as the sequence length increases.
    • I did some experiments and found that this has very strong performance, much better than simple linear interpolation. When u/bloc97 posted his NTK-Aware method, it was much closer to this dynamic linear scaling in terms of performance. Compared to dynamic linear scaling, NTK-Aware has higher perplexity for shorter sequences, but better perplexity at the tail end of the sequence lengths. Unfortunately, it also suffers from catastrophic perplexity blowup, just like regular RoPE and static linear scaling.
    • The main hyperparamter of NTK-Aware is \(\alpha\). Like static linear scaling, it represents a tradeoff between short/long sequence performance. So I thought, why not use the same dynamic scaling method with NTK-Aware? For Dynamic NTK, the scaling of \(\alpha\) is set to (\(\alpha\) * current sequence length / original model context length) - (\(\alpha\) - 1). The idea again is to dynamically scale the hyperparameter as the sequence length increases.

    • This uses the same methodology as NTK-Aware (perplexity on GovReport test). You can check out all the code on GitHub.”
  • Hugging Face Transformers now supports RoPE-scaling (rotary position embeddings) to extend the context length of large language models like LLaMA, GPT-NeoX, or Falcon.
  • So in essence, RoPE scaling dynamically rescales relative position differences based on the input length, analogous to a rope stretching and contracting.

Summary of tricks to extend prompt context

  • From The Secret Sauce behind 100K context window in LLMs: all tricks in one place, we can summarize all the tricks to extend prompt context as follows:
    1. One option is to train the model on 2K tokens context and then fine-tune it in longer contexts (for example, 65K). But it won’t work with the original Transformer because of the Absolute/Sinusoidal Positional Encoding. To address this, remove Absolute/Sinusoidal Positional Encoding and use ALiBi, a simple and elegant positional embedding that doesn’t hurt accuracy. Then you can train on 2K and fine-tune on 100K.
    2. You don’t need to calculate attention scores between all tokens. Some tokens are more important than others, so Sparse Attention can be used. It will speed up both training and inference.
    3. Flash Attention efficiently implements the attention layer for GPU. It uses tiling and avoids materialization of big intermediate matrices \((n, n)\) that do not fit into GPU SRAM. It will speed up both training and inference.
    4. Multi-Query attention instead of Multi-Head attention. That means you share weights across all heads when linearly projecting \(K\) and \(V\). It dramatically speeds up incremental inference.
    5. Conditional computation avoids applying all model parameters to all tokens from the input sequence. CoLT5 applies heavy computations only to the most important tokens and processes the rest of the tokens with a lighter version of layers. It will speed up both training and inference.
    6. To fit a large context, you need a lot of RAM in GPU, so people use 80GB A100 GPUs.

To sum up, the more you speed up the training and inference, the larger the context length you can use.

Large prompt context models

  • Anthropic AI announced that they are expanding Claude’s context window to 100k tokens, tripling GPT-4’s maximum of 32k.
  • For scale: The first Harry Potter book has 76,944 words, which is ~100k tokens after tokenization.
  • Larger context windows significantly elevate LLMs’ capabilities across a wide range of applications:
    1. Improved comprehension of lengthy and complex texts: by accessing a greater portion of the text, LLMs can generate responses and create content that is contextually relevant, more accurate, comprehensive, and coherent. This opens the door for processing extensive documents such as academic articles or legal contracts with more accuracy.
    2. Reduced need for fine-tuning: longer prompts can support advanced prompting techniques such as Chain of Thought and Few-Shot Learning, improving the LLM’s performance at inference time.
    3. Enhanced ability to summarize and synthesize information: with a greater understanding of entire documents, LLMs can generate summaries that encapsulate the key findings and more accurately.
    4. Improved context: conversational AI systems often struggle to maintain context during extended interactions. A larger context window can store more significant portions of the conversation history, leading to more coherent and contextually appropriate responses.
  • Over time, this could gradually diminish the need for vector store approaches for external knowledge retrieval in LLMs because you could now include the information as regular input.
  • It will likely make LLMs more efficient few-shot learners as well since more examples can now be provided via the context. However, this will likely not be a replacement for finetuning yet. Finetuning not only optimizes LLMs for domain-specific datasets, but it also helps to optimize them for a target task.
  • As an analogy, a person who specifically studied for a math exam will perform better than a random person who is only given past exams as examples without studying. Moreover, you can combine the two: apply in-context learning to finetuned models (a person who studied the exam subject and also uses past exams as examples).
  • MosaicML also announced MPT-65K, an LLM that can handle 65k tokens.

Scaling Transformer to 1M tokens and beyond with RMT

  • This technical report by Bulatov et al. from DeepPavlov, Artificial Intelligence Research Institute (AIRI), and London Institute for Mathematical Sciences presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing.
  • By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model’s effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy.
  • Their method allows for the storage and processing of both local and global information and enables information flow between segments of the input sequence through the use of recurrence. - Their experiments demonstrate the effectiveness of RMT, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks as well as enable large-scale context processing for memory-intensive applications.
  • The following figure from the paper shows memory-intensive synthetic tasks. Synthetic tasks and the required RMT operations to solve them are presented. In the Memorize task, a fact statement is placed at the start of the sequence. In the Detect and Memorize task, a fact is randomly placed within a text sequence, making its detection more challenging. In the Reasoning task, two facts required to provide an answer are randomly placed within the text. For all tasks, the question is at the end of the sequence. ’mem’ denotes memory tokens, ’Q’ represents the question, and ’A’ signifies the answer.

Hyena Hierarchy: Towards Larger Convolutional Language Models

  • Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability.
  • This paper by Poli et al. from proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. Guided by these findings, we introduce the Hyena hierarchy, an operator defined by a recurrence of two efficient subquadratic primitives: a long convolution and element-wise multiplicative gating (see figure below from the paper). A specified depth (i.e., number of steps) of the recurrence controls the size of the operator. For short recurrences, existing models are recovered as special cases. By mapping each step in the Hyena recurrence to its corresponding matrix form, we reveal Hyena operators to be equivalently defined as a decomposition of a data-controlled matrix i.e., a matrix whose entries are functions of the input. Furthermore, we show how Hyena operators can be evaluated efficiently without materializing the full matrix, by leveraging fast convolution algorithms. Empirically, Hyena operators are able to significantly shrink the quality gap with attention at scale, reaching similar perplexity and downstream performance with a smaller computational budget and without hybridization of attention.
  • The following figure from the paper illustrates the Hyena operator is defined as a recurrence of two efficient subquadratic primitives: an implicit long convolution \(h\) (i.e. Hyena filters parameterized by a feed-forward network) and multiplicative elementwise gating of the (projected) input. The depth of the recurrence specifies the size of the operator. Hyena can equivalently be expressed as a multiplication with data-controlled (conditioned by the input \(u\)) diagonal matrices \(\mathrm{D}_x\) and Toeplitz matrices \(\mathrm{S}_h\). In addition, Hyena exhibits sublinear parameter scaling (in sequence length) and unrestricted context, similar to attention, while having lower time complexity.

  • In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

LongNet: Scaling Transformers to 1,000,000,000 Tokens

  • Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted.
  • This paper by Ding et al. from Furu Wei’s group at MSR introduces LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, they propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages:
    1. It has a linear computation complexity and a logarithm dependency between tokens;
    2. It can be served as a distributed trainer for extremely long sequences;
    3. Its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization.
  • The following figure from the paper illustrates the trend of Transformer sequence lengths over time.

  • Experiments results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
  • Github repo.

Extending Context Window of Large Language Models via Positional Interpolation

  • This paper by Chen et al. from Meta AI in 2023 presents Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B.
  • Meanwhile, the extended model by Position Interpolation preserve quality relatively well on tasks within its original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism.
  • They present a theoretical study which shows that the upper bound of interpolation is at least ∼600x smaller than that of extrapolation, further demonstrating its stability.
  • Models extended via Position Interpolation retain its original architecture and can reuse most pre-existing optimization and infrastructure.
  • The following figure from the paper illustrates the Position Interpolation method. Consider a Llama model pre-trained with a 2048 context window length. Upper left illustrates the normal usage of an LLM model: input position indices (blue dots) are within the pre-trained range. Upper right illustrates length extrapolation where models are required to operate unseen positions (red dots) up to 4096. Lower left illustrates Position Interpolation where we downscale the position indices (blue and green dots) themselves from [0, 4096] to [0, 2048] to force them to reside in the pretrained range.

  • From Damien Benveniste’s post:
    • The typical Transformer architecture is composed of Embeddings to encode the text input, multiple transformer blocks, and a prediction head specific to the learning task the LLM is used for.
    • To encode the text, we use a text embedding matrix \(T\) that has the size of the token vocabulary and a positional embedding \(P\) that encodes the position of the token in the input sequence. That position embedding size defines the context size. That embedding can be learned or it can be a simple sin function of the position index. Typically they are added together \(T + P\) such that the same word is encoded differently at positions \(i\) and \(j\).
    • The great thing about Llama 2 is that it uses Rotary Positional Embeddings (RoPE) as opposed to the typical sin function encoding. Each Attention layer is modified using that embedding and it ensures the computed attention between input tokens to be only dependent on the distance between those tokens. If token \(T_1\) is at position \(i\) and a token \(T_2\) at position \(j\), the attention \(A(T_1, T_2) = f(j - i)\) is a function of \(j - i\). The attention is not dependent on the specific token’s locations but on their relative positions.
    • The technique used by the paper to extend the context window is to interpolate at non-integer positions. Basically, if the original window size is \(L\), you can extend it to \(L'\) (with \(L' > L\)) by rescaling the integer positions as: \(i' = i * L / L'\).
    • As an example, if you wanted to have a text input of 16,384 tokens (so 4x the window size of Llama 2) into Llama 2, you would just need to divide every integer position by 4: \(i' = \frac{i}{4}\). To be clear, if you look at the implementation of Llama 2 available on GitHub (line 50 in model.py), you would just need to replace the following line of code:
      t = torch.arange(end, device=freqs.device) 


      t = torch.arange(end, device=freqs.device) / 4
    • How simple is that? Because the model was not trained for that position embedding, you would need to fine-tune a bit the model to adapt it to that new context window and position embedding. When we think that LLama 2 will most likely be used to be fine-tuned on private data, that is the icing on the cake to be able to dynamically adapt the context window to our needs as we fine-tune it.
    • They were able to extend LLama’s context window by 16 times while keeping the performance at the same level!

Recent techniques powering LLMs

  • Contemporary LLMs that have exhibited exceptional results on common NLP tasks such as paraphrasing, summarization, question answering, etc. utilize a repertoire of recent techniques under-the-hood that have enabled their exceptional capabilities, namely:
  • The following descriptions of models are from their respective project pages.


  • Read our LLaMA primer here.

Llama 2

  • Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) from Meta AI ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Their models outperform open-source chat models on most benchmarks we tested, and based on their human evaluations for helpfulness and safety, may be a suitable substitute for closed source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
  • Llama 2 is powered by Ghost Attention (GAtt), introduced in the paper, which improves multi-turn memory. From section 3.3 in the technical report:
    • “In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly, or to “act as” some public figure. When we provided such instructions to Llama 2-Chat, the subsequent response should always respect the constraint. However, our initial RLHF models tended to forget the initial instruction after a few turns of dialogue, as illustrated in the below figure (left) which shows that issues with multi-turn memory (left) can be improved with GAtt (right).

    • To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation (Bai et al., 2022) that hacks the fine-tuning data to help the attention focus in a multi-stage process. GAtt enables dialogue control over multiple turns, as illustrated in the figure above (right).
    • GAtt Method: Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages \(\left[u_1, a_1, \ldots, u_n, a_n\right]\), where \(u_n\) and \(a_n\) correspond to the user and assistant messages for turn \(n\), respectively. Then, we define an instruction, inst, that should be respected throughout the dialogue. For example, inst could be “act as.” We can then synthetically concatenate this instruction to all the user messages of the conversation.
    • Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample. To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages.
    • For the training instructions, we created a few synthetic constraints to sample from: Hobbies (“You enjoy e.g. Tennis”), Language (“Speak in e.g. French”), or Public Figure (“Act as e.g. Napoleon”). To obtain the lists of hobbies and public figures, we asked Llama 2-Chat to generate it, avoiding a mismatch between the instruction and model knowledge (e.g., asking the model to act as someone it had not encountered during training). To make the instructions more complex and diverse, we construct the final instruction by randomly combining the above constraints. When constructing the final system message for the training data, we also modify the original instruction half of the time to be less verbose, e.g., “Always act as Napoleon from now”-> “Figure: Napoleon.” These steps produce an SFT dataset, on which we can fine-tune Llama 2-Chat.
    • GAtt Evaluation: We applied GAtt after RLHF V3. We report a quantitative analysis indicating that GAtt is consistent up to 20+ turns, until the maximum context length is reached (see Appendix A.3.5 in the paper). We tried to set constraints not present in the training of GAtt at inference time, for instance “Always answer with Haiku,” for which the model was found to remain consistent.
    • To illustrate how GAtt helped reshape attention during fine-tuning, we display the maximum attention activations of the model in Figure 10. The left-hand side of each figure corresponds to the system message (“Act as Oscar Wilde”). From the figure above, we can see that the GAtt-equipped model (right) maintains large attention activations with respect to the system message for a larger portion of the dialogue, as compared to the model without GAtt (left).
    • Despite its utility, the current implementation of GAtt is vanilla, and more development and iteration on this technique could likely further benefit the model. For instance, we could teach the model to change the system message during the conversation by integrating such data during fine-tuning.”
  • Another important aspect that is highlighted in the report is the effect of RLHF on Llama 2, and this graph from Meta’s paper shows how high-quality human preferences data (obtained from Surge AI) keeps on improving Llama 2 – without saturation.

  • They also call out the importance of supervised fine-tuning (SFT) data quality (in the “quality is all you need” section) – it’s not about volume, but diversity and quality.
  • From Linxi Fan’s notes:
    • Llama-2 likely costed $20M+ to train. Meta has done an incredible service to the community by releasing the model with a commercially-friendly license. AI researchers from big companies were wary of Llama-1 due to licensing issues, but now many of them will jump on the ship and contribute their firepower.
    • Meta’s team did a human study on 4K prompts to evaluate Llama-2’s helpfulness. They use “win rate” as a metric to compare models, in similar spirit as the Vicuna benchmark. 70B model roughly ties with GPT-3.5-0301, and performs noticeably stronger than Falcon, MPT, and Vicuna. These real human ratings should be trusted more than academic benchmarks, because they typically capture the “in-the-wild vibe” better.
    • Llama-2 is not yet at GPT-3.5 level, mainly because of its weak coding abilities. On “HumanEval” (standard coding benchmark), it isn’t nearly as good as StarCoder or many other models specifically designed for coding. That being said, I have little doubt that Llama-2 will improve significantly thanks to its open weights.
    • Meta’s team goes above and beyond on AI safety issues. In fact, almost half of the paper is talking about safety guardrails, red-teaming, and evaluations. A round of applause for such responsible efforts!
    • In prior works, there’s a thorny trade-ff between helpfulness and safety. Meta mitigates this by training 2 separate reward models. They aren’t open-source yet, but would be extremely valuable to the community.
    • Llama-2 will dramatically boost multimodal AI and robotics research. These fields need more than just blackbox access to an API.
    • So far, we have to convert the complex sensory signals (video, audio, 3D perception) to text description and then feed to an LLM, which is awkward and leads to huge information loss. It’d be much more effective to graft sensory modules directly on a strong LLM backbone.
    • The whitepaper itself is a masterpiece. Unlike GPT-4’s paper that shared very little info, Llama-2 spelled out the entire recipe, including model details, training stages, hardware, data pipeline, and annotation process. For example, there’s a systematic analysis on the effect of RLHF with nice visualizations. Quote sec 5.1: “We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.”
  • The following figure from the paper shows the training of Llama 2-Chat: This process begins with the pretraining of Llama 2 using publicly available online sources. Following this, they create an initial version of Llama 2-Chat through the application of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.

  • Summary:
    • Llama 2 is available for free (including commercial license).
    • Llama 2 can be accessed via managed services in Azure and AWS.
    • Llama is trained on 2B tokens, with 4 variants, ranging from 7-70B parameters.
    • Llama is intended to be used in English, with almost 90% of the pre-training data being in English.
    • The commercial license specifies a number of harmful use cases that violate the license, including spam!
    • Llama 2 is very comparable to ChatGPT 3.5 in most benchmarks (particularly, it beats ChatGPT in human evaluation on helpfulness: Win 36%; Tie 32%; Loss 32%) other than coding, looking at the data mix coding data is still quite small (classified under the - unknown language category)
    • Llama 2 outperforms all other open-source models including Falcon and MPT, and has three variants including 7B, 13B, and 70B; the 70B variant achieves top performance across the board.
    • Benchmarks were done both on standardized ones (like MMLU) and head to head competition against other models, including PaLM-2 Bison and ChatGPT 3.5.
    • A large portion of the paper focuses on RLHF improvements and objectives which is super neat.
    • Model toxicity and evaluation is another large focus, including evaluations like red-teaming which were found in the Claude 2 model card. Generally Llama 2 performed very well with fewer safety violations than ChatGPT in human evaluations.
    • The tokenizer is the same as Llama 1 which is interesting, but the context length is now 4k, double the original 2k!
    • There’s both a regular and chat variation, as has been the trend in recent papers.
    • Llama 2 (with fine tuning) offers better domain-specificity via fine-tuning at lower cost, and better guardrails.
    • Llama 2 is trained on 40% more data than Llama 1 and performs well against benchmarks.
    • In short: companies can create their own enterprise “ChatGPT” (without sharing any data with OpenAI).
  • Quantized Llama 2 weights are available for local inference here.

  • The following diagram presents summarizes the key graphs/tables of the Llama 2 paper:

  • The following infographic (source) presents an overview of Llama 2:

  • Demo; HuggingFace repo; Project page.

  • Related: llama2.c
    • The quest for running LLMs on a single computer landed Andrej Karpathy to embark on a weekend project to create a simplified version of the Llama 2 model, informally called TinyLlama or BabyLlama.
    • Based on llama.cpp, this is a bare-bones project with the Llama 2 architecture hard-coded, FP32 precision, one inference file of pure C with no dependencies.
    • “With the code in this repo you can train the Llama 2 LLM architecture from scratch in PyTorch, then export the weights to a binary file, and load that into one ~simple 500-line C file (run.c) that inferences the model. You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough.” (source)
    • Basically, it is nanoGPT, tuned to implement the Llama 2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in run.c,” explained Karpathy in Llama2.c GitHub repository. His objective was to implement nanoGPT into Llama 2 architecture, instead of GPT within C programming language. The repository has already got 2.2K stars.
    • The success of Karpathy’s approach lies in its ability to achieve highly interactive rates, even with reasonably sized models containing a few million parameters and trained on a 15 million parameter model of the TinyStories dataset.
    • On a M1 MacBook Air, the Llama 2 model with ~15 million parameters can infer at around 100 tokens per second in fp32, all through the C code he developed.
    • This surprising result demonstrates the feasibility of running complex models on resource-constrained devices with a straightforward implementation.
  • Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper evaluates Llama 2 for summarization and obtains stellar results compared to GPT-4, with investigations around the issues of LLMs not following instructions and ordering bias.


  • Read our GPT-4 primer here.
  • Per a rumor, GPT-4 might be an 8-way Mixture-of-Experts (MoE) model with 8 220B parameters (a total of 1.76T parameters).
  • A Mixture of Experts (MoE) model essentially revolves around a router that directs questions to the appropriate expert. If GPT-4 does adopt the MoE approach, it would consist of eight specialist models each trained in a specific domain, like mathematics, history, storytelling, etc. When a question is posed, the router analyses it and seamlessly forwards it to the most suitable expert.
  • The concept of MoE is quite prevalent (refer Outrageously Large Neural Networks: the Sparsely-Gated Mixture-of-Experts Layer), with Langchain’s high-level implementation of an LLMRouterChain, and notable low-level integrated examples like Google’s Switch Transformer (refer Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity).
  • Per yet another rumor, here are the specifics:
    • Parameter count: GPT-4 is more than 10x the size of GPT-3; with a total of ~1.8 trillion parameters across 120 layers.
    • Architecture: GPT-4 uses an MoE architecture; the main idea behind used an MoE model was to keep costs training/inference reasonable while ensuring great performance. In other words, it is not a dense transformer like, for instance, PaLM (or GPT-3). They utilizes 16 experts within their model, each is about ~111B parameters for MLP. 2 of these experts are routed per forward pass. There roughly ~55B shared parameters for attention.
    • **MoE Routing: While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4 model.
    • Inference: Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model (vs. the MoE architecture that’s used).
    • Dataset: GPT-4 is trained on ~13T tokens. These are not unique tokens, but the total amount of tokens seen over all epochs. There are millions of instruction fine-tuning data samples from ScaleAI & internally (probably acquired through ChatGPT + their API before they changed the policy).
    • Training epochs: 2 epochs for text-based data and 4 for code-based data.
    • Training paradigm: For pre-training GPT-4 32K, they utilized an 8K context length. The 32K context version of GPT-4 was based on fine-tuning of the 8K after the pre-training. Extending context is hard… but not impossible is a good reference on how to achieve this.
    • Batch Size: The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million! This, of course, is “only” a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens. For the real batch size:** Divide this number by the context width to get the real batch size.
    • Parallelism strategies: To parallelize across all their A100s GPUs, they utilized 8-way tensor parallelism as that is the limit for NVLink. Beyond that, they used 15-way pipeline parallelism. Also apparently they used DeepSpeed ZeRo Stage 1 or block-level FSDP.
    • Training cost: OpenAI’s training FLOPS for GPT-4 is ~2.15e25, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from. If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million. Had H100s been used, pre-training could be done with ~8,192 H100s in ~55 days for $21.5 million at $2 per H100 hour.
    • Mixture of Expert tradeoffs: There are multiple MoE tradeoffs taken; for example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation. This means some parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates. Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that’s purely research. There are multiple reasons to go with fewer experts. One reason for OpenAI choosing 16 experts is because more experts are difficult to generalize at many tasks. More experts can also be more difficult to achieve convergence with. With such a large training run, OpenAI instead chose to be more conservative on the number of experts.
    • GPT-4 inference cost: GPT-4 costs 3x that of the 175B parameter DaVinci. This is largely due to the larger clusters required for GPT-4 and much lower utilization achieved. An estimate of it’s costs is $0.0049 cents per 1K tokens for 128 A100s to inference GPT-4 8K context width and $0.0021 cents per 1K tokens for 128 H100s to inference GPT-4 8K context width. It should be noted that they assume decent high utilization and keep batch sizes large. 9 Multi-Query Attention: GPT-4 uses MQA instead of MHA (MQA is a classic choice at this point). Because of that only 1 head is needed and memory capacity can be significantly reduced for the KV cache. Even then, the 32K context width GPT-4 definitely cannot run on 40GB A100s, and the 8K is capped on max batch size.
    • Continuous batching: OpenAI implements both variable batch sizes and continuous batching. This is so as to allow some level of maximum latency as well optimizing the inference costs.
    • Vision multi-modal: They have a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Google DeepMind’s Flamingo. This adds more parameters on top of the 1.8T text-only GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text only pre-training. On the vision model, OpenAI wanted to train it from scratch, but it wasn’t mature enough, so they wanted to derisk it by starting with text. One of the primary purposes of this vision capability is for autonomous agents able to read web pages and transcribe what’s in images and video. Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, YouTube videos: sampling frames, and run Whisper around it to get transcript.
    • Speculative decoding: OpenAI might be using speculative decoding on GPT-4’s inference. The idea is to use a smaller faster model to decode several tokens in advance, and then feeds them into a large oracle model as a single batch. If the small model was right about its predictions (i.e., the larger model agrees), we can decode several tokens in a single batch. But if the larger model rejects the tokens predicted by the draft model then the rest of the batch is discarded. And we continue with the larger model. The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.
      • Per Andrej Karpathy, speculative sampling/decoding/execution for LLMs is an excellent inference-time optimization. It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on \(K\) input tokens in a batch (for larger \(K\) than what might be obvious). This unintuitive fact is because sampling is heavily memory bound: most of the “work” is not doing compute, it is reading in the weights of the transformer from VRAM into on-chip cache for processing. So if you’re going to do all that work of reading in all those weights, you might as well apply them to a whole batch of input vectors.
        • At batch_size=1 (i.e. just generating a single stream of prediction on your computer), the inference is super duper memory-bound. The on-chip compute units are twiddling their thumbs while sucking model weights through a straw from DRAM. Every individual weight that is expensively loaded from DRAM onto the chip is only used for a single instant multiply to process each new input token. So the stat to look at is not FLOPS but the memory bandwidth.
        • Let’s take a look:
          • A100: 1935 GB/s memory bandwidth, 1248 TOPS
          • MacBook M2: 100 GB/s, 7 TFLOPS
        • The compute is ~200X but the memory bandwidth only ~20X. So the little M2 chip that could will only be about ~20X slower than a mighty A100. This is ~10X faster than you might naively expect just looking at ops.
        • The situation becomes a lot more different when you inference at a very high batch size (e.g. ~160+), such as when you’re hosting an LLM engine simultaneously serving a lot of parallel requests. Or in training, where you aren’t forced to go serially token by token and can parallelize across both batch and time dimension, because the next token targets (labels) are known. In these cases, once you load the weights into on-chip cache and pay that large fixed cost, you can re-use them across many input examples and reach ~50%+ utilization, actually making those FLOPS count.
        • In summary, why is LLM inference surprisingly fast on your MacBook? If all you want to do is batch 1 inference (i.e. a single “stream” of generation), only the memory bandwidth matters. And the memory bandwidth gap between chips is a lot smaller, and has been a lot harder to scale compared to flops.
      • The reason we can’t naively use this fact to sample in chunks of \(K\) tokens at a time is that every \(N^{th}\) token depends on what token we sample at time at step \(N-1\). There is a serial dependency, so the baseline implementation just goes one by one left to right.
      • Now the clever idea is to use a small and cheap draft model to first generate a candidate sequence of \(K\) tokens – a “draft”. Then we feed all of these together through the big model in a batch. This is almost as fast as feeding in just one token, per the above. Then we go from left to right over the logits predicted by the model and sample tokens. Any sample that agrees with the draft allows us to immediately skip forward to the next token. If there is a disagreement then we throw the draft away and eat the cost of doing some throwaway work (sampling the draft and the forward passing for all the later tokens).
      • The reason this works in practice is that most of the time the draft tokens get accepted, because they are easy, so even a much smaller draft model gets them. As these easy tokens get accepted, we skip through those parts in leaps. The hard tokens where the big model disagrees “fall back” to original speed, but actually a bit slower because of all the extra work.
      • In summary, this one weird trick works because LLMs are memory bound at inference time, in the “batch size 1” setting of sampling a single sequence of interest, that a large fraction of “local LLM” use cases fall into. And because most tokens are “easy”.
      • More on this here: Blockwise Parallel Decoding for Deep Autoregressive Models, Accelerating Large Language Model Decoding with Speculative Sampling, and Fast Inference from Transformers via Speculative Decoding
    • Inference architecture: The inference runs on a cluster of 128 GPUs. There are multiple of these clusters in multiple datacenters in different locations. It is done in 8-way tensor parallelism and 16-way pipeline parallelism. Each node of 8 GPUs has only ~130B parameters, or less than 30GB per GPU at FP16 and less than 15GB at FP8/int8. The model has 120 layers, so it fits in 15 different nodes. (Possibly the there are less layers on the first node since it needs to also compute the embeddings). According to these numbers: OpenAI should have trained on 2x the tokens if they were trying to go by Chinchilla’s optimal. This goes to show that they are struggling to get high quality data.
    • Why no Fully Sharded Data Parallel (FSDP)? A possible reason for this could be that some of the hardware infra they secured is of an older generation. This is pretty common at local compute clusters as the organisation usually upgrade the infra in several “waves” to avoid a complete pause of operation. With such a high amount of pipeline parallelism it is very likely that they suffer from the “batch bubble”: slight idle time between batches.
    • Dataset Mixture: They trained on 13T tokens. CommonCrawl & RefinedWeb are both 5T. Remove the duplication of tokens from multiple epochs and we get to a much reasonable number of “unaccounted for” tokens: the “secret” data – parts of it probably came from Twitter, Reddit, and YouTube. Some speculations are: LibGen (4M+ books), Sci-Hub (80M+ papers), all of GitHub. Part of the missing dataset could also be custom dataset of college textbooks collected by hand for as much courses as possible. This is very easy to convert to text form and than use Self-Instruct to transform it into instruction form. This creates the “illusion” that GPT-4 “is smart” no matter who uses it: for computer scientists, it can help you with your questions about P!=NP; for a philosophy major, it can totally talk to you about epistemology. There are also papers that try to extract by force memorized parts of books from GPT-4 to understand what it trained on. There are some books it knows so well that it had seen them for sure. Moreover, it even knows the unique ids of project Euler problems.


  • Claude 2 is Anthropic’s second-gen AI chatbot that’s cheaper, stronger, faster, can handle multiple PDFs, and supports longer conversations. It’s basically Anthropic’s answer to OpenAI’s GPT-4:
    • Claude 2 has 100K context. GPT-4 has 32K. So 3x more context.
    • Claude 2 is 4-5x cheaper than GPT-4-32k.
    • Claude 2’s knowledge cutoff is early 2023, while GPT-4 is late 2021. So more fresher knowledge.
  • You can easily upload large files (say, an 80-page 2022 Apple annual report) and ask for a summary, key takeaways, provide financial projections (not 100% there yet), and more.
  • Furthermore, you can import several documents into Claude 2 and perform conceptual blending by asking about the relationship between the concepts found in each document.


  • Stanford’s Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On their preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce.


  • Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases. The cost of training Vicuna-13B was around $300.
  • Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model’s maximum context length.
  • Their training recipe builds on top of Stanford’s Alpaca with the following improvements.
    • Memory Optimizations: To enable Vicuna’s understanding of long context, we expand the max context length from 512 in alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.
    • Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot’s output.
    • Cost Reduction via Spot Instance: The 40x larger dataset and 4x sequence length for training poses a considerable challenge in training expenses. We employ SkyPilot managed spot to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.


  • StableVicuna is the first large-scale open source chatbot trained via reinforced learning from human feedback (RLHF). StableVicuna is a further instruction fine tuned and RLHF trained version of Vicuna v0 13b, which is an instruction fine tuned LLaMA 13B model.

Dolly 2.0

  • Dolly 2.0 is the world’s first open source, instruction-tuned LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.
  • They have also released the databricks-dolly-15k dataset:
    • databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
    • This dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT. databricks-dolly-15k was authored by more than 5,000 Databricks employees during March and April of 2023. These training records are natural, expressive and designed to represent a wide range of the behaviors, from brainstorming and content generation to information extraction and summarization.


  • Stability AI’s StableLM series of language models. Github repo. StableLM comes in two variants: StableVicuna and StableLM-Alpha.
  • StableVicuna is an RLHF fine-tune of Vicuna-13B v0, which itself is a fine-tune of LLaMA-13B. It is our attempt at creating an open-source RLHF LLM Chatbot.
  • StableLM-Alpha models are trained on the new dataset that build on The Pile, which contains 1.5 trillion tokens, roughly 3x the size of The Pile. These models will be trained on up to 1.5 trillion tokens. The context length for these models is 4096 tokens. The models range from 3B to 175B parameters. As a proof-of-concept, we also fine-tuned the model with Stanford Alpaca’s procedure using a combination of five recent datasets for conversational agents: Stanford’s Alpaca, Nomic-AI’s gpt4all, RyokoAI’s ShareGPT52K datasets, Databricks labs’ Dolly, and Anthropic’s HH. We will be releasing these models as StableLM-Tuned-Alpha.


  • OpenLLaMA is a permissively licensed open source reproduction of Meta AI’s LLaMA large language model. They have released a 7B and 3B model trained on 1T tokens, as well as the preview of a 13B model trained on 600B tokens. They provide PyTorch and JAX weights of pre-trained OpenLLaMA models, as well as evaluation results and comparison against the original LLaMA models.


  • MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. This model was trained by MosaicML.
  • MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.
  • These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases (ALiBi). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA’s FasterTransformer.
  • MPT-7B is licensed for the possibility of commercial use (unlike LLaMA).
  • MPT-7B is trained on a large amount of data (1T tokens like LLaMA vs. 300B for Pythia, 300B for OpenLLaMA, and 800B for StableLM).
  • MPT-7B is prepared to handle extremely long inputs, due to ALiBi (they finetuned MPT-7B-StoryWriter-65k+ on up to 65k inputs and can handle up to 84k vs. 2k-4k for other open source models).
  • MPT-7B is capable of fast training and inference (via FlashAttention and FasterTransformer)
  • MPT-7B is equipped with highly efficient open-source training code via the llm-foundry repository.


The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

  • Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon.
  • This paper by Penedo et al. from the Falcon LLM team shows that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.
  • The following figure from the paper shows subsequent stages of Macrodata Refinement remove nearly 90% of the documents originally in CommonCrawl. Notably, filtering and deduplication each result in a halving of the data available: around 50% of documents are discarded for not being English, 24% of remaining for being of insufficient quality, and 12% for being duplicates. We report removal rate (grey) with respect to each previous stage, and kept rate (shade) overall. Rates measured in % of documents in the document preparation phase, then in tokens.

  • The following figure from the paper shows models trained on REFINEDWEB alone outperform models trained on curated corpora. Zero-shot performance on our main-agg task aggregate. At equivalent compute budgets, our models significantly outperform publicly available models trained on The Pile, and match the performance of the GPT-3 models when tested within our evaluation setup.

  • Related: Falcon LLM details –
    1. Data is key! As noted in the abstract and benchmarks, Falcon performs very well due to the data refinement techniques used. A key theme through the paper
    2. Falcon follows a very close scaling law to GPT-3, with the authors of the paper testing Babbage and Currie (two smaller variants). They measure it using the Eleuther AI evaluation harness.
    3. Data Filtering and deduplication is key. Starting with Common Crawl, they apply a 7 part pipeline including URL dedup, website specific actions, document dedup and line by line dedup.
    4. The final dataset is only \(\frac{1}{9}^{th}\) of the Common Crawl original!
    5. The team conducts several tests on 1B and 3B param models to validate their data cleaning hypothesis. C4 is still an excellent dataset, but Refined web outperforms The Pile and Oscar, which have duplications.
    6. After blocking NSFW urls, the dataset/model toxicity matches that of The Pile, indicating that more work can be done to further decrease it. Minimal work was done on investigating social and other biases.
    7. The team open sourced a 600B subset from their 5000B Token dataset.


  • The RedPajama project aims to create a set of leading open-source models and to rigorously understand the ingredients that yield good performance. In April 2023, they released the RedPajama base dataset based on the LLaMA paper, which has worked to kindle rapid innovation in open-source AI.
  • The 5 terabyte dataset has been downloaded thousands of times and used to train over 100 models!
  • They’ve trained 3B and 7B models on the Summit supercomputer, in collaboration with AAI CERC lab at Université de Montréal, EleutherAI & LAION for compute time on Summit within the INCITE program award “Scalable Foundation Models for Transferable Generalist AI”.
  • They have released the v1 versions of the RedPajama-INCITE family of models, including instruct-tuned and chat versions under the Apache 2.0 license.
    • RedPajama-INCITE-7B-Instruct is the highest scoring open model on HELM benchmarks, making it ideal for a wide range of tasks. It outperforms LLaMA-7B and state-of-the-art open models such as Falcon-7B (Base and Instruct) and MPT-7B (Base and Instruct) on HELM by 2-9 points.
    • RedPajama-INCITE-7B-Chat is available in OpenChatKit, including a training script for easily fine-tuning the model and is available to try now! The chat model is built on fully open-source data and does not use distilled data from closed models like OpenAI’s – ensuring it is clean for use in open or commercial applications.
    • RedPajama-INCITE-7B-Base was trained on 1T tokens of the RedPajama-1T dataset and releases with 10 checkpoints from training and open data generation scripts allowing full reproducibility of the model. This model is 4 points behind LLaMA-7B, and 1.3 points behind Falcon-7B/MPT-7B on HELM.
  • Project page.


  • Orca 13B, a small yet mighty AI model developed by Microsoft that’s making waves in the AI community. Despite its size, Orca 13B is proving that it can stand toe-to-toe with the giants, demonstrating capabilities that rival even the larger foundation models (LFMs) like ChatGPT and GPT-4.
  • Orca 13B’s progressive learning approach is a cornerstone of its success. By learning from rich signals from GPT-4, including explanation traces, step-by-step thought processes, and other complex instructions, Orca is able to develop a deeper understanding of the reasoning process. This is a significant departure from traditional AI models, which often focus on imitating the style of LFMs but fail to capture their reasoning process.
  • The use of explanation traces, for instance, allows Orca to understand the underlying logic behind the responses generated by GPT-4. This not only enhances Orca’s ability to generate accurate responses, but also enables it to understand the context and nuances of different scenarios, thereby improving its overall performance.
  • Furthermore, the role of ChatGPT as a teacher assistant is crucial in providing a supportive learning environment for Orca. By providing guidance and feedback, ChatGPT helps Orca refine its learning process and improve its understanding of complex instructions. This teacher-student dynamic is a key factor in Orca’s ability to imitate the reasoning process of LFMs.
  • Orca’s performance in various benchmarks is a testament to its capabilities. In complex zero-shot reasoning benchmarks like Big-Bench Hard (BBH) and AGIEval, Orca surpasses conventional state-of-the-art instruction-tuned models such as Vicuna-13B by more than 100% and 42% respectively. This is a significant achievement, considering that these benchmarks are designed to test the model’s ability to reason and make decisions in complex scenarios.
  • One of the most remarkable aspects of Orca is its size. Despite being a smaller AI model compared to giants like ChatGPT, Orca manages to perform at the same level. This is a significant breakthrough in technology as it demonstrates that powerful AI models can be built by smaller teams, making AI development more accessible.
  • The size of Orca also has implications for its efficiency and scalability. Being a smaller model, Orca requires less computational resources to train and operate, making it a more sustainable and cost-effective solution for AI development. Furthermore, its smaller size makes it easier to scale and adapt to different applications, thereby increasing its versatility and utility.
  • Paper.


  • A series of 7B LLMs named XGen-7B from Salesforce with standard dense attention on up to 8K sequence length for up to 1.5T tokens. We also fine tune the models on public-domain instructional data. The main take-aways are:
    • On standard NLP benchmarks, XGen achieves comparable or better results when compared with state-of-the-art open-source LLMs (e.g. MPT, Falcon, LLaMA, Redpajama, OpenLLaMA) of similar model size.
    • Our targeted evaluation on long sequence modeling benchmarks show benefits of our 8K-seq models over 2K- and 4K-seq models.
    • XGen-7B archives equally strong results both in text (e.g., MMLU, QA) and code (HumanEval) tasks.
    • Training cost of $150K on 1T tokens under Google Cloud pricing for TPU-v4.
  • Project page; Github repo; Model checkpoint.


  • OpenLLMs is a series of open-source language models fine-tuned on a small, yet diverse and high-quality dataset of multi-round conversations. Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations. Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance.
  • Generic Models:
    • OpenChat: based on LLaMA-13B with a context length of 2048.
      • Achieves 105.7% of ChatGPT score on the Vicuna GPT-4 evaluation.
      • Achieves 80.9% win-rate on AlpacaEval.
    • OpenChat-8192: based on LLaMA-13B, with an extended context length of 8192.
      • Achieves 106.6% of ChatGPT score on the Vicuna GPT-4 evaluation.
      • Achieves 79.5% win-rate on AlpacaEval.
  • Code Models:
    • OpenCoderPlus: based on StarCoderPlus with a native context length of 8192.
      • Achieves 102.5% of ChatGPT score on the Vicuna GPT-4 evaluation.
      • Achieves a 78.7% win-rate on AlpacaEval.
  • Dataset:


  • LlongMA-2, a suite of Llama-2 models, trained at 8k context length using linear positional interpolation scaling. The model was trained in collaboration with @theemozilla of NousResearch and Kaiokendev1, by extending the context length of the Llama-2 7b model through fine-tuning. The models maintain the same perplexity at 8k extrapolation surpassing the performance of other recent methodologies.
  • The model has similar performance to LLaMA 2 under 4k context length, performance scales directly to 8k, and works out-of-the-box with the new version of transformers (4.31) or with trust_remote_code for <= 4.30.


  • Qwen is a Transformer-based LLM by Alibaba, opensourced as two variants: Qwen-7B and Qwen-7B-Chat. Qwen-7B is pre-trained on self-constructed large-scale high-quality dataset of over 2.2 trillion tokens created using web texts, books, codes, etc.
  • In comparison with similar-sized models, Qwen outperforms the competitors on a series of benchmarks that evaluates natural language understanding, mathematics, coding, etc. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and it is capable of playing as an agent!
  • The features of the Qwen-7B series include:
    • Trained with high-quality pretraining data. Qwen-7B is pretrained on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and codes, and it covers a wide range of domains, including general domain data and professional domain data.
    • Strong performance. In comparison with the models of the similar model size, we outperform the competitors on a series of benchmark datasets, which evaluates natural language understanding, mathematics, coding, etc.
    • Better support of languages. Qwen’s tokenizer, based on a large vocabulary of over 150K tokens, is a more efficient one compared with other tokenizers. It is friendly to many languages, and it is helpful for users to further finetune Qwen-7B for the extension of understanding a certain language.
    • Support of 8K Context Length. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.
    • Support of Plugins. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and it is capable of playing as an agent.

Mistral 7B

  • Mistral 7B is a 7.3B parameter model that:
    • Outperforms Llama 2 13B on all benchmarks
    • Outperforms Llama 1 34B on many benchmarks
    • Approaches CodeLlama 7B performance on code, while remaining good at English tasks
    • Uses Grouped-query attention (GQA) for faster inference
    • Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost
  • Mistral 7B uses a sliding window attention (SWA) mechanism (Child et al., Beltagy et al.), in which each layer attends to the previous 4,096 hidden states. The main improvement, and reason for which this was initially investigated, is a linear compute cost of O(sliding_window.seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for sequence length of 16k with a window of 4k.
  • Sliding window attention exploits the stacked layers of a transformer to attend in the past beyond the window size: A token i at layer k attends to tokens [i-sliding_window, i] at layer k-1. These tokens attended to tokens [i-2*sliding_window, i]. Higher layers have access to information further in the past than what the attention patterns seems to entail.

  • Finally, a fixed attention span means we can limit our cache to a size of sliding_window tokens, using rotating buffers (read more in their reference implementation repo). This saves half of the cache memory for inference on sequence length of 8192, without impacting model quality.
  • Mistral 7B has been released under the Apache 2.0 license, it can be used without restrictions.
  • Download it and use it anywhere (including locally) with their reference implementation.
  • Deploy it on any cloud (AWS/GCP/Azure), using vLLM inference server and skypilot.
  • Use it on HuggingFace.
  • Mistral 7B is easy to fine-tune on any task. As a demonstration, they’re providing a model fine-tuned for chat, which outperforms Llama 2 13B chat.


  • LLaVA with support for LLaMA-2, LoRA training with academia GPUs, higher resolution (336x336), 4-/8- inference, and more!
  • Project page; Demo; Github repo.


  • Introduced in Flamingo: a Visual Language Model for Few-Shot Learning, Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs.
  • The key ideas behind Flamingo are:
    • Interleave cross-attention layers with language-only self-attention layers (frozen).
    • Perceiver-based architecture that transforms the input sequence data (videos) into a fixed number of visual tokens.
    • Large-scale (web) multi-modal data by scraping webpages which has inter-leaved text and images.
  • Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities.
  • They perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering.
  • For tasks lying anywhere on this spectrum, they demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.


  • An open source version of DeepMind’s Flamingo model! They provide a PyTorch implementation for training and evaluating OpenFlamingo models as well as an initial OpenFlamingo 9B model trained on a new Multimodal C4 dataset.


  • Introduced in Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, the Qwen-VL series are a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction.
  • The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate the Qwen-VL outperforms existing Large Vision Language Models (LVLMs).
  • They present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence.
  • The following figure from the paper shows that Qwen-VL achieves state-of-the-art performance on a broad range of tasks compared with other generalist models.

  • The following figure from the paper shows some qualitative examples generated by Qwen-VL-Chat. Qwen-VL-Chat supports multiple image inputs, multi-round dialogue, multilingual conversation, and localization ability.

  • The following figure from the paper shows the training pipeline of the Qwen-VL series.

LangChain: Build apps with LLMs

  • LangChain is an open-source framework to build applications with LLMs. It enhances the capabilities of LLMs by providing a standard interface for prompt templates and integration of language models with different APIs and external databases. It can be used to build a chatbot, a Q&A answering platform, and any intelligent applications that can understand natural language and respond to user requests in real-time.
  • Let’s consider an example of you wanting to build a chatbot for a particular company.
  • First, you need to make your text data discoverable by the LLM. The typical way to do that is to index your data into a vector database. You’ll need to partition your data into chunks and encode those chunks into embeddings using a LLM and index those chunks using those embeddings. When you have a question, you can encode the question into another embedding and search in that database for the corresponding piece of data with the highest cosine similarity metric. By feeding that piece of data within a prompt, the LLM can more confidently recover the right answer.
  • You might need to insert users’ questions within a prompt template with additional examples to provide the right context for the LLM, or you might need to augment it by asking the LLM to establish a plan of action to accurately respond to the question. Often, chaining multiple (prompt, answer) pairs will be necessary to arrive at the desired outcome.
  • There are multiple tools that a LLM can use: Google Search, Python REPL, GraphQL, Wikipedia, etc. We need to prompt the LLM with a set of tools it can use and how to utilize the result to answer the question if it decides to use one of them. Additionally, we may want to keep track of the history for the LLM to remember what was previously discussed. This needs to be coordinated with the data search in the vector database that may happen in parallel.
  • When you take into account the initial question, prompt templates, prompt augmentation, data search, tools, plan of action, and memory, we start to understand the difficulty in juggling all those items when we need to construct the right prompts for meaningful results. LangChain is most likely one of the most powerful LLMops tools today! It provides an abstracted interface to chain all those items together in a simple manner to build applications. It has an API connection to ~40 of the public LLMs, Chat and embedding models. It integrates with more than 30 different tools and 20 different vector databases.
  • The following flowchart offers a visual summary of the entire process (source).


  • Credits for the following section go to Sonali Pattnaik.
  • Langchain is a popular AI framework for the fast prototyping of AI applications pipelines centered around Large Language Models (LLMs). It is used for document summarization, question answering, chat over documents, and many more applications.
  • The six major components of LangChain are Prompt Templates, LLMs, Agents, Memory, Indexes, and Chains as described below:
    1. Prompt templates: Prompt templates are a way to generate prompts consistently and reproducibly. The template is a ‘text string’ that the user can customize in multiple ways. The prompt templates can consist of instructions to LLMs, a few shot examples, etc.
    2. LLMs: LangChains provide a standard interface to connect to a number of LLMs (OpenAI, Hugging Face, Cohere, Anthropic, and many more) out there. It can also connect to all major cloud providers like Azure, Amazon, and Google Cloud.
    3. Agents: LangChain Agents use #LMs to decide an action sequence for task completion (e.g., in AutoGPT, BabyAGI, etc.).
      • What are LangChain Agents? LangChain Agents employ an LLM as a reasoning mechanism to connect apps to the outside world based on user input. Instead of using a predetermined chain of calls to LLMs/other tools, it uses an LLM to decide a potentially new chain to use that depends on the user’s input.
      • Why use Agents? Agents enable the integration of language models with external data sources and computation, such as search APIs and databases. Agents offer enhanced flexibility and capability compared to simple language model-tool connections, enabling better handling of edge cases and multi-hop tasks.
    4. Memory: “Memory” refers to the ability of an agent (such as a chatbot or language model) to retain information about previous interactions with the user. By default, agents are stateless, which means that each new incoming query or message is processed independently, without any knowledge or recollection of past interactions.
      • Popular types:
        • ConversationBufferMemory
        • ConversationBufferWindowMemory
        • ConversationTokenBufferMemory
        • ConversationSummaryMemory
        • ConversationKnowledgeGraphMemory
    5. Indexes: Indexes are used to structure documents for interaction with LLMs. Indexes are commonly used for “retrieval” to find the most relevant documents for a user’s query.
    6. Chains: LLMs work great for building simple applications. However, for building more complex applications one can leverage LangChain. The ‘main’ idea of LangChain is the ability to chain different components together so that the user can build intricate and interconnected applications that leverage LLMs.
    • The following infographic (source) illustrates the aforementioned components:


YouTube videos

Deploying LangChain

LangChain with MLFlow

LangChain with NeMo Guardrails (building safe and secure apps)

LangFlow - GUI for LangChain


  • Flowise is an open-source drag & drop tool to build LLM-powered apps, with use-cases such as (but not limited to):
    • Chat with your PDF and Excel files.
    • Launch customer support chatbots.
    • Turn user input into a full essay through multiple prompts (LLM chain).
    • Explore your codebase via a customized chatbot.


  • ragas is an evaluation framework for Retrieval Augmented Generation (RAG) pipelines using a bunch of diverse metrics.
  • ragas provides you with the tools based on the latest research for evaluating LLM generated text to give you insights about your RAG pipeline. ragas can be integrated with your CI/CD to provide continuous check to ensure performance.


Attention Manipulation to Steer LLM Output

  • OpenAI and Anthropic have obscured the output probabilities of their LLM’s generated tokens. This affects the interpretability (and thus the safety) of the generated output since “adapting” it at the user-level is impossible without knowing the statistics behind the generation.
  • For instance, Aleph Alpha (a German LLM company) recently announced a feature where you can control the emphasis of individual words in a prompt via highlighting to change their effect on the output.
  • It is worth noting that something like this was available through the original GPT-3 API (through manipulating the tokens’ log-probabilities). However, it wasn’t offered in an easy format like Aleph Alpha’s, nor with this level of manipulation. OpenAI and Anthropic removed the feature in the newest releases (for GPT-4 and Claude).
  • Aleph Alpha have a few other user-level explainability features which are very cool, like info on which specific tokens influenced which part of the output and to what extent. These tools are important for prompt-craft: they take a bit of the intuition out of process in a useful way, as you know upfront which tokens will be more or less impactful to change outputs.
  • This is also a good example of how safety and interpretability research actually lead to product improvements, without need for gain of function through model size. If users have more tools for guiding the model predictably, a model of given size becomes more powerful, while becoming more safe at the same time. This only can happen through researching more about the internals of the model. In this way safety research is profoundly compatible with, and an accelerant for, increasingly powerful product experiences.
  • Here’s their paper on Attention Manipulation (AtMan) that talks about suppressing/amplifying the attention of tokens to steer the model’s output. We can also build upon this idea to suppress high entropy words and thus automatically change the next set of tokens and avoid hallucination.

Strategies to get better results using prompt engineering

  • Here are some tactics to get better results when prompting an LLM (summarized from source):
    • Write clear instructions: the more specific you are with your instructions, the better the output. This could involve specifying the desired length and format or asking the model to adopt a certain persona, such as a world-class nutritionist or sarcastic writer.
    • Provide reference text: offering reference text can guide the model towards more precise and less hallucinated outputs, much like a student using study notes for an exam.
    • Break down complex tasks: deconstructing complex tasks into smaller, manageable subtasks can reduce error rates and improve results. For example, an inbound support request can be first addressed with an API call to categorize the user’s message, followed by another call to generate the response based on the category.
    • Encourage the model to think: asking an LLM to outline its thinking process can help the model reason its way toward more accurate responses.
    • Leverage external tools: complement GPT’s capabilities by using external tools, such as a text retrieval system or a code execution engine. For example, generating Python code to perform math calculations. You can even use GPT to generate the code that calls external APIs.
    • Evaluate changes systematically: getting to a performant prompt requires multiple iterations. Establish a comprehensive test suite to ensure changes improve performance and compare model outputs against benchmark answers. The benchmark should represent real-world usage, contain numerous test cases for statistical power, and be easy to automate or repeat. Evaluations can be performed by computers (other LLMs), humans, or a combination of both.

Further Reading

Llama 2: Responsible Use Guide

  • The Llama 2 Responsible Use Guide offers resources and best practices on how to build downstream LLM-powered products more responsibly.
  • It contains tips for fine-tuning the models, data preparation, mitigating risks, evaluation, red teaming, and a bunch of other resources useful for developers, especially on topics such as dealing with safety issues, hallucinations, adversarial attacks, etc.

Extended Guide: Instruction-tune Llama 2

  • This blog post from Philipp Schmid is a step-by-step guide on instruction-tuning Llama 2.
  • The idea of the blog post is to focus on creating the instruction dataset, which we can then use to fine-tune the base model of Llama 2 to follow our instructions.
  • The extended guide covers:
    • Define the use case and create a prompt template for instructions.
    • Create an instruction dataset.
    • Add Flash Attention and QLoRA for faster and more efficient training.
    • Instruction-tune Llama 2 using TRL.
    • Test the Model and run Inference.
  • The guide walks you through a detailed example and where you learn how to fine-tune Llama 7B to generate synthetic instructions, e.g., provide an email to get instructions that could have been used to generate this email. This model can then be used to generate synthetic data for personalizing LLMs, e.g., mimic your email writing.