• Retrieval-Augmented Generation (RAG) is a technique that enhances language model generation by incorporating external knowledge.
  • This is typically done by retrieving relevant information from a large corpus of documents and using that information to inform the generation process.
  • There are a few different approaches to integrating this retrieval step, which we will look at below.


  • In numerous instances, clients possess extensive proprietary documents, such as technical manuals, and require the extraction of specific information from this voluminous content. This task can be likened to locating a needle in a haystack.
  • Recently, OpenAI introduced a new model, GPT-4 Turbo, which can process large documents and could potentially address this need. However, this model is not entirely effective due to the “Lost In The Middle” phenomenon. Much like reading the Bible in its entirety but struggling to recall what follows the Book of Samuel, the model tends to forget content located towards the middle of its context window.
  • To circumvent this limitation, an alternative approach known as Retrieval-Augmented Generation (RAG) has been developed. This method involves creating an index for every paragraph in the document. When a query is made, the most pertinent paragraphs are quickly identified and then fed into a Large Language Model (LLM) like GPT-4. Providing only select paragraphs, as opposed to the entire document, prevents information overload within the LLM and significantly enhances the quality of the results.
  • A simple ‘needle in a haystack’ analysis can test the in-context retrieval ability of long-context LLMs. This can be done using the Needle In A Haystack - Pressure Testing LLMs library. The following plot shows OpenAI’s GPT-4-128K’s (top) and (bottom) performance with varying context length.

Neural Retrieval

  • Before we jump into RAG, let’s take a moment to talk about neural retrievers holistically.
  • Neural retrievers are a type of information retrieval model that uses neural networks to match queries to relevant documents. They encode the query and documents into dense vector representations and compute similarity scores between them. This allows them to go beyond lexical matching and capture semantic relevance.
  • Neural retrievers represent a significant shift from traditional keyword-based information retrieval systems to ones that can understand the underlying meanings and relationships in textual data. Here’s how they generally operate:
  1. Vector Encoding:
    • Both queries and documents are transformed into vectors in a high-dimensional space. This process is done by neural network-based encoders that have been trained to capture the semantic essence of text.
    • During training, these models are often exposed to vast amounts of text, allowing them to learn complex patterns and relationships between words and phrases.
  2. Semantic Matching:
    • The similarity between query and document vectors is calculated using measures such as cosine similarity. This allows the system to determine which documents are most relevant to a query, based on the content’s meaning rather than just keyword overlap.
    • This process can capture nuanced relationships, like synonyms or related concepts, that traditional methods might miss.
  • Advantages of Neural Retrievers:
    • Neural retrievers can understand the context in which terms are used, allowing for more accurate retrieval when queries or documents have ambiguous or multiple meanings.
    • They are adept at dealing with long and complex queries because they can grasp the overall intent rather than just isolated terms.
    • Many neural retrievers are trained on multilingual datasets, enabling them to handle queries in different languages effectively.
  • Challenges and Considerations:
    • Neural models, particularly those used for encoding large documents, require significant computational power for both training and inference.
    • The performance of neural retrievers depends heavily on the data they are trained on, and they may inherit biases present in the training data.
    • Keeping the document representations current is a challenge, especially for dynamically changing content.
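  • The encode-then-compare flow described above can be sketched in a few lines. This is a minimal illustration, not a real retriever: the three-dimensional vectors are hand-written stand-ins for encoder output, and the document ids and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings; a real encoder would produce hundreds of dimensions.
doc_vecs = {
    "car_repair": [0.9, 0.1, 0.0],
    "auto_maintenance": [0.8, 0.3, 0.1],
    "banana_bread": [0.0, 0.2, 0.9],
}
# Stands in for an encoded query such as "fix my vehicle".
query_vec = [1.0, 0.2, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
```

  • Note that the query shares no keywords with the document ids it matches; the ranking comes entirely from vector proximity, which is the point of semantic matching.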

Retrieval Augmented Generation (RAG)

  • With RAG, the LLM is able to leverage knowledge and information that is not necessarily in its weights by providing it access to external knowledge sources such as databases.
  • It leverages a retriever to find relevant contexts to condition the LLM. In this way, RAG is able to augment the knowledge base of an LLM with relevant documents.
  • The retriever here could be any of the following depending on the need for semantic retrieval or not:
    • Vector database: Typically, queries are embedded using models like BERT for generating dense vector embeddings. Alternatively, traditional methods like TF-IDF can be used for sparse embeddings. The search is then conducted based on term frequency or semantic similarity.
    • Graph database: Constructs a knowledge base from extracted entity relationships within the text. This approach is precise but may require exact query matching, which could be restrictive in some applications.
    • Regular SQL database: Offers structured data storage and retrieval but might lack the semantic flexibility of vector databases.
  • The image below from Damien Benveniste, PhD talks a bit about the difference between using Graph vs Vector database for RAG.

  • In his post linked above, Damien states that Graph Databases are favored for Retrieval Augmented Generation (RAG) when compared to Vector Databases. While Vector Databases partition and index data using LLM-encoded vectors, allowing for semantically similar vector retrieval, they may fetch irrelevant data.
  • Graph Databases, on the other hand, build a knowledge base from extracted entity relationships in the text, making retrievals concise. However, they require exact query matching, which can be limiting.
  • A potential solution could be to combine the strengths of both databases: indexing parsed entity relationships with vector representations in a graph database for more flexible information retrieval. It remains to be seen if such a hybrid model exists.

  • After retrieval, you may want to filter the candidates further by adding ranking and/or fine-ranking layers that remove candidates that do not match your business rules, are not personalized for the user or the current context, or exceed the response limit.
  • Let’s succinctly summarize the process of RAG and then delve into its pros and cons:
    1. Vector Database Creation: RAG starts by converting an internal dataset into vectors and storing them in a vector database (or a database of your choosing).
    2. User Input: A user provides a query in natural language, seeking an answer or completion.
    3. Information Retrieval: The retrieval mechanism scans the vector database to identify segments that are semantically similar to the user’s query (which is also embedded). These segments are then given to the LLM to enrich its context for generating responses.
    4. Combining Data: The chosen data segments from the database are combined with the user’s initial query, creating an expanded prompt.
    5. Generating Text: The enlarged prompt, filled with added context, is then given to the LLM, which crafts the final, context-aware response.
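  • The five steps above can be sketched end to end as follows. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a neural encoder, the in-memory `index` list stands in for a vector database, the sample chunks are invented, and the final LLM call is left out because it depends on the provider you choose.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Step 1: embed the internal dataset and store it (here, a plain list).
chunks = [
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "Firmware updates are installed from the admin page.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Step 3: return the k chunks most similar to the embedded query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, contexts):
    """Step 4: combine the retrieved chunks with the user's query."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context_block}\n\nQuestion: {query}")

# Step 2: the user's natural-language query.
query = "How do I reset the router?"
prompt = build_prompt(query, retrieve(query, k=1))
# Step 5 would send `prompt` to the LLM of your choice; that call is omitted.
```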
  • The image below (source) displays the high-level working of RAG.

  • So why should you use RAG for your application?
  • RAG doesn’t require model retraining, saving time and computational resources.
  • It’s effective even with a limited amount of labeled data.
  • However, it does have its drawbacks, namely RAG’s performance depends on the comprehensiveness and correctness of the retriever’s knowledge base.
  • RAG is best suited for scenarios with abundant unlabeled data but scarce labeled data and is ideal for applications like virtual assistants needing real-time access to specific information like product manuals.

  • Below, let’s take a look at the publication that introduced RAG and how the original paper implemented the framework.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  • The paper by Lewis et al. from Facebook AI Research, University College London, and New York University, introduces Retrieval-Augmented Generation (RAG) models combining pre-trained parametric and non-parametric memory for language generation tasks.
  • Addressing limitations of large pre-trained language models, such as difficulty in accessing and precisely manipulating knowledge, RAG models merge a pre-trained sequence-to-sequence (seq2seq) model with a dense vector index of Wikipedia, accessed by a neural retriever.
  • The RAG framework encompasses two models: RAG-Sequence, using the same retrieved document for the entire sequence, and RAG-Token, allowing different passages for each token.
  • The retrieval component, Dense Passage Retriever (DPR), uses a bi-encoder architecture with BERT-based document and query encoders. The generator component utilizes BART-large, a pre-trained seq2seq transformer with 400M parameters.
  • RAG models were trained jointly on the retriever and generator components without direct supervision on which documents to retrieve, using stochastic gradient descent with Adam. The training used a Wikipedia dump as the non-parametric knowledge source, split into 21M 100-word chunks.
  • In open-domain QA tasks, RAG established new state-of-the-art results, outperforming both parametric seq2seq models and task-specific retrieve-and-extract architectures. RAG models showed the ability to generate correct answers even when the right answer wasn’t in any retrieved document.
  • RAG-Sequence surpassed BART in Open MS-MARCO NLG, indicating less hallucination and more factually correct text generation. RAG-Token outperformed RAG-Sequence in Jeopardy question generation, demonstrating higher factuality and specificity.
  • On the FEVER fact verification task, RAG models achieved results close to state-of-the-art models that require more complex architectures and intermediate retrieval supervision.
  • This study showcases the effectiveness of hybrid generation models, combining parametric and non-parametric memories, offering new directions in combining these components for a range of NLP tasks.
  • Let’s summarize the methods and models used for query/document embedding and retrieval, as well as the end-to-end structure of the RAG framework as used in this paper, below:
    1. Query/Document Embedding:
      • The retrieval component, Dense Passage Retriever (DPR), follows a bi-encoder architecture.
      • DPR uses \(\text{BERT}_{\text{BASE}}\) as the foundation for both document and query encoders.
      • For a document \(z\), a dense representation \(d(z)\) is produced by a document encoder, \(\text{BERT}_d\).
      • For a query \(x\), a query representation \(q(x)\) is produced by a query encoder, \(\text{BERT}_q\).
      • The embeddings are created such that relevant documents for a given query are close in the embedding space, allowing effective retrieval.
    2. Retrieval Process:
      • The retrieval process involves calculating the top-k documents with the highest prior probability, which is essentially a Maximum Inner Product Search (MIPS) problem.
      • The MIPS problem is solved approximately in sub-linear time to efficiently retrieve relevant documents.
    3. End-to-End Structure:
      • The RAG model uses the input sequence \(x\) to retrieve text documents \(z\), which are then used as additional context for generating the target sequence \(y\).
      • The generator component is modeled using BART-large, a pre-trained seq2seq transformer with 400M parameters. BART-large combines the input \(x\) with the retrieved content \(z\) for generation.
      • The RAG-Sequence model uses the same retrieved document for generating the complete sequence, while the RAG-Token model can use different passages per token.
      • The training process involves jointly training the retriever and generator components without direct supervision on what document should be retrieved. The training minimizes the negative marginal log-likelihood of each target using stochastic gradient descent with Adam.
      • Notably, the document encoder \(\text{BERT}_d\) is kept fixed during training, avoiding the need for periodic updates of the document index.
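  • For reference, the retrieval prior and the two marginalizations summarized above can be written out as follows. The symbols \(p_\eta\) (retriever), \(p_\theta\) (generator), and \(N\) (target length) follow the notation of Lewis et al. and are not introduced elsewhere in this summary.

\[
p_\eta(z \mid x) \propto \exp\!\big(d(z)^\top q(x)\big)
\]

\[
p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})
\]

\[
p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\]

  • The only difference is the order of the sum and the product: RAG-Sequence marginalizes over documents once for the whole output, while RAG-Token marginalizes at every token, which is what lets it draw on different passages within one answer. Because the prior is an exponentiated inner product, finding the top-\(k\) documents is exactly the MIPS problem mentioned above.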

MuRAG: Multimodal Retrieval-Augmented Generator

  • MuRAG looks to extend the retrieval process beyond text to include other modalities like images or structured data, which can then be used alongside textual information to inform the generation process.
  • MuRAG’s magic lies in its two-phase training approach: pre-training and fine-tuning, each carefully crafted to build the model’s ability to tap into a vast expanse of multimodal knowledge.
  • The key goal of MuRAG is to incorporate both visual and textual knowledge into language models to improve their capability for multimodal question answering.
  • MuRAG has a dual-encoder architecture, consisting of a visual transformer (ViT) and a text encoder (T5). The encoders embed images, text snippets, and questions into a joint multimodal space.
  • For retrieval, MuRAG uses maximum inner product search to find the top-K most relevant image-text pairs from the memory given a question.
    • The “memory” here refers to the external knowledge base that the model can retrieve information from.
    • Specifically, the memory contains a large collection of image-text pairs that are encoded offline by the backbone encoder prior to training.
    • During training and inference, given a question, MuRAG’s retriever module will search through this memory to find the most relevant image-text pairs using maximum inner product search.
    • The memory serves as the knowledge source and can contain various types of multimodal data like images with captions, passages of text, tables, etc. that are related to the downstream task.
    • For example, when fine-tuning on the WebQA dataset, the memory contains 1.1 million image-text pairs extracted from Wikipedia that the model can retrieve from to answer questions.
    • So in summary, the memory is the large non-parametric external knowledge base encoded in a multimodal space that MuRAG learns to retrieve relevant knowledge from given a question, in order to augment its language generation capabilities. The memory provides the world knowledge to complement what is stored implicitly in the model’s parameters.
  • For reading, the retrieved multimodal context is combined with the question embedding and fed into the decoder to generate an answer.
  • MuRAG is pre-trained on a mixture of image-text data (LAION, Conceptual Captions), visual question answering data (VQA), and text-only question answering data (PAQ). It uses a contrastive loss for retrieving relevant knowledge and a generation loss for answer prediction.
  • MuRAG achieves state-of-the-art results on two multimodal QA datasets, WebQA and MultimodalQA, outperforming existing models by 10-20% absolute accuracy. It demonstrates the value of incorporating both visual and textual knowledge.
  • Key limitations are the reliance on large-scale pre-training data, computational costs, and issues in visual reasoning like counting objects. But overall, MuRAG represents an important advance in building visually-grounded language models.
  • The image below from the original paper (source) shows how the model taps into an external repository to retrieve a diverse range of knowledge encapsulated within both images and textual fragments. This multimodal information is then employed to enhance the generative process. The upper section outlines the setup for the pre-training phase, whereas the lower section specifies the framework for the fine-tuning phase.

Ensemble of RAG

  • Leveraging an ensemble of RAG systems offers a substantial upgrade to the model’s ability to produce rich and contextually accurate text. Here’s an enhanced breakdown of how this procedure could work:
  • Knowledge sources: RAG models retrieve information from external knowledge stores to augment their knowledge in a particular domain. These can include passages, tables, images, etc. from domains like Wikipedia, books, news, databases.
  • Combining sources: At inference time, multiple retrievers can pull relevant content from different corpora. For example, one retriever searches Wikipedia, another searches news sources. Their results are concatenated into a pooled set of candidates.
  • Ranking: The model ranks the pooled candidates by their relevance to the context.
  • Selection: Highly ranked candidates are selected to condition the language model for generation.
  • Ensembling: Separate RAG models specialized on different corpora can be ensembled. Their outputs are merged, ranked, and voted on.
  • Multiple knowledge sources can augment RAG models through pooling and ensembles. Careful ranking and selection helps integrate these diverse sources for improved generation.
  • One thing to keep in mind when using multiple retrievers is to rank the outputs from each retriever before merging them to form a response. This can be done in a variety of ways: with learning-to-rank (LTR) algorithms, a multi-armed bandit framework, multi-objective optimization, or rules specific to your business use case.
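  • As one concrete way to merge ranked outputs from several retrievers, here is a minimal sketch of reciprocal rank fusion (RRF). RRF is not one of the methods named above; it is swapped in here because it is simple and needs no training. The document ids and the two result lists are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

wiki_results = ["doc_a", "doc_b", "doc_c"]   # hypothetical Wikipedia retriever output
news_results = ["doc_b", "doc_d", "doc_a"]   # hypothetical news retriever output

fused = reciprocal_rank_fusion([wiki_results, news_results])
# fused is ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

  • Documents that appear near the top of several lists rise to the top of the fused list, and only ranks are used, so the retrievers’ raw scores do not need to be on a comparable scale.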

HyDE: Alternative to retrieval

  • Published in Precise Zero-Shot Dense Retrieval without Relevance Labels, HyDE, or Hypothetical Document Embeddings, is an innovative retrieval technique that generates a hypothetical document in response to a query using an instruction-following language model, and then encodes this document into an embedding vector for retrieving similar real documents from a corpus.
  • In the RAG framework, HyDE can be employed during the retrieval phase. Instead of conventional retrieval methods, HyDE can generate hypothetical documents in response to a query, which are then encoded into embeddings. These embeddings can be used to retrieve relevant documents from the corpus, effectively replacing the standard retrieval mechanism in RAG with one that is adept at handling zero-shot scenarios.
  • HyDE’s ability to generate a hypothetical document based on the query can provide more contextually rich and nuanced information. This can be particularly beneficial for the RAG’s generation component, leading to responses that are better informed and more accurately aligned with the query’s intent.
  • One possible pitfall of HyDE is that it can potentially “hallucinate” in the sense that it generates hypothetical documents that may contain invented or inaccurate details. This phenomenon occurs because HyDE uses an instruction-following language model, like InstructGPT, to generate a document based on a query. The generated document is intended to capture the relevance patterns of the query, but since it’s created without direct reference to real-world data, it can include false or fictional information. This aspect of HyDE is a trade-off for its ability to operate in zero-shot retrieval scenarios, where it creates a contextually relevant but not necessarily factually accurate document to guide the retrieval process.
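  • The HyDE flow can be sketched as below. Everything model-related is an assumption made so the example runs offline: `generate_hypothetical_document` is a hypothetical stand-in that returns a canned passage instead of calling an instruction-following LLM, `embed` is a bag-of-words stand-in for a dense encoder, and the corpus is invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def generate_hypothetical_document(query):
    """Hypothetical stand-in for an instruction-following LLM call.

    Returns a canned passage regardless of the query. A real generated
    passage may contain invented details, which is the HyDE trade-off.
    """
    return ("Insulin is a hormone produced by the pancreas that lowers "
            "blood sugar by helping cells absorb glucose.")

corpus = [
    "The pancreas releases insulin, a hormone that helps cells absorb glucose from the blood.",
    "Regular exercise and a balanced diet support overall health.",
    "Glaciers form when snow accumulates and compresses into ice.",
]

def hyde_retrieve(query, k=1):
    """Embed the hypothetical document instead of the query, then search."""
    hypo_vec = embed(generate_hypothetical_document(query))
    ranked = sorted(corpus, key=lambda doc: cosine(hypo_vec, embed(doc)),
                    reverse=True)
    return ranked[:k]

top = hyde_retrieve("What regulates sugar levels in the body?")
```

  • The query itself shares almost no vocabulary with the relevant passage; the hypothetical document does. Only real corpus documents are returned, so any invented detail in the hypothetical text influences the search but is never shown to the user directly.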

Intuition on how to use RAG

  • In the sections below, we will go over some key topics to keep in mind while designing your own RAG pipeline:


  • Chunking is the process of dividing the prompts and/or the documents to be retrieved into smaller, manageable segments or chunks. These chunks can be defined by a fixed size, such as a specific number of characters, sentences, or paragraphs.
  • In RAG, each chunk is encoded into an embedding vector for retrieval. Smaller, more precise chunks lead to a finer match between the user’s query and the content, enhancing the accuracy and relevance of the information retrieved.
  • Larger chunks might include irrelevant information, introducing noise and potentially reducing the retrieval accuracy. By controlling the chunk size, RAG can maintain a balance between comprehensiveness and precision.
  • So the next natural question is: how do you choose the right chunk size for your use case? The choice of chunk size in RAG is crucial. It needs to be small enough to ensure relevance and reduce noise but large enough to maintain the context’s integrity. Let’s look at a few methods below, drawn from [Pinecone](https://www.pinecone.io/learn/chunking-strategies/):
    • Fixed-size chunking: Simply decide the size of each chunk (in tokens or characters) along with whether there should be overlap between chunks. Overlap helps limit the loss of semantic context at chunk boundaries. This option is computationally cheap and simple to implement.
      from langchain.text_splitter import CharacterTextSplitter

      text = "..."  # your text
      text_splitter = CharacterTextSplitter(
          separator="\n\n",    # split on blank lines first
          chunk_size=256,      # maximum chunk length, in characters
          chunk_overlap=20,    # characters shared between adjacent chunks
      )
      docs = text_splitter.create_documents([text])
    • Content-aware chunking: Content-aware chunking leverages the intrinsic structure of the text to create chunks that are more meaningful and contextually relevant. Here are several approaches to achieving this:
      1. Sentence Splitting: This method aligns with models optimized for embedding sentence-level content. Different tools and techniques can be used for sentence splitting:
        • Naive Splitting: A basic method where sentences are split on periods (and, optionally, new lines). Example:
          text = "..."  # Your text
          docs = text.split(".")
          • This method is quick but mishandles abbreviations, decimals, and other periods that do not end a sentence.
        • NLTK (Natural Language Toolkit): A comprehensive Python library for language processing. NLTK includes a sentence tokenizer that effectively splits text into sentences. Example:
          text = "..."  # Your text
          from langchain.text_splitter import NLTKTextSplitter
          text_splitter = NLTKTextSplitter()
          docs = text_splitter.split_text(text)
        • spaCy: An advanced Python library for NLP tasks, spaCy offers efficient sentence segmentation. Example:
          text = "..."  # Your text
          from langchain.text_splitter import SpacyTextSplitter
          text_splitter = SpacyTextSplitter()
          docs = text_splitter.split_text(text)
      2. Recursive Chunking: Recursive chunking is an iterative method that splits text hierarchically using various separators. It adapts to create chunks of similar size or structure by recursively applying different criteria. Example using LangChain:

           from langchain.text_splitter import RecursiveCharacterTextSplitter

           text = "..."  # Your text
           text_splitter = RecursiveCharacterTextSplitter(
               chunk_size=256,     # maximum chunk length, in characters
               chunk_overlap=20,   # characters shared between adjacent chunks
           )
           docs = text_splitter.create_documents([text])
      3. Specialized Chunking: For formatted content like Markdown or LaTeX, specialized chunking can be applied to maintain the original structure:

        • Markdown Chunking: Recognizes Markdown syntax and divides content based on structure. Example:
          from langchain.text_splitter import MarkdownTextSplitter
          markdown_text = "..."
          markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
          docs = markdown_splitter.create_documents([markdown_text])
        • LaTeX Chunking: Parses LaTeX commands and environments to chunk content while preserving its logical organization.
  • “As a rule of thumb, if the chunk of text makes sense without the surrounding context to a human, it will make sense to the language model as well. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.” (source)


  • Once you have your prompt chunked appropriately, the next step is to embed it. Embedding prompts and documents in RAG involves transforming both the user’s query (prompt) and the documents in the knowledge base into a format that can be effectively compared for relevance. This process is critical for RAG’s ability to retrieve the most relevant information from its knowledge base in response to a user query.
  • One option to help pick which embedding model would be best suited for your task is to look at HuggingFace’s Massive Text Embedding Benchmark (MTEB) leaderboard. There is also the question of whether to use a dense or a sparse embedding, so let’s look at the benefits of each below:
  • Sparse embedding: Sparse embeddings such as TF-IDF are great for lexically matching the prompt with the documents. They are best for applications where keyword relevance is crucial. They are computationally less intensive but may not capture the deeper semantic meanings in the text.
  • Semantic embedding: Semantic embeddings, such as BERT or SentenceBERT, lend themselves naturally to the RAG use case.
    • BERT: Suitable for capturing contextual nuances in both the documents and queries. Requires more computational resources but offers more semantically rich embeddings.
    • SentenceBERT: Ideal for scenarios where the context and meaning at the sentence level are important. It strikes a balance between the deep contextual understanding of BERT and the need for concise, meaningful sentence representations.
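  • To make the sparse option concrete, here is a minimal TF-IDF sketch in plain Python. It uses raw term frequency and an unsmoothed `log(N / df)` inverse document frequency; library implementations usually add smoothing and normalization, so their numbers will differ. The helper names and the three sample documents are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    weight = term frequency * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
vectors = tfidf_vectors(docs)
```

  • Each vector only has entries for terms that occur in its document, which is what makes the representation sparse. A term that appears in every document (here, "the") gets weight zero, while rarer terms such as "mat" get the highest weights; this is why TF-IDF is strong at keyword matching but blind to synonyms.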


Lost in the Middle: How Language Models Use Long Contexts

  • This paper provides several insights that are particularly relevant when designing a RAG system:
  1. Context Utilization: Language models often struggle to effectively use information placed in the middle of long contexts. This suggests that in a RAG system, the placement of relevant information within the input context is crucial. Ideally, important information should be positioned either at the beginning or the end of the input sequence to enhance model performance.

  2. Model Architecture Consideration: The study indicates that encoder-decoder models are more robust in handling various positions of relevant information within their trained context windows compared to decoder-only models. Therefore, choosing an encoder-decoder architecture for a RAG system might be more effective for tasks requiring extensive context processing.

  3. Query-Aware Contextualization: Placing the query both before and after the data (query-aware contextualization) significantly improves performance in key-value retrieval tasks. While its impact on multi-document question answering is limited, this approach can still be beneficial for specific types of retrieval tasks in a RAG system.

  4. Performance Trade-offs with Context Length: The research highlights that more context isn’t always better. There’s a trade-off between providing additional context and the model’s ability to process it effectively. In a RAG system, this implies carefully balancing the amount of context to ensure optimal performance without overwhelming the model.

  5. Instruction Fine-Tuning Effects: The paper explores the impact of instruction fine-tuning on model performance. While fine-tuning can slightly reduce performance disparities, it does not fundamentally alter how models handle long input contexts. This suggests that while fine-tuning can be beneficial, it should be combined with other strategies to optimize context usage in a RAG system.
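  • Two of these insights translate directly into prompt construction. The sketch below shows one possible way to act on them, not a method from the paper: `reorder_for_long_context` puts the most relevant retrieved documents at the edges of the context, and `query_aware_prompt` repeats the query before and after the documents. Both function names and the prompt layout are hypothetical.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the edges of the context.

    Input is sorted from most to least relevant. Documents are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]

def query_aware_prompt(query, docs):
    """Repeat the query before and after the documents."""
    body = "\n\n".join(docs)
    return f"Question: {query}\n\n{body}\n\nQuestion: {query}\nAnswer:"

ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
ordered = reorder_for_long_context(ranked)
# ordered is ['d1', 'd3', 'd5', 'd4', 'd2']
prompt = query_aware_prompt("What is RAG?", ordered)
```

  • As noted above, the paper found query-aware contextualization most helpful for key-value retrieval and of limited benefit for multi-document question answering, so it is worth measuring on your own task before adopting it.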

Generating the response

  • The last step of the RAG pipeline is to generate responses back to the user. Below, we will look at how to go about this.
  • Start with a dataset where each instance includes a query, the retrieved documents, and the ideal response.
  • Train the model end-to-end, where the input is the query and the output is the desired response. The model learns to use the context provided by the retrieved documents to generate this response.
  • Use a suitable loss function, such as cross-entropy loss, that penalizes the model for incorrect responses and encourages it to learn from the context provided by the retrieved documents.
  • Fine-tune the model on datasets specific to the domain or task it will be used for.
  • Optimize hyperparameters for best performance, balancing between the quality of generation and computational efficiency.
  • Implement strategies to prevent overfitting, ensuring the model generalizes well to new queries.
  • At inference time, adjust parameters like temperature and top-k sampling to control the creativity and randomness of responses.
  • Optionally, implement a system for continuous learning where the model is periodically updated with new data.
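  • The inference-time parameters mentioned above, temperature and top-k sampling, can be illustrated with a small sketch. The logits are made-up scores for four candidate next tokens, and the function names are hypothetical; real decoding libraries expose these controls as parameters rather than requiring you to implement them.

```python
import math
import random

def top_k_temperature_probs(logits, temperature=1.0, k=None):
    """Turn raw logits into sampling probabilities.

    Lower temperature sharpens the distribution; top-k keeps only the k
    highest-scoring tokens and renormalizes over them.
    """
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]
    scaled = [(tok, logit / temperature) for tok, logit in items]
    max_logit = max(v for _, v in scaled)  # subtract max for numerical stability
    exps = [(tok, math.exp(v - max_logit)) for tok, v in scaled]
    total = sum(e for _, e in exps)
    return {tok: e / total for tok, e in exps}

def sample(probs, rng=random):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 4.0, "Lyon": 2.0, "Berlin": 1.0, "banana": -1.0}  # made-up scores
sharp = top_k_temperature_probs(logits, temperature=0.5, k=2)
flat = top_k_temperature_probs(logits, temperature=2.0, k=2)
token = sample(sharp, random.Random(0))  # seeded for reproducibility
```

  • With `k=2`, only "Paris" and "Lyon" can ever be sampled. At temperature 0.5 nearly all of the probability mass sits on "Paris", while at temperature 2.0 "Lyon" is chosen noticeably more often, which is the creativity versus determinism trade-off described above.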


  title   = {Retrieval Augmented Generation},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}