Overview

  • Prompt engineering, also known as in-context prompting, is a method for steering an LLM’s behavior towards a particular outcome without updating the model’s weights/parameters. It’s the process of effectively communicating with AI about desired results.
  • Prompt engineering is used on a variety of tasks from question answering to arithmetic reasoning. It allows us to understand the limitations and the capabilities of LLMs.

Prompts

  • In order to understand prompt engineering, let’s take a step back and clarify what prompting (and prompts) are.
  • Prompts are the initial text inputs that a model receives in order to generate a response or complete a task.
  • Prompts are the instructions we give an AI system or chatbot, such as ChatGPT, in order to perform a task. Prompts can target many kinds of tasks, such as summarization or solving arithmetic problems; more often than not, however, they consist of questions.
  • Thus, prompt engineering aims to craft these prompts so that the model achieves high accuracy and relevance in its outputs.
  • Below, we will look at some of the most common prompting techniques, starting with the two most common ones: zero-shot and few-shot prompting.

Zero-shot Prompting

  • Zero-shot learning involves feeding the task to the model without any examples that indicate the desired output, hence the name zero-shot. For example, we could feed a model a sentence and expect it to output the sentiment of that sentence.
  • Let’s look at an example below from DAIR-AI:
- Prompt: Classify the text into neutral, negative, or positive. Text: I think the vacation is okay.
- Output: Neutral
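
As a minimal sketch, the zero-shot prompt above can be assembled programmatically. The function name `build_zero_shot_prompt` is a hypothetical helper introduced here for illustration; it only builds the prompt string and does not call any model.

```python
from typing import Sequence


def build_zero_shot_prompt(text: str, labels: Sequence[str]) -> str:
    """Build a classification prompt that contains no worked examples."""
    if len(labels) < 2:
        raise ValueError("need at least two labels to classify")
    if len(labels) == 2:
        label_list = f"{labels[0]} or {labels[1]}"
    else:
        # Join as "a, b, or c" to mirror the wording of the example prompt.
        label_list = ", ".join(labels[:-1]) + f", or {labels[-1]}"
    return f"Classify the text into {label_list}. Text: {text}"


prompt = build_zero_shot_prompt(
    "I think the vacation is okay.", ["neutral", "negative", "positive"]
)
print(prompt)
```

The instruction and the input text are the only things in the prompt; no labeled demonstration is included, which is what makes it zero-shot.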

Few-shot Prompting

  • Few-shot learning, on the other hand, involves providing the model with a small number of high-quality examples that include both input and desired output for the target task. By seeing these good examples, the model can better understand the human intention and criteria for generating accurate outputs. As a result, few-shot learning often leads to better performance compared to zero-shot learning. However, this approach can consume more tokens and may encounter context length limitations when dealing with long input and output text.
  • Large language models, such as GPT-3, excel in zero-shot capabilities. However, for complex tasks where we see degraded performance, few-shot learning comes to the rescue! To enhance performance, we perform in-context learning using few-shot prompting by offering demonstrations in the prompt that guide the model to carry out the task. In other words, conditioning the model on a selection of task-specific examples helps improve the model’s performance.
  • Let’s look at an example below from Brown et al.:

- Prompt:
- A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:

- Output:
- When we won the game, we all started to farduddle in celebration.

  • We can see from the prompt above that the model was given a single demonstration (the "whatpu" sentence) and was then able to generate the answer for the next, unseen word.
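
The structure of such a prompt can be sketched as follows. This is an illustrative helper under simple assumptions: demonstrations are (input, output) pairs joined by newlines, and the name `build_few_shot_prompt` is hypothetical rather than part of any library.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Concatenate (input, output) demonstrations, then the unanswered query."""
    parts = [f"{demo_input}\n{demo_output}" for demo_input, demo_output in demos]
    # The query goes last with no output, so the model continues the pattern.
    parts.append(query)
    return "\n".join(parts)


demos = [
    (
        'A "whatpu" is a small, furry animal native to Tanzania. '
        "An example of a sentence that uses the word whatpu is:",
        "We were traveling in Africa and we saw these very cute whatpus.",
    )
]
query = (
    'To do a "farduddle" means to jump up and down really fast. '
    "An example of a sentence that uses the word farduddle is:"
)
prompt = build_few_shot_prompt(demos, query)
print(prompt)
```

Adding more pairs to `demos` turns the one-shot prompt into a few-shot one, at the cost of more tokens per request.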

Chain-of-Thought Prompting

  • Chain-of-Thought (CoT) prompting generates a sequence of short sentences known as reasoning chains.
  • These chains describe the step-by-step reasoning that leads to the final answer. The benefits are most pronounced for complex reasoning tasks and for larger models.
  • We will look at the two basic types of CoT prompting below.

Few-shot CoT

  • Few-shot CoT allows the model to view a few demonstrations of high-quality reasoning chains.
  • Let’s look at the example below:
Question: Tom and Elizabeth have a competition to climb a hill. Elizabeth takes 30 minutes to climb the hill. Tom takes four times as long as Elizabeth does to climb the hill. How many hours does it take Tom to climb up the hill?
Answer: It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill.
It takes Tom 120/60 = <<120/60=2>>2 hours to climb the hill.
So the answer is 2.
===
Question: Jack is a soccer player. He needs to buy two pairs of socks and a pair of soccer shoes. Each pair of socks cost $9.50, and the shoes cost $92. Jack has $40. How much more money does Jack need?
Answer: The total cost of two pairs of socks is $9.50 x 2 = $<<9.5*2=19>>19.
The total cost of the socks and the shoes is $19 + $92 = $<<19+92=111>>111.
Jack need $111 - $40 = $<<111-40=71>>71 more.
So the answer is 71.
===
Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?
Answer:
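
A few-shot CoT prompt in this layout can be assembled and its completions post-processed roughly as sketched below. All helper names are hypothetical. The sketch assumes that demonstrations are separated by `===`, that the completion ends with "So the answer is N.", and that each `<<...>>` annotation contains a single binary operation, as in the demonstrations above. The completion string used at the end is a hand-written stand-in, not real model output.

```python
import operator
import re
from typing import List, Optional, Sequence, Tuple

SEPARATOR = "\n===\n"  # delimiter between demonstrations, as in the example above
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
# Matches annotations such as <<30*4=120>>; only one binary operation is supported.
CALC = re.compile(r"<<\s*([\d.]+)\s*([-+*/])\s*([\d.]+)\s*=\s*([\d.]+)\s*>>")


def build_few_shot_cot_prompt(demos: Sequence[Tuple[str, str]], question: str) -> str:
    """Join worked (question, reasoning) pairs, then leave the last answer open."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    blocks.append(f"Question: {question}\nAnswer:")
    return SEPARATOR.join(blocks)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the number following 'So the answer is', or None if it is absent."""
    match = re.search(r"So the answer is\s*\$?(-?[\d,]*\.?\d+)", completion)
    return match.group(1).replace(",", "") if match else None


def check_calculations(reasoning: str) -> List[Tuple[str, bool]]:
    """Recompute every <<a op b=c>> annotation and report whether c is correct."""
    results = []
    for left, op, right, claimed in CALC.findall(reasoning):
        value = OPS[op](float(left), float(right))
        is_correct = abs(value - float(claimed)) < 1e-9
        results.append((f"{left}{op}{right}={claimed}", is_correct))
    return results


demos = [
    (
        "Tom and Elizabeth have a competition to climb a hill. Elizabeth takes "
        "30 minutes to climb the hill. Tom takes four times as long as Elizabeth "
        "does to climb the hill. How many hours does it take Tom to climb up the hill?",
        "It takes Tom 30*4 = <<30*4=120>>120 minutes to climb the hill.\n"
        "It takes Tom 120/60 = <<120/60=2>>2 hours to climb the hill.\n"
        "So the answer is 2.",
    )
]
question = (
    "Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. "
    "Each of the cut parts must be divided into 5 equal parts. "
    "How long will each final cut be?"
)
prompt = build_few_shot_cot_prompt(demos, question)

# Hand-written stand-in for a model completion (an assumption, not real output).
hypothetical_completion = (
    " Each of the 4 parts is 100/4 = <<100/4=25>>25 centimeters long.\n"
    "Each final cut is 25/5 = <<25/5=5>>5 centimeters long.\n"
    "So the answer is 5."
)
final_answer = extract_final_answer(hypothetical_completion)
checks = check_calculations(hypothetical_completion)
```

Recomputing the annotated steps is a cheap sanity check: a chain whose arithmetic does not hold can be rejected even when a final answer is present.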

Zero-shot CoT

  • Zero-shot CoT, introduced by Kojima et al. (2022), involves appending “Let’s think step by step” to the prompt, which helps improve model performance. Let’s look at an example below:
- Prompt:
- I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
- Let's think step by step.
- Output:
- First, you started with 10 apples.
- You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
- Then you bought 5 more apples, so now you had 11 apples.
- Finally, you ate 1 apple, so you would remain with 10 apples.
  • The step-by-step output makes the model’s stated reasoning visible, so each intermediate step can be checked on the way to the final answer.
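
The idea can be sketched as a two-stage call: first elicit the reasoning with the trigger phrase, then feed that reasoning back and ask only for the final answer. The `complete` parameter is an assumed callable that maps a prompt string to a completion string; `canned_model` is a stand-in that returns fixed text so the sketch runs without any model, and the exact wording of the second-stage prompt is an assumption.

```python
from typing import Callable, List

TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Two-stage zero-shot CoT: elicit reasoning, then extract the answer."""
    # Stage 1: append the trigger phrase so the model writes out its reasoning.
    reasoning_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = complete(reasoning_prompt).strip()
    # Stage 2: show the reasoning back to the model and ask for the answer only.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return complete(answer_prompt).strip()


seen_prompts: List[str] = []


def canned_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns fixed text for this demo only."""
    seen_prompts.append(prompt)
    if prompt.endswith("Therefore, the answer is"):
        return " 10"
    return (
        " First, you started with 10 apples. You gave away 2 apples to the "
        "neighbor and 2 to the repairman, so you had 6 apples left. Then you "
        "bought 5 more apples, so now you had 11 apples. Finally, you ate 1 "
        "apple, so you would remain with 10 apples."
    )


question = (
    "I went to the market and bought 10 apples. I gave 2 apples to the neighbor "
    "and 2 to the repairman. I then went and bought 5 more apples and ate 1. "
    "How many apples did I remain with?"
)
answer = zero_shot_cot(question, canned_model)
```

The arithmetic in the example can be verified directly: 10 - 2 - 2 + 5 - 1 = 10.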

Instruction Prompting and Tuning

  • Instruction prompting is by far the most common use case of LLMs, especially for chatbots such as ChatGPT. Here is an example of instruction prompting:
- Prompt: Define Onomatopoeia in one sentence.
- Output: Onomatopoeia is the use of words that imitate or suggest the natural sound of a thing or action.
  • Instruction tuning seeks to offer instruction prompt examples to the LLM so it can close the train-test discrepancy (where the model was trained on web-scale corpora and tested mostly on instructions) and mimic the real-world usage scenario of chatbots. Stanford’s Alpaca is a recent example that uses instruction tuning to offer performance similar to OpenAI’s GPT-3.5 (without performing RLHF, unlike GPT-3.5).
  • Instruction tuning finetunes a pretrained model with tuples of (task instruction, input, ground truth output) to enable the model to be better aligned with user intention and to follow instructions. “When interacting with instruction models, we should describe the task requirement in detail, trying to be specific and precise, clearly specifying what to do (rather than saying not to do something)” (source).
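
To make the (task instruction, input, ground truth output) tuple concrete, the sketch below turns one such tuple into a prompt/completion pair for finetuning. The template with `### Instruction:` style headers is an illustrative assumption, not the format of any particular model, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class InstructionExample:
    instruction: str  # what the model should do
    input_text: str   # optional context; empty string when not needed
    output: str       # ground truth response


def format_example(example: InstructionExample) -> Dict[str, str]:
    """Render one tuple into the text the model is trained on."""
    prompt = f"### Instruction:\n{example.instruction}\n\n"
    if example.input_text:
        # The input section is included only when the task has extra context.
        prompt += f"### Input:\n{example.input_text}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": example.output}


record = format_example(
    InstructionExample(
        instruction="Define Onomatopoeia in one sentence.",
        input_text="",
        output=(
            "Onomatopoeia is the use of words that imitate or suggest "
            "the natural sound of a thing or action."
        ),
    )
)
```

During finetuning, the model would typically be trained to produce `completion` given `prompt`; at inference time the same template is filled in and the response is left for the model to generate.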

Recursive Prompting

  • Recursive prompting refers to a method of problem-solving that involves breaking down a complex problem into smaller, more manageable sub-problems, which are then solved recursively through a series of prompts.
  • This approach can be particularly useful for tasks that require compositional generalization, where a language model must learn to combine different pieces of information to solve a problem.
  • In the context of natural language processing, recursive prompting can involve using a few-shot prompting approach to decompose a complex problem into sub-problems, and then sequentially solving the extracted sub-problems using the solution to the previous sub-problems to answer the next one. This approach can be used for tasks such as math problems or question answering, where a language model needs to be able to break down complex problems into smaller, more manageable parts to arrive at a solution.
  • For example, finding the area of a rectangle can be split into two prompts, where the answer to the first sub-problem feeds into the second:
Calculate the product of the length and width:
prompt: "What is the product of 8 and 6?"
answer: 48

Substitute the given values for length and width into the equation:
prompt: "What is the area of a rectangle with length 8 and width 6?"
answer: "The area of a rectangle with length 8 and width 6 is 48."
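
The sequential variant can be sketched as a loop that appends each solved sub-problem to the context of the next prompt. The `complete` callable and the `Q:`/`A:` layout are assumptions made for illustration, and `toy_model` is a hard-coded stand-in for an LLM that answers the second question only when the first answer is present in its prompt.

```python
from typing import Callable, List, Sequence, Tuple


def solve_recursively(
    subquestions: Sequence[str], complete: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Solve sub-problems in order, exposing earlier answers to later prompts."""
    solved: List[Tuple[str, str]] = []
    for subquestion in subquestions:
        # Earlier question/answer pairs become context for the current one.
        history = "".join(f"Q: {q}\nA: {a}\n" for q, a in solved)
        prompt = f"{history}Q: {subquestion}\nA:"
        solved.append((subquestion, complete(prompt).strip()))
    return solved


def toy_model(prompt: str) -> str:
    """Hard-coded stand-in for an LLM, for this demo only."""
    last_question = prompt.rsplit("Q: ", 1)[-1]
    if last_question.startswith("What is the product of 8 and 6?"):
        return " 48"
    if "area of a rectangle" in last_question and "A: 48" in prompt:
        return " The area of a rectangle with length 8 and width 6 is 48."
    return " I don't know."


steps = solve_recursively(
    [
        "What is the product of 8 and 6?",
        "What is the area of a rectangle with length 8 and width 6?",
    ],
    toy_model,
)
```

Passing only the second question to `solve_recursively` makes `toy_model` answer "I don't know.", which illustrates the dependency between sub-problems.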
  • The following image (source) shows multiple examples of recursive prompting:

Large prompt context

  • Anthropic announced that they are expanding Claude’s context window to 100k tokens, roughly tripling GPT-4’s maximum of 32k.
  • For scale: The first Harry Potter book has 76,944 words, which is ~100k tokens after tokenization.
  • Larger context windows significantly elevate LLMs’ capabilities across a wide range of applications:
    1. Improved comprehension of lengthy and complex texts: by accessing a greater portion of the text, LLMs can generate responses and create content that is contextually relevant, more accurate, comprehensive, and coherent. This opens the door for processing extensive documents such as academic articles or legal contracts with more accuracy.
    2. Reduced need for fine-tuning: longer prompts can support advanced prompting techniques such as Chain of Thought and Few-Shot Learning, improving the LLM’s performance at inference time.
    3. Enhanced ability to summarize and synthesize information: with a greater understanding of entire documents, LLMs can generate summaries that encapsulate the key findings more accurately.
    4. Improved context: conversational AI systems often struggle to maintain context during extended interactions. A larger context window can store more significant portions of the conversation history, leading to more coherent and contextually appropriate responses.
  • Over time, this could gradually diminish the need for vector store approaches for external knowledge retrieval in LLMs because you could now include the information as regular input.
  • It will likely make LLMs more efficient few-shot learners as well since more examples can now be provided via the context. However, this will likely not be a replacement for finetuning yet. Finetuning not only optimizes LLMs for domain-specific datasets, but it also helps to optimize them for a target task.
  • As an analogy, a person who specifically studied for a math exam will perform better than a random person who is only given past exams as examples without studying. Moreover, you can combine the two: apply in-context learning to finetuned models (a person who studied the exam subject and also uses past exams as examples).
  • MosaicML also announced MPT-7B-StoryWriter-65k+, an LLM that can handle 65k tokens.
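
The scale comparison above implies roughly 1.3 tokens per word (76,944 words for about 100k tokens). Under that assumed ratio, the sketch below estimates token counts from word counts and greedily packs as many few-shot examples as fit in a context window. Real counts depend on the tokenizer, so the ratio and the helper names are illustrative assumptions only.

```python
from typing import List, Sequence

# Assumed ratio of about 1.3 tokens per word, stored as an integer to avoid
# floating-point rounding; real tokenizers will differ.
TOKENS_PER_100_WORDS = 130


def estimate_tokens(word_count: int) -> int:
    """Conservative (rounded-up) token estimate from a word count."""
    return (word_count * TOKENS_PER_100_WORDS + 99) // 100


def pack_examples(
    examples: Sequence[str],
    query: str,
    context_window: int,
    reserved_for_output: int,
) -> List[str]:
    """Keep examples in order until the estimated token budget is used up."""
    budget = context_window - reserved_for_output - estimate_tokens(len(query.split()))
    chosen: List[str] = []
    for example in examples:
        cost = estimate_tokens(len(example.split()))
        if cost > budget:
            break
        chosen.append(example)
        budget -= cost
    return chosen


book_tokens = estimate_tokens(76_944)
examples = ["one two three four five"] * 10
chosen = pack_examples(examples, "a b c", context_window=50, reserved_for_output=10)
```

With a larger `context_window`, more of the candidate examples survive the packing step, which is the sense in which longer contexts make few-shot prompting more practical.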

Further Reading/Viewing

References