Aman's AI Journal • Primers • LLM/VLM Benchmarks

Overview
Large Language Models (LLMs)
Vision-Language Models (VLMs)
- General Benchmarks
- Medical VLM Benchmarks
  - Medical Image Annotation and Retrieval
  - Disease Classification and Detection
Common Challenges Across Benchmarks
Citation

Overview

Large Language Models (LLMs) and Vision-Language Models (VLMs) are evaluated across a wide array of benchmarks that test their abilities in language understanding, reasoning, coding, and, in the case of VLMs, multimodal understanding.
These benchmarks are crucial for the development of AI models as they provide standardized challenges that help identify both strengths and weaknesses, driving improvements in future iterations.
This primer offers an overview of these benchmarks, attributes of their datasets, and relevant papers.

Large Language Models (LLMs)

General Benchmarks

Language Understanding

GLUE (General Language Understanding Evaluation): A set of nine tasks including question answering and textual entailment, designed to gauge general language understanding.
- Dataset Attributes: Diverse text genres from web text, fiction, and non-fiction, requiring models to handle a variety of language styles and complexities. The tasks range from single-sentence tasks (e.g., CoLA for linguistic acceptability) to sentence-pair tasks (e.g., MRPC for paraphrase detection).
- Reference: “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”.
SuperGLUE: A more challenging version of GLUE intended to push language models to their limits.
- Dataset Attributes: Includes more complex reasoning tasks over multiple domains, emphasizing inference, logic, and common sense. Tasks include Boolean Questions (BoolQ), CommitmentBank (CB) for textual entailment, and the Winograd Schema Challenge (WSC) for pronoun resolution.
- Reference: “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”.
MMLU (Massive Multitask Language Understanding): Assesses model performance across a broad range of subjects to test general knowledge and problem solving.
- Dataset Attributes: Covers 57 tasks across subjects like humanities, STEM, and social sciences, requiring broad and specialized knowledge. All questions are four-option multiple choice, with difficulty ranging from elementary to advanced professional level.
- Reference: “Measuring Massive Multitask Language Understanding”.
MMLU-Pro (Massive Multitask Language Understanding Pro): A more robust and challenging successor to MMLU, designed to rigorously benchmark large language models’ capabilities. With 12K complex questions across various disciplines, it raises the number of answer options from 4 to 10, making random guessing less effective. Unlike the original MMLU’s largely knowledge-driven questions, MMLU-Pro focuses on harder, reasoning-based problems, on which chain-of-thought (CoT) prompting can score 20% higher than perplexity-based (PPL) answer selection. Scores are also more stable across prompts: Llama-2-7B’s performance varies by less than 1%, compared to 4-5% on the original MMLU.
- Dataset Attributes: 12K questions with 10 options each. Sources include the original MMLU, STEM websites, TheoremQA, and SciBench. Covers disciplines such as Math, Physics, Chemistry, Law, Engineering, Health, Psychology, Economics, Business, Biology, Philosophy, Computer Science, and History. The dataset emphasizes reasoning and increased problem difficulty, and was manually reviewed by a panel of over ten experts.
- Reference: Hugging Face: MMLU-Pro.
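
To make the effect of moving from 4 to 10 options concrete, the following minimal sketch computes the expected accuracy of uniform random guessing. The function name `chance_accuracy` is hypothetical and used only for illustration.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


mmlu_chance = chance_accuracy(4)       # original MMLU: 4 options per question
mmlu_pro_chance = chance_accuracy(10)  # MMLU-Pro: 10 options per question

print(f"MMLU chance: {mmlu_chance:.0%}, MMLU-Pro chance: {mmlu_pro_chance:.0%}")
```

A model that scores near 25% on MMLU is indistinguishable from guessing, whereas on MMLU-Pro the same floor sits at 10%, which leaves more room to separate weak models from random behavior.
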
BIG-bench (Beyond the Imitation Game Benchmark): A comprehensive benchmark designed to evaluate a wide range of capabilities in language models, from simple tasks to complex reasoning.
- Dataset Attributes: Encompasses over 200 diverse tasks, including arithmetic, common-sense reasoning, language understanding, and more. It is designed to push the boundaries of current LLM capabilities by including both straightforward and highly complex tasks.
- Reference: “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models”.
BIG-bench Hard: A subset of the BIG-bench benchmark focusing specifically on the most challenging tasks.
- Dataset Attributes: Consists of the hardest tasks from the BIG-bench suite, requiring advanced reasoning, problem-solving, and deep understanding. It aims to evaluate models’ performance on tasks that are significantly more difficult than typical benchmarks.
- Reference: “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”.

Reasoning

HellaSwag: A dataset designed to evaluate common-sense reasoning through completion of context-dependent scenarios.
- Dataset Attributes: Challenges models to choose the most plausible continuation among four options, requiring nuanced understanding of everyday activities and scenarios. It features adversarially filtered examples to ensure difficulty and minimize data leakage from pre-training.
- Reference: “HellaSwag: Can a Machine Really Finish Your Sentence?”.
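
One common way to score a four-way continuation task like HellaSwag is to compare the language model's likelihood of each candidate ending. The sketch below assumes the per-token log-probabilities of each ending have already been computed by some model; the function name and the example numbers are hypothetical. It picks the ending with the highest mean log-probability, which is only one of several normalization choices used in practice.

```python
from typing import Sequence


def pick_continuation(option_logprobs: Sequence[Sequence[float]]) -> int:
    """Return the index of the ending with the highest mean
    per-token log-probability (length-normalized likelihood)."""
    if not option_logprobs:
        raise ValueError("need at least one candidate ending")
    scores = []
    for logprobs in option_logprobs:
        if not logprobs:
            raise ValueError("each ending needs at least one token")
        scores.append(sum(logprobs) / len(logprobs))
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical token log-probabilities for four candidate endings.
example = [
    [-2.1, -3.0, -2.5],
    [-0.9, -1.2, -1.0, -1.1],
    [-4.0, -3.5],
    [-1.5, -2.8, -2.2],
]
predicted = pick_continuation(example)
```

Averaging per token rather than summing avoids systematically favoring shorter endings, which would otherwise accumulate less negative log-probability.
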
WinoGrande: A large-scale dataset for evaluating common-sense reasoning through Winograd schema challenges.
- Dataset Attributes: Includes a diverse set of sentences that require resolving ambiguous pronouns, emphasizing subtle distinctions in language understanding. The dataset is designed to address the limitations of smaller Winograd Schema datasets by providing scale and diversity.
- Reference: “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”.
ARC Challenge (ARC-c) and ARC Easy (ARC-e): The AI2 Reasoning Challenge (ARC) tests models on science exam questions, designed to be challenging for AI.
- Dataset Attributes: Comprised of grade-school science questions that demand complex reasoning and understanding, generally challenging for current AI systems. The ARC dataset is split into a challenging set (ARC-c) and an easier set (ARC-e) based on question difficulty.
- Reference: “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”.
OpenBookQA (OBQA): Focuses on science-based question answering that requires both retrieval of relevant facts and reasoning.
- Dataset Attributes: Challenges models to answer questions using both retrieved facts and reasoning, focusing on scientific knowledge. The dataset includes a small “open book” of 1,326 elementary-level science facts to aid in answering the questions.
- Reference: “Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering”.
CommonsenseQA (CQA): A benchmark designed to probe models’ ability to reason about everyday knowledge.
- Dataset Attributes: Focuses on multiple-choice questions that require commonsense to answer, challenging the depth of models’ real-world understanding. The questions are designed to have one correct answer and four distractors, making the task non-trivial.
- Reference: “COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge”.
GPQA (Graduate-Level Google-Proof Q&A): Evaluates models’ ability to answer 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
- Dataset Attributes: Includes high-quality and extremely difficult questions: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”).
- Reference: “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”.

Contextual Comprehension

LAMBADA: Focuses on predicting the last word of a passage, requiring a deep understanding of the context.
- Dataset Attributes: Passages where the last word requires significant contextual understanding, testing language models’ deep comprehension. The passages are drawn from novels and require broad contextual reasoning to accurately predict the final word.
- Reference: “The LAMBADA dataset: Word prediction requiring a broad discourse context”.
BoolQ: A dataset for boolean question answering, focusing on reading comprehension.
- Dataset Attributes: Consists of yes/no questions based on Google search queries and their corresponding Wikipedia articles, requiring binary comprehension of text. The questions are naturally occurring and require understanding the passage to answer correctly.
- Reference: “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”.

General Knowledge and Skills

TriviaQA: A widely used dataset consisting of trivia questions collected from various sources. It evaluates a model’s ability to answer open-domain questions with detailed and accurate responses. The dataset includes a mix of web-scraped and curated questions.
- Dataset Attributes: Contains over 650,000 question-answer-evidence triples built from roughly 95,000 question-answer pairs, covering a broad range of general knowledge topics. The questions are accompanied by independently gathered evidence documents to support answer validation.
- Reference: “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”.
Natural Questions (NQ): Developed by Google, this benchmark consists of real questions posed by users to the Google search engine. It assesses a model’s ability to retrieve and generate accurate answers based on a comprehensive understanding of the query and relevant documents.
- Dataset Attributes: Includes 300,000 training examples with questions and long and short answer annotations, providing a rich resource for training and evaluating LLMs on real-world information retrieval and comprehension. The dataset focuses on long-form answers sourced from Wikipedia.
- Reference: “Natural Questions: a Benchmark for Question Answering Research”.
WebQuestions (WQ): A dataset created to test a model’s ability to answer questions using information found on the web. The questions were obtained via the Google Suggest API, ensuring they reflect genuine user queries.
- Dataset Attributes: Comprises around 6,000 question-answer pairs, with answers derived from Freebase, allowing models to leverage structured knowledge bases to provide accurate responses. The dataset focuses on factual questions requiring specific, often entity-centric answers.
- Reference: “Semantic Parsing on Freebase from Question-Answer Pairs”.
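
Short-answer open-domain QA benchmarks such as the three above are often scored with a normalized exact-match metric. The sketch below shows one widely used normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). It is an illustrative assumption, not the official scorer of any of these datasets, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_rate(predictions: list[str], gold_lists: list[list[str]]) -> float:
    """Fraction of questions answered with an exact match."""
    if len(predictions) != len(gold_lists) or not predictions:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_lists))
    return hits / len(predictions)
```

For example, `exact_match("The Eiffel Tower.", ["Eiffel Tower"])` is true under this normalization, since the article and trailing period are removed before comparison.
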

Specialized Knowledge and Skills

HumanEval: Tests models on generating code snippets to solve programming tasks, evaluating coding abilities.
- Dataset Attributes: Programming problems requiring synthesis of function bodies, testing understanding of code logic and syntax. The dataset consists of prompts and corresponding reference solutions in Python, ensuring a clear standard for evaluation.
- Reference: “Evaluating Large Language Models Trained on Code”.
Physical Interaction Question Answering (PIQA): Evaluates understanding of physical properties through problem-solving scenarios.
- Dataset Attributes: Focuses on questions that require reasoning about everyday physical interactions, pushing models to understand and predict physical outcomes. The scenarios involve practical physical tasks and common sense, making the benchmark unique in testing physical reasoning.
- Reference: “PIQA: Reasoning about Physical Commonsense in Natural Language”.
Social Interaction Question Answering (SIQA): Tests the ability of models to navigate social situations through multiple-choice questions.
- Dataset Attributes: Challenges models with scenarios involving human interactions, requiring understanding of social norms and behaviors. The questions are designed to assess social commonsense reasoning, with multiple plausible answers to evaluate nuanced understanding.
- Reference: “Social IQa: Commonsense Reasoning about Social Interactions”.

Mathematical and Scientific Reasoning

MATH: A comprehensive set of mathematical problems designed to challenge models on various levels of mathematics.
- Dataset Attributes: Contains complex, multi-step competition problems from various branches of mathematics, requiring advanced reasoning and problem-solving skills. Subjects range from algebra and geometry to number theory and counting and probability, and each problem comes with a full step-by-step solution.
- Reference: “Measuring Mathematical Problem Solving With the MATH Dataset”.
GSM8K (Grade School Math 8K): A benchmark for evaluating the reasoning capabilities of models through grade school level math problems.
- Dataset Attributes: Consists of arithmetic and word problems typical of elementary school mathematics, emphasizing logical and numerical reasoning. The dataset aims to test foundational math skills and the ability to apply these skills to solve straightforward problems.
- Reference: “Training Verifiers to Solve Math Word Problems”.
MetaMathQA: A diverse collection of mathematical reasoning questions that aim to evaluate and improve the problem-solving capabilities of models.
- Dataset Attributes: Features a wide range of question types, from elementary to advanced mathematics, emphasizing not only the final answer but also the reasoning process leading to it. The dataset includes step-by-step solutions to foster reasoning and understanding in mathematical problem-solving.
- Reference: “MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models”.
MathVista: A benchmark designed to evaluate mathematical reasoning in visual contexts, which makes it a multimodal benchmark rather than a text-only one.
- Dataset Attributes: Combines mathematical problems with visual inputs such as figures, charts, plots, and diagrams, so that solving a problem requires both visual understanding and multi-step mathematical reasoning. The examples are drawn from existing multimodal datasets together with newly created ones.
- Reference: “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts”.

Instruction Tuning and Evaluation

IFEval: Evaluates instruction following using verifiable instructions, that is, instructions whose compliance can be checked automatically by a program.
- Dataset Attributes: Comprises around 500 prompts, each containing one or more verifiable instructions such as “write in more than 400 words” or “mention the keyword of AI at least 3 times”. Because compliance is checked programmatically, the benchmark does not depend on human or LLM judges.
- Reference: “Instruction-Following Evaluation for Large Language Models”.
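
The IFEval paper centers on instructions that can be verified by code, such as length limits or required keywords. The sketch below illustrates that idea with three toy checkers; the function names and the specific checks are hypothetical and are not IFEval's actual implementation.

```python
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Instruction: 'write at least N words'."""
    return len(response.split()) >= min_words


def check_keyword_count(response: str, keyword: str, min_count: int) -> bool:
    """Instruction: 'mention KEYWORD at least N times' (case-insensitive)."""
    return response.lower().count(keyword.lower()) >= min_count


def check_no_commas(response: str) -> bool:
    """Instruction: 'do not use any commas'."""
    return "," not in response


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A response is compliant only if every attached check passes."""
    return all(check(response) for check in checks)


response = "AI benchmarks track progress and AI benchmarks expose weaknesses"
checks = [
    lambda r: check_min_words(r, 8),
    lambda r: check_keyword_count(r, "AI", 2),
    check_no_commas,
]
compliant = follows_all(response, checks)
```

Because each check is a plain function of the response text, the same prompt set can be rescored deterministically without a judge model.
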
AlpacaEval: An automatic evaluator for instruction-following models that uses a diverse set of tasks and prompts.
- Dataset Attributes: Contains a variety of instruction types ranging from simple tasks to complex multi-step instructions. Responses are scored by an LLM judge and reported as a win rate against a reference model’s outputs.
- Reference: “AlpacaEval: An Automatic Evaluator of Instruction-following Models”.
Arena Hard: An automatically judged benchmark built from challenging real-world user queries.
- Dataset Attributes: Features 500 high-difficulty user prompts selected from Chatbot Arena data, spanning intricate problem-solving, advanced reasoning, and detailed multi-step processes. Model responses are compared against a baseline model by an LLM judge.
- Reference: “From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline”.
Flan: Focuses on evaluating models trained with diverse instruction sets to assess their generalization capabilities.
- Dataset Attributes: Includes a wide array of tasks derived from existing benchmarks and real-world applications. The tasks span multiple domains, requiring models to adapt to various instruction styles and content areas. The benchmark is used to evaluate models trained on instruction tuning with the Flan collection.
- Reference: “Scaling Instruction-Finetuned Language Models”.
Self-Instruct: Evaluates models using a method where the model generates its own instructions and responses, allowing for iterative self-improvement.
- Dataset Attributes: Contains tasks generated by the model itself, covering diverse areas such as common-sense reasoning, factual recall, and open-ended tasks. The benchmark tests the model’s ability to refine its instruction-following capabilities through self-generated data.
- Reference: “Self-Instruct: Aligning Language Models with Self-Generated Instructions”.
Dolly: An instruction-following dataset (databricks-dolly-15k) written by Databricks employees, used for instruction tuning and evaluation with an emphasis on practical utility.
- Dataset Attributes: Includes roughly 15,000 human-written prompt-response pairs across categories such as brainstorming, classification, closed and open question answering, generation, information extraction, and summarization.
- Reference: “Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM”.
OpenAI Codex Evaluations: A benchmark for evaluating instruction-following capabilities specifically in the context of code generation and programming tasks.
- Dataset Attributes: Contains programming challenges that require models to generate code based on natural language instructions. It evaluates the model’s ability to understand and execute programming-related instructions accurately.
- Reference: “Evaluating Large Language Models Trained on Code”.
InstructGPT Benchmarks: Used to evaluate the performance of InstructGPT models, focusing on the ability to follow detailed and complex instructions.
- Dataset Attributes: Encompasses a variety of tasks including creative writing, problem-solving, and detailed explanations. The benchmark aims to assess the alignment of model outputs with user-provided instructions, ensuring the model’s responses are accurate and contextually appropriate.
- Reference: “Training language models to follow instructions with human feedback”.

Multi-Turn Conversation Benchmarks

MT-Bench: A benchmark designed to evaluate chat assistants on multi-turn, open-ended instruction following.
- Dataset Attributes: Contains 80 multi-turn questions in which a second turn follows up on the first, so the model must maintain context and coherence across turns. Responses are graded by a strong LLM judge such as GPT-4.
- Reference: “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”.
MT-Eval: A comprehensive benchmark designed to evaluate multi-turn conversational abilities.
- Dataset Attributes: Based on an analysis of human-LLM conversations, the authors categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. Multi-turn queries for each category are constructed either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors that affect multi-turn abilities, the benchmark also provides single-turn versions of its 1170 multi-turn queries for comparison.
- Reference: “MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models”.
MuTual (Multi-Turn Dialogue Reasoning): A dataset designed for evaluating multi-turn reasoning in dialogue systems.
- Dataset Attributes: Features dialogues from Chinese high school English listening tests, requiring models to select the correct answer from multiple choices based on dialogue context. Emphasizes reasoning over multiple turns to derive the correct conclusion.
- Reference: “MuTual: A Dataset for Multi-Turn Dialogue Reasoning”.
DailyDialog: A high-quality multi-turn dialogue dataset that covers a wide range of everyday topics and scenarios.
- Dataset Attributes: Contains dialogues involving various everyday scenarios, annotated for dialogue act, emotion, and topic. Aimed at training and evaluating models on natural, human-like conversation.
- Reference: “DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset”.
MultiWOZ (Multi-Domain Wizard of Oz): A comprehensive dataset for multi-turn task-oriented dialogues spanning multiple domains.
- Dataset Attributes: Includes dialogues that span multiple domains like booking, travel, and restaurant reservations, annotated with user intents and system responses. Designed to train models for complex dialogue management across different domains.
- Reference: “MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling”.
Taskmaster: A diverse dataset for multi-turn conversations that include both spoken and written interactions.
- Dataset Attributes: Covers several domains with conversations sourced from both human-human and human-machine interactions, providing a rich resource for training dialogue systems on varied conversational data.
- Reference: “Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset”.
Persona-Chat: A dataset focusing on persona-based dialogues to test models’ ability to maintain consistent personality traits across conversations.
- Dataset Attributes: Consists of conversations where each participant has a predefined persona, requiring models to generate responses that are consistent with the given persona traits. Designed to foster more engaging and personalized dialogues.
- Reference: “Personalizing Dialogue Agents: I have a dog, do you have pets too?”.
DialogRE: A large-scale dialog reasoning benchmark focusing on relational extraction from conversations.
- Dataset Attributes: Consists of dialogue instances annotated with relational facts, testing the model’s ability to understand and extract relationships from multi-turn dialogues. Includes a variety of domains and conversational contexts.
- Reference: “Dialogue-Based Relation Extraction”.

Reward Model Evaluation

RewardBench: A benchmark designed to assess reward model capabilities in four categories: Chat, Chat-Hard, Safety, and Reasoning.
- Dataset Attributes: A collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, used to benchmark how reward models perform on challenging, structured, and out-of-distribution queries. It includes comparison sets in which one answer should be preferred to another for subtle but verifiable reasons (e.g., bugs or incorrect facts).
- Reference: “RewardBench: Evaluating Reward Models for Language Modeling”.
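
With prompt-chosen-rejected trios, a natural metric is the fraction of trios where the reward model scores the chosen response above the rejected one. The sketch below is a minimal illustration of that computation; `Trio`, `pairwise_accuracy`, and the toy length-based reward are hypothetical stand-ins, not RewardBench's own code.

```python
from typing import Callable, Iterable, NamedTuple


class Trio(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(
    trios: Iterable[Trio],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of trios where score(prompt, chosen) > score(prompt, rejected)."""
    trios = list(trios)
    if not trios:
        raise ValueError("need at least one trio")
    wins = sum(score(t.prompt, t.chosen) > score(t.prompt, t.rejected) for t in trios)
    return wins / len(trios)


def toy_reward(prompt: str, response: str) -> float:
    """Deliberately naive stand-in reward: longer responses score higher."""
    return float(len(response))


data = [
    Trio("What is 2 + 2?", "The answer is 4.", "5"),
    Trio("Capital of France?", "Paris", "It is definitely Berlin."),
]
accuracy = pairwise_accuracy(data, toy_reward)
```

The toy reward gets the second trio wrong because it prefers the longer but incorrect answer, which is the kind of failure such comparison sets are meant to expose.
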

Medical Benchmarks

In the medical/biomedical field, benchmarks play a critical role in evaluating the ability of AI models to handle domain-specific tasks such as clinical decision support, medical image analysis, and processing of biomedical literature. This section covers common benchmarks in these areas, the attributes of their datasets, and the papers in which they were proposed:

Clinical Decision Support and Patient Outcomes

MIMIC-III (Medical Information Mart for Intensive Care): A widely used dataset comprising de-identified health data associated with over forty thousand patients who stayed in critical care units. This dataset is used for tasks such as predicting patient outcomes, extracting clinical information, and generating clinical notes.
- Dataset Attributes: Includes notes, lab test results, vital signs, medication records, diagnostic codes, and demographic information, requiring comprehensive understanding of medical terminology, clinical narratives, and patient history.
- Reference: “The MIMIC-III Clinical Database”

Biomedical Question Answering

BioASQ: A challenge for testing biomedical semantic indexing and question answering capabilities. The tasks include factoid, list-based, yes/no, and summary questions based on biomedical research articles.
- Dataset Attributes: Questions are crafted from the titles and abstracts of articles in PubMed, challenging models to retrieve and generate precise biomedical information. Includes large-scale training data and evaluation metrics that focus on precision and recall.
- Reference: “BioASQ: A challenge on large-scale biomedical semantic indexing and question answering”
MedQA (USMLE): A question answering benchmark based on the United States Medical Licensing Examination, which assesses a model’s ability to reason with medical knowledge under exam conditions.
- Dataset Attributes: Consists of multiple-choice questions with detailed explanations, reflecting real-world medical licensing exam scenarios that test comprehensive medical knowledge, clinical reasoning, and problem-solving skills.
- Reference: “What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams”
MultiMedQA: A benchmark collection that combines multiple datasets for evaluating question answering across medical exams, medical research, and consumer health.
- Dataset Attributes: Combines six existing medical question answering datasets (MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, and the clinical topics of MMLU) with HealthSearchQA, a dataset of commonly searched consumer health questions. Formats include multiple-choice questions and free-text long-form answers.
- Reference: “Large Language Models Encode Clinical Knowledge”
PubMedQA: A dataset for natural language question answering using abstracts from PubMed as the context, focusing on yes/no/maybe questions.
- Dataset Attributes: Questions derived from PubMed article titles with answers provided in the abstracts, emphasizing models’ ability to extract and verify factual information from scientific texts. Each question is labeled yes, no, or maybe.
- Reference: “PubMedQA: A Dataset for Biomedical Research Question Answering”
MedMCQA: A medical multiple-choice question answering benchmark that evaluates comprehensive understanding and application of medical concepts.
- Dataset Attributes: Features challenging multiple-choice questions that cover a wide range of medical topics, testing not only knowledge but also deep understanding and reasoning skills in medical contexts. Questions are sourced from medical exams and expert annotations.
- Reference: “MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering”

Biomedical Language Understanding

BLUE (Biomedical Language Understanding Evaluation): A benchmark consisting of several diverse biomedical NLP tasks such as named entity recognition, relation extraction, and sentence similarity in the biomedical domain.
- Dataset Attributes: Comprises five tasks across ten datasets drawn from biomedical literature (e.g., PubMed abstracts) and clinical text (e.g., MIMIC-III clinical notes), emphasizing specialized language understanding and entity relations. Tasks are designed to evaluate both generalization and specialization in biomedical contexts.
- Reference: “Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets”

Code LLM Benchmarks

In the domain of code synthesis and understanding, benchmarks play a pivotal role in assessing the performance of Code LLMs. These benchmarks challenge models on various aspects such as code generation, understanding, and debugging. This section covers common benchmarks used for evaluating code LLMs, the attributes of their datasets, and the papers in which they were proposed:

Code Generation and Synthesis

HumanEval: This benchmark is designed to test the ability of language models to generate code. It consists of a set of Python programming problems that require writing function definitions from scratch.
- Dataset Attributes: Includes 164 hand-crafted programming problems covering a range of difficulty levels, requiring understanding of problem statements and generation of functionally correct and efficient code. Problems are evaluated based on correctness and execution results.
- Reference: “Evaluating Large Language Models Trained on Code”
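
The paper cited above reports functional correctness as pass@k: the probability that at least one of k sampled solutions passes the unit tests. Given n samples per problem, of which c pass, the unbiased estimator from that paper is 1 - C(n-c, k) / C(n, k). The sketch below implements that formula directly with exact integer binomials; the function name is an assumption made here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions
    c: number of those samples that pass all tests
    k: budget of samples considered
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail,
    # so any draw of k samples must contain a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3, which matches the intuitive fraction of passing samples when only one attempt is allowed.
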
HumanEval+: An extension of the HumanEval benchmark, from the EvalPlus framework, that keeps the original problems but adds far more test cases so that subtly incorrect solutions are less likely to pass.
- Dataset Attributes: Uses the same 164 Python programming problems as HumanEval, with the test suite expanded roughly 80-fold through automatically generated test inputs. Solutions that pass the original tests but fail on edge cases are counted as incorrect.
- Reference: “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”
Mostly Basic Programming Problems (MBPP): A benchmark consisting of simple Python coding problems intended to evaluate the capabilities of code generation models in solving basic programming tasks.
- Dataset Attributes: Contains 974 Python programming problems, focusing on basic functionalities and common programming tasks that are relatively straightforward to solve. Problems range from simple arithmetic to basic data manipulation and control structures.
- Reference: “Program Synthesis with Large Language Models”
MBPP+: An extension of the MBPP benchmark, built with the same EvalPlus approach as HumanEval+, that re-evaluates solutions against a much larger set of tests.
- Dataset Attributes: Comprises a curated subset of the MBPP problems, each paired with many more test cases than in the original benchmark, so that solutions handling only a narrow set of inputs are rejected.
- Reference: “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”
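
The paper cited above argues that a small test suite can accept incorrect code. The toy example below illustrates the point with a deliberately buggy function that passes two basic tests and fails once an edge case is added. The function, the tests, and the helper `passes` are all hypothetical and are not drawn from MBPP or EvalPlus.

```python
def buggy_max(values):
    """Meant to return the largest element, but wrongly starts from 0,
    so it fails on lists containing only negative numbers."""
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def passes(candidate, tests) -> bool:
    """Run a candidate against (args, expected) pairs; any mismatch
    or exception counts as a failure."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


base_tests = [(([1, 2, 3],), 3), (([5, 4],), 5)]
extended_tests = base_tests + [(([-3, -1, -2],), -1)]

passes_base = passes(buggy_max, base_tests)
passes_extended = passes(buggy_max, extended_tests)
```

Here `passes_base` is true while `passes_extended` is false, so the same model output would be scored as correct or incorrect depending only on how thorough the tests are. Real harnesses also sandbox execution and enforce timeouts, which this sketch omits.
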
SWE-Bench: This benchmark evaluates whether language models can resolve real software engineering issues drawn from open-source repositories.
- Dataset Attributes: Comprises 2,294 task instances built from real GitHub issues and their corresponding pull requests across 12 popular Python repositories. Given a codebase and an issue description, the model must produce a patch, which is verified by running the repository’s tests.
- Reference: “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”
Aider: A benchmark aimed at assessing the capabilities of models in aiding software development by providing intelligent code suggestions and improvements.
- Dataset Attributes: Includes a variety of real-world coding scenarios where models are evaluated based on their ability to offer meaningful code suggestions, improvements, and refactoring options. The dataset spans multiple programming languages and development contexts.
- Reference: Paul Gauthier’s GitHub
MultiPL-E: A benchmark designed to evaluate the performance of language models across multiple programming languages.
- Dataset Attributes: Contains a variety of programming problems that are translated into several programming languages including Python, JavaScript, Java, and C++. The benchmark tests the model’s ability to understand and generate code in different syntaxes and paradigms.
- Reference: “MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation”
BigCodeBench: A benchmark to evaluate LLMs on challenging and complex coding tasks focused on realistic, function-level tasks that require the use of diverse libraries and complex reasoning.
- Dataset Attributes: Contains 1,140 tasks with an average of 5.6 test cases each, covering 139 Python libraries. Evaluation uses Pass@1 with greedy decoding, along with an Elo rating. Tasks are created in a three-stage process that includes synthetic data generation and cross-validation by humans. At the time of writing, the best model is GPT-4 with 61.1%, followed by DeepSeek-Coder-V2, which is the best open model at 59.7% and scores higher than Claude 3 Opus or Gemini. An evaluation framework and Docker images are available for easy reproduction, and the authors plan to extend the benchmark to multilingual settings.
- Reference: Code; Blog; Leaderboard
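
The entry above mentions an Elo rating alongside Pass@1. How BigCodeBench derives its ratings is not described here. As rough intuition only, the textbook Elo update after a single head-to-head comparison between two models looks like the sketch below. The K-factor of 32 and the starting ratings are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,
) -> tuple[float, float]:
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Ratings are zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; model A solves a task that model B fails.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Unlike a raw pass rate, a rating of this kind depends on which opponents a model was compared against, which is why it is usually reported next to Pass@1 rather than in place of it.
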

Code Debugging and Error Detection

DS-1000: A benchmark for data science code generation in Python, built from real-world questions.
- Dataset Attributes: Comprises 1000 problems sourced from StackOverflow, spanning seven widely used Python libraries such as NumPy and Pandas. Solutions are evaluated by executing them against test cases, and problems are slightly modified from their original form to reduce the risk of memorization.
- Reference: “DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation”
LiveCodeBench: A benchmark that continuously collects newly published competitive programming problems, so that models can be evaluated on problems released after their training cutoff.
- Dataset Attributes: Draws problems from LeetCode, AtCoder, and Codeforces, each tagged with its release date. Beyond code generation, it evaluates self-repair, code execution, and test output prediction.
- Reference: “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”
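
The "contamination free" idea in the paper title can be illustrated by scoring a model only on problems published after its training cutoff. The following is a minimal sketch, assuming each problem carries a release date. The `Problem` record, its field names, and the dates are hypothetical and do not reflect LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str
    release_date: date  # date the problem was first published

def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]

problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2023, 11, 15)),
    Problem("p3", date(2024, 2, 3)),
]
fresh = uncontaminated(problems, training_cutoff=date(2023, 9, 30))
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```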

Comprehensive Code Understanding and Multi-language Evaluation

CodeXGLUE: A comprehensive benchmark that includes multiple tasks like code completion, code translation, and code repair across various programming languages.
- Dataset Attributes: Encompasses a range of programming challenges and languages, providing a broad assessment of models’ code understanding and generation across different contexts. The benchmark includes tasks for code summarization, code search, and clone detection, covering languages like Python, Java, and more.
- Reference: “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation”

Algorithmic Problem Solving

LeetCode Problems: A widely used benchmark for algorithmic problem solving, offering a comprehensive set of problems that test various algorithmic and data structure concepts.
- Dataset Attributes: Features thousands of problems across different categories such as arrays, linked lists, dynamic programming, and more. Problems range from easy to hard, providing a robust platform for evaluating algorithmic problem-solving skills.
- Reference: The LeetCode Solution Dataset on Kaggle
Codeforces Problems: This benchmark includes competitive programming problems from Codeforces, a platform known for its challenging contests and diverse problem sets.
- Dataset Attributes: Contains problems that are designed to test deep algorithmic understanding and optimization skills. The problems vary in difficulty and cover a wide range of topics including graph theory, combinatorics, and computational geometry.
- Reference: “Competition-Level Problems are Effective LLM Evaluators”

Vision-Language Models (VLMs)

General Benchmarks

VLMs are pivotal in AI research as they combine visual data with linguistic elements, offering insights into how machines can interpret and generate human-like responses based on visual inputs. This section delves into key benchmarks that test these hybrid capabilities:

Visual Question Answering

Visual Question Answering (VQA) and VQAv2: Requires models to answer questions about images, testing both visual comprehension and language processing.
- Dataset Attributes: Combines real and abstract images with questions that require understanding of object properties, spatial relationships, and activities. VQA includes open-ended questions, while VQAv2 provides a balanced dataset to reduce language biases.
- Reference: “VQA: Visual Question Answering” and its subsequent updates.
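
VQA answers are usually scored with a consensus metric: a predicted answer counts as fully correct if at least three of the ten human annotators gave it. The following is a minimal sketch of that rule, assuming answers are compared after lowercasing and trimming whitespace only. The official evaluation script applies further answer normalization and averages over annotator subsets, both omitted here, and the function name is illustrative.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Consensus accuracy: min(1, number of matching human answers / 3)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(1.0, matches / 3)

# Hypothetical annotations for the question "What color is the car?"
humans = ["red", "red", "dark red", "red", "maroon",
          "red", "red", "Red", "crimson", "red"]
print(vqa_accuracy("red", humans))     # 1.0 (seven annotators agree)
print(vqa_accuracy("maroon", humans))  # partial credit for a single match
```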
TextVQA: Focuses on models’ ability to answer questions based on textual information found within images, testing the intersection of visual and textual understanding.
- Dataset Attributes: Comprises images containing text in various forms, such as signs, documents, and advertisements. The questions require models to read and comprehend the text within the image to provide accurate answers. The dataset includes a diverse set of images and questions to evaluate comprehensive visual-textual reasoning.
- Reference: “Towards VQA Models That Can Read”

Image Captioning

MSCOCO Captions: Models generate captions for images, focusing on accuracy and relevance of the visual descriptions.
- Dataset Attributes: Real-world images with annotations requiring descriptive and detailed captions that cover a broad range of everyday scenes and objects. The dataset includes over 330,000 images with five captions each, emphasizing diversity in descriptions.
- Reference: “Microsoft COCO: Common Objects in Context”

Visual Reasoning

NLVR2 (Natural Language for Visual Reasoning for Real): Evaluates reasoning about the relationship between textual descriptions and image pairs.
- Dataset Attributes: Pairs of photographs with text statements that models must verify, focusing on logical reasoning across visually disparate images. The dataset includes complex visual scenes requiring fine-grained reasoning about relationships and attributes.
- Reference: “A Corpus for Reasoning About Natural Language Grounded in Photographs”
MMBench: Provides a comprehensive evaluation of models’ multimodal understanding across different tasks.
- Dataset Attributes: Consists of multiple-choice questions organized into a hierarchy of fine-grained ability dimensions spanning perception and reasoning. The dataset is designed to challenge models with a wide range of scenarios requiring both linguistic and visual comprehension.
- Reference: “MMBench: Is Your Multi-modal Model an All-around Player?”
MMMU (Massive Multi-discipline Multimodal Understanding): Tests models on college-level problems that require both subject knowledge and deliberate reasoning over images and text.
- Dataset Attributes: Comprises about 11.5K multimodal questions collected from college exams, quizzes, and textbooks, covering six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering) and 30 subjects. The images are highly heterogeneous and include charts, diagrams, maps, tables, music sheets, and chemical structures.
- Reference: “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI”

Video Understanding

Perception Test: A diagnostic benchmark designed to evaluate the perception and reasoning skills of multimodal models on video content.
- Dataset Attributes: Real-world videos filmed to probe skills such as memory, abstraction, physics, and semantics. Annotations support object tracking, point tracking, temporal action localization, temporal sound localization, and both multiple-choice and grounded video question answering.
- Reference: “Perception Test: A Diagnostic Benchmark for Multimodal Video Models”

Medical VLM Benchmarks

Medical VLMs are essential in merging AI’s visual and linguistic analysis for healthcare applications. They are pivotal for developing systems that can interpret complex medical imagery alongside textual data, enhancing diagnostic accuracy and treatment efficiency. This section explores major benchmarks testing these interdisciplinary skills:

Medical Image Annotation and Retrieval

ImageCLEFmed: Part of the ImageCLEF challenge, this benchmark tests image-based information retrieval, automatic annotation, and visual question answering using medical images.
- Dataset Attributes: Contains a wide array of medical imaging types, including radiographs, histopathology images, and MRI scans, necessitating the interpretation of complex visual medical data. Tasks range from multi-label classification to segmentation and retrieval.
- Reference: “ImageCLEF - the CLEF 2009 Cross-Language Image Retrieval Track”

Disease Classification and Detection

CheXpert: A large dataset of chest radiographs for identifying and classifying key thoracic pathologies. This benchmark is often used for tasks that involve reading and interpreting X-ray images.
- Dataset Attributes: Consists of over 200,000 chest radiographs annotated with findings from radiology reports, challenging models to accurately detect and diagnose multiple conditions such as pneumonia, pleural effusion, and cardiomegaly.
- Reference: “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison”
Diabetic Retinopathy Detection: Focused on the classification of retinal images to diagnose diabetic retinopathy, a common cause of vision loss.
- Dataset Attributes: Features high-resolution retinal images, where models need to detect subtle indicators of disease progression, requiring high levels of visual detail recognition. The dataset includes labels for different stages of retinopathy, emphasizing early detection and severity assessment.
- Reference: Diabetic Retinopathy Detection on Kaggle

Common Challenges Across Benchmarks

Generalization: Assessing how well models can generalize from the training data to unseen problems.
Robustness: Evaluating the robustness of models against edge cases and unusual inputs.
Execution Correctness: Beyond generating syntactically correct code, the emphasis is also on whether the code runs correctly and solves the problem as intended.
Bias and Fairness: Ensuring that models do not inherit or perpetuate biases that could impact patient care outcomes, especially given the diversity of patient demographics.
Data Privacy and Security: Addressing concerns related to the handling and processing of sensitive health data in compliance with regulations such as HIPAA.
Domain Specificity: Handling the high complexity of medical and biomedical terminologies and imaging, which requires not only technical accuracy but also clinical relevancy.
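
The execution-correctness challenge listed above can be made concrete by running generated code against test cases rather than only checking that it parses. The following is a minimal sketch, assuming the candidate defines a function with a known name. `passes_tests` and the sample snippets are hypothetical, and real harnesses run untrusted code in a sandbox with time limits, which this sketch does not.

```python
import ast

def passes_tests(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> bool:
    """Return True only if `source` parses, defines `func_name`, and matches every case."""
    try:
        ast.parse(source)            # syntactic check only
    except SyntaxError:
        return False
    namespace: dict = {}
    try:
        exec(source, namespace)      # WARNING: unsandboxed; for trusted toy inputs only
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False                 # runtime errors count as failures

cases = [((2, 3), 5), ((-1, 1), 0)]
correct = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"  # valid syntax, wrong behavior
print(passes_tests(correct, "add", cases), passes_tests(wrong, "add", cases))  # True False
```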

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledLLMVLMBenchmarks,
  title   = {LLM/VLM Benchmarks},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}