Aman's AI Journal • Primers • LLM/VLM Benchmarks

Overview
Large Language Models (LLMs)
Vision-Language Models (VLMs)
- General Benchmarks
- Medical VLM Benchmarks
  - Medical Image Annotation and Retrieval
  - Disease Classification and Detection
Common Challenges Across Benchmarks
Citation

Overview

Large Language Models (LLMs) and Vision-and-Language Models (VLMs) are evaluated across a wide array of benchmarks, which test their abilities in language understanding, reasoning, coding, and multimedia understanding (in case of VLMs).
These benchmarks are crucial for the development of AI models as they provide standardized challenges that help identify both strengths and weaknesses, driving improvements in future iterations.
This primer offers an overview of these benchmarks, attributes of their datasets, and relevant papers.

Large Language Models (LLMs)

General Benchmarks

Language Understanding and Reasoning

GLUE (General Language Understanding Evaluation): A set of nine tasks including question answering and textual entailment, designed to gauge general language understanding. Proposed in “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”.
- Dataset Attributes: Diverse text genres from web text, fiction, and non-fiction, requiring models to handle a variety of language styles and complexities.
SuperGLUE: A more challenging version of GLUE intended to push language models to their limits. Introduced in “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”.
- Dataset Attributes: Includes more complex reasoning tasks over multiple domains, emphasizing inference, logic, and common sense.
HellaSwag: A dataset designed to evaluate common sense reasoning through completion of context-dependent scenarios. Introduced in “HellaSwag: Can a Machine Really Finish Your Sentence?”.
- Dataset Attributes: Challenges models to choose the most plausible continuation among four options, requiring nuanced understanding of everyday activities and scenarios.
WinoGrande: A large-scale dataset for evaluating common sense reasoning through winograd schema challenges. Described in “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”.
- Dataset Attributes: Includes a diverse set of sentences that require resolving ambiguous pronouns, emphasizing subtle distinctions in language understanding.
ARC Challenge (ARC-c) and ARC Easy (ARC-e): Tests models on science exam questions, designed to be challenging for AI. Presented in “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”.
- Dataset Attributes: Comprised of grade-school science questions that demand complex reasoning and understanding, generally challenging for current AI systems.
TriviaQA: A large-scale dataset consisting of trivia questions, designed to evaluate a wide range of knowledge and inference abilities. Proposed in “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”.
- Dataset Attributes: Includes over 650,000 question-answer pairs, requiring models to retrieve and infer answers from large contexts.
Natural Questions (NQ): A benchmark for evaluating models on real user questions from Google search. Described in “Natural Questions: a Benchmark for Question Answering Research”.
- Dataset Attributes: Consists of questions from actual Google searches, paired with answers that require understanding and analyzing natural, real-world language use.
OpenBookQA (OBQA): Focuses on science-based question answering that requires both retrieval of relevant facts and reasoning. Proposed in “OpenBookQA: A New Dataset for Open Book Question Answering”.
- Dataset Attributes: Challenges models to answer questions using both retrieved facts and reasoning, focusing on scientific knowledge.

Contextual Comprehension

LAMBADA: Focuses on predicting the last word of a passage, requiring a deep understanding of the context. Described in “The LAMBADA dataset: Word prediction requiring a broad discourse context”.
- Dataset Attributes: Passages where the last word requires significant contextual understanding, testing language models’ deep comprehension.
BoolQ: A dataset for boolean question answering, focusing on reading comprehension. Introduced in “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”.
- Dataset Attributes: Consists of yes/no questions based on Google search queries and their corresponding Wikipedia articles, requiring binary comprehension of text.

Knowledge and Reasoning

CommonsenseQA (CQA): A benchmark designed to probe models’ ability to reason about everyday knowledge. Presented in “COMMONSENSEQA: A Question Answering Challenge Targeting Commonsense Knowledge”.
- Dataset Attributes: Focuses on multiple-choice questions that require commonsense to answer, challenging the depth of models’ real-world understanding.

Specialized Knowledge and Skills

MMLU (Massive Multitask Language Understanding): Assesses model performance across a broad range of subjects and task formats to test general knowledge. Introduced in “Massive Multitask Language Understanding: A Benchmark for Measuring General Linguistic Intelligence”.
- Dataset Attributes: Covers 57 tasks across subjects like humanities, STEM, and social sciences, requiring broad and specialized knowledge.
MMLU-Pro (Massive Multitask Language Understanding Pro): A robust and challenging dataset designed to rigorously benchmark large language models’ capabilities. With 12K complex questions across various disciplines, it enhances evaluation complexity and model robustness by increasing options from 4 to 10, making random guessing less effective. Unlike the original MMLU’s knowledge-driven questions, MMLU-Pro focuses on more difficult, reasoning-based problems, where chain-of-thought (CoT) results can be 20% higher than perplexity (PPL). This increased difficulty results in more consistent model performance, as seen with Llama-2-7B’s variance of within 1%, compared to 4-5% in the original MMLU. Introduced in Hugging Face: MMLU-Pro.
- Dataset Attributes: 12K questions with 10 options each. Sources include Original MMLU, STEM websites, TheoremQA, and SciBench. Covers disciplines such as Math, Physics, Chemistry, Law, Engineering, Health, Psychology, Economics, Business, Biology, Philosophy, Computer Science, and History. Focus on reasoning, increased problem difficulty, and manual expert review by a panel of over ten experts.
GSM8K (Grade School Math 8K): A benchmark for evaluating the reasoning capabilities of models through grade school level math problems. Found in “Can Language Models do Grade School Math?”.
- Dataset Attributes: Consists of arithmetic and word problems typical of elementary school mathematics, emphasizing logical and numerical reasoning.
HumanEval: Tests models on generating code snippets to solve programming tasks, evaluating coding abilities. Proposed in “Evaluating Large Language Models Trained on Code”.
- Dataset Attributes: Programming problems requiring synthesis of function bodies, testing understanding of code logic and syntax.
Physical Interaction Question Answering (PIQA): Evaluates understanding of physical properties through problem-solving scenarios. Introduced in “PIQA: Reasoning about Physical Commonsense in Natural Language”.
- Dataset Attributes: Focuses on questions that require reasoning about everyday physical interactions, pushing models to understand and predict physical outcomes.
Social Interaction Question Answering (SIQA): Tests the ability of models to navigate social situations through multiple-choice questions. Described in “Social IQa: Commonsense Reasoning about Social Interactions”.
- Dataset Attributes: Challenges models with scenarios involving human interactions, requiring understanding of social norms and behaviors.

Mathematical and Scientific Reasoning

MATH: A comprehensive set of mathematical problems designed to challenge models on various levels of mathematics. Proposed in “Measuring Mathematical Problem Solving With the MATH Dataset”.
- Dataset Attributes: Contains complex, multi-step mathematical problems from various branches of mathematics, requiring advanced reasoning and problem-solving skills.

Medical Benchmarks

In the medical/biomedical field, benchmarks play a critical role in evaluating the ability of AI models to handle domain-specific tasks such as clinical decision support, medical image analysis, and processing of biomedical literature. Here’s an expanded overview of common benchmarks in these areas, including additional benchmarks and the attributes of their datasets, along with references to the original papers where these benchmarks were proposed: Here’s the grouping of the medical benchmarks based on the specific tasks and challenges they address within the medical domain:

Clinical Decision Support and Patient Outcomes

MIMIC-III (Medical Information Mart for Intensive Care): A widely used dataset comprising de-identified health data associated with over forty thousand patients who stayed in critical care units. This dataset is used for tasks such as predicting patient outcomes, extracting clinical information, and generating clinical notes.
- Dataset Attributes: Includes notes, lab test results, vital signs, and more, requiring understanding of medical terminology and clinical narratives.
- Reference: “The MIMIC-III Clinical Database”

Biomedical Question Answering

BioASQ: A challenge for testing biomedical semantic indexing and question answering capabilities. The tasks include factoid, list-based, yes/no, and summary questions based on biomedical research articles.
- Dataset Attributes: Questions are crafted from the titles and abstracts of articles in PubMed, challenging models to retrieve and generate precise biomedical information.
- Reference: “BioASQ: A challenge on large-scale biomedical semantic indexing and question answering”
MedQA (USMLE): A question answering benchmark based on the United States Medical Licensing Examination, which assesses a model’s ability to reason with medical knowledge under exam conditions.
- Dataset Attributes: Consists of multiple-choice questions with detailed explanations, reflecting real-world medical licensing exam scenarios that test comprehensive medical knowledge.
- Reference: “MedQA: Large-scale Medical Question Answering”
MultiMedQA: A benchmark collection that integrates multiple datasets for evaluating question answering across various medical fields, including consumer health, clinical medicine, and genetics.
- Dataset Attributes: Incorporates questions from several sources, requiring broad and deep medical knowledge across diverse sub-disciplines.
- Reference: “MultiMedQA: Large-scale Multi-domain Medical Question Answering”
PubMedQA: A dataset for natural language question answering using abstracts from PubMed as the context, focusing on yes/no questions.
- Dataset Attributes: Questions derived from PubMed article titles with answers provided in the abstracts, emphasizing models’ ability to extract and verify factual information from scientific texts.
- Reference: “PubMedQA: A Dataset for Biomedical Research Question Answering”
MedMCQA: A medical multiple-choice question answering benchmark that evaluates comprehensive understanding and application of medical concepts.
- Dataset Attributes: Features challenging multiple-choice questions that cover a wide range of medical topics, testing not only knowledge but also deep understanding and reasoning skills in medical contexts.
- Reference: “MedMCQA: A Large-scale Multi-domain Clinical Question Answering Dataset”

Biomedical Language Understanding

BLUE (Biomedical Language Understanding Evaluation): A benchmark consisting of several diverse biomedical NLP tasks such as named entity recognition, relation extraction, and sentence similarity in the biomedical domain.
- Dataset Attributes: Utilizes various biomedical corpora, including PubMed abstracts and clinical trial reports, emphasizing specialized language understanding and entity relations.
- Reference: “BLUE: The Biomedical Language Understanding Evaluation Benchmark”

Code LLM Benchmarks

In the domain of code synthesis and understanding, benchmarks play a pivotal role in assessing the performance of Code LLMs. These benchmarks challenge models on various aspects such as code generation, understanding, and debugging. Here’s a detailed overview of common benchmarks used for evaluating code LLMs, including the attributes of their datasets and references to the original papers where these benchmarks were proposed:

Code Generation and Synthesis

HumanEval: This benchmark is designed to test the ability of language models to generate code. It consists of a set of Python programming problems that require writing function definitions from scratch.
- Dataset Attributes: Includes 164 hand-crafted programming problems covering a range of difficulty levels, requiring understanding of problem statements and generation of functionally correct code.
- Reference: “Evaluating Large Language Models Trained on Code”
MBPP (Mostly Basic Python Problems): A benchmark consisting of simple Python coding problems intended to evaluate the capabilities of code generation models in solving basic programming tasks.
- Dataset Attributes: Contains 974 Python programming problems, focusing on basic functionalities and common programming tasks that are relatively straightforward to solve.
- Reference: “Evaluating Models’ Local Decision Boundaries via Contrast Sets”

Code Debugging and Error Detection

DS-1000 (DeepSource Python Bugs Dataset): This dataset is used to evaluate the ability of models to detect bugs in Python code. It includes a diverse set of real-world bugs.
- Dataset Attributes: Comprises 1000 annotated Python functions with detailed bug annotations, testing models on their ability to identify and understand common coding errors.
- Reference: [No direct paper; part of DeepSource tooling offerings, information can be typically found on the DeepSource website or technical blogs associated with the tool.]

Comprehensive Code Understanding and Multi-language Evaluation

CodeXGLUE: A comprehensive benchmark that includes multiple tasks like code completion, code translation, and code repair across various programming languages.
- Dataset Attributes: Encompasses a range of programming challenges and languages, providing a broad assessment of models’ code understanding and generation across different contexts.
- Reference: “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation”

Algorithmic Problem Solving

APP (Algorithms Problems from Programming): Focuses on algorithmic problem-solving skills by presenting problems typically found in programming competitions.
- Dataset Attributes: Challenges models with complex algorithmic questions that require not only the correct code but also efficiency and optimization.
- Reference: [Specific paper might not exist; generally sourced from programming contest sites and similar venues.]

Vision-Language Models (VLMs)

General Benchmarks

VLMs are pivotal in AI research as they combine visual data with linguistic elements, offering insights into how machines can interpret and generate human-like responses based on visual inputs. This section delves into key benchmarks that test these hybrid capabilities:

Visual Question Answering

Visual Question Answering (VQA) and VQAv2: Requires models to answer questions about images, testing both visual comprehension and language processing. Described in “VQA: Visual Question Answering” and its subsequent updates.
- Dataset Attributes: Combines real and abstract images with questions that require understanding of object properties, spatial relationships, and activities.

Image Captioning

MSCOCO Captions: Models generate captions for images, focusing on accuracy and relevance of the visual descriptions. Introduced in “Microsoft COCO: Common Objects in Context”.
- Dataset Attributes: Real-world images with annotations requiring descriptive and detailed captions that cover a broad range of everyday scenes and objects.

Visual Reasoning

NLVR2 (Natural Language for Visual Reasoning for Real): Evaluates reasoning about the relationship between textual descriptions and image pairs. Proposed in “A Corpus for Reasoning About Natural Language Grounded in Photographs”.
- Dataset Attributes: Pairs of photographs with text statements that models must verify, focusing on logical reasoning across visually disparate images.
MMMU (MultiModal MultiTask Understanding): Tests models’ ability to understand and generate responses based on both visual and textual stimuli. Introduced in the paper [Title not available].
- Dataset Attributes: Involves tasks like visual question answering, image captioning, and visual reasoning, testing both visual and textual understanding.

Video Understanding

Perception Test: A benchmark designed to evaluate models on understanding and interpreting video content. Detailed in “Perception Test: Benchmark for Autonomous Vehicle Perception”.
- Dataset Attributes: Video sequences requiring models to interpret dynamic scenes, focusing on object detection, movement prediction, and scene classification.

Medical VLM Benchmarks

Medical VLMs are essential in merging AI’s visual and linguistic analysis for healthcare applications. They are pivotal for developing systems that can interpret complex medical imagery alongside textual data, enhancing diagnostic accuracy and treatment efficiency. This section explores major benchmarks testing these interdisciplinary skills:

Medical Image Annotation and Retrieval

ImageCLEFmed: Part of the ImageCLEF challenge, this benchmark tests image-based information retrieval, automatic annotation, and visual question answering using medical images.
- Dataset Attributes: Contains a wide array of medical imaging types, including radiographs and histopathology images, necessitating the interpretation of complex visual medical data.
- Reference: “ImageCLEF - the CLEF 2009 Cross-Language Image Retrieval Track”

Disease Classification and Detection

CheXpert: A large dataset of chest radiographs for identifying and classifying key thoracic pathologies. This benchmark is often used for tasks that involve reading and interpreting X-ray images.
- Dataset Attributes: Consists of over 200,000 chest radiographs annotated with findings from radiology reports, challenging models to accurately detect and diagnose multiple conditions.
- Reference: “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison”
Diabetic Retinopathy Detection: Focused on the classification of retinal images to diagnose diabetic retinopathy, a common cause of vision loss.
- Dataset Attributes: Features high-resolution retinal images, where models need to detect subtle indicators of disease progression, requiring high levels of visual detail recognition.
- Reference: [Typically linked with the Kaggle competition for diabetic retinopathy detection, more formal academic references may vary.]

Common Challenges Across Benchmarks

Generalization: Assessing how well models can generalize from the training data to unseen problems.
Robustness: Evaluating the robustness of models against edge cases and unusual inputs.
Execution Correctness: Beyond generating syntactically correct code, the emphasis is also on whether the code runs correctly and solves the problem as intended.
Bias and Fairness: Ensuring that models do not inherit or perpetuate biases that could impact patient care outcomes, especially given the diversity of patient demographics.
Data Privacy and Security: Addressing concerns related to the handling and processing of sensitive health data in compliance with regulations such as HIPAA.
Domain Specificity: Handling the high complexity of medical and biomedical terminologies and imaging, which requires not only technical accuracy but also clinical relevancy.

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledLLMVLMBenchmarks,
  title   = {LLM/VLM Benchmarks},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}