Papers List

  • A curated set of papers I’ve reviewed for my latest deep dive into AI/ML.

Seminal Papers

Vision

2010

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
  • This paper by Gutmann and Hyvarinen in 2010 introduced the concept of negative sampling that forms the basis of contrastive learning. They proposed a new estimation principle, noise-contrastive estimation, which performs nonlinear logistic regression to discriminate between observed data and artificially generated noise.
  • They apply the method to the modeling of natural images and show that the method can successfully estimate a large-scale two-layer model and a Markov random field.

2012

ImageNet Classification with Deep Convolutional Neural Networks
  • The original AlexNet paper by Krizhevsky et al. from NeurIPS 2012 that started it all. This trail-blazer introduced Deep Learning to the world :)
3D Convolutional Neural Networks for Human Action Recognition
  • This paper by Ji et al. from Arizona State University in IEEE PAMI 2012 introduced 3D CNNs.

2013

Visualizing and Understanding Convolutional Networks
  • This legendary paper by Zeiler and Fergus from the Courant Institute, NYU in 2013 seeks to demystify two questions: why do CNNs perform so well on image classification, and how might they be improved?
  • They introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
  • They also perform an ablation study to discover the performance contribution from different model layers. This enables them to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
  • They show their ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the then-current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

2015

Very Deep Convolutional Networks for Large-Scale Image Recognition
  • This paper by Simonyan and Zisserman from DeepMind and Oxford in ICLR 2015 proposed the VGG architecture. They showed that a significant performance improvement can be achieved by pushing the depth to 16-19 weight layers, i.e., VGG-16 and VGG-19.
  • The main principle is that a stack of \(3 \times 3\) convolution filters is better than a single \(7 \times 7\) layer. Firstly, because the stack uses three non-linear activations (instead of one), which makes the function more discriminative. Secondly, the \(3 \times 3\) design decreases the number of parameters while covering the same \(7 \times 7\) receptive field – specifically, you need \(3 \times (3^2)C^2 = 27C^2\) weights, compared to a \(7 \times 7\) conv layer which would require \(1 \times (7^2)C^2 = 49C^2\) parameters (81% more).
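  • To make the parameter comparison concrete, here is a quick check, assuming \(C\) input and output channels for every layer (a simplification of the actual VGG configurations):

```python
# Weight count (biases ignored) for stacked conv layers with C input/output channels each.
def conv_params(kernel_size: int, channels: int, num_layers: int) -> int:
    return num_layers * (kernel_size ** 2) * channels * channels

C = 256
stack_3x3 = conv_params(kernel_size=3, channels=C, num_layers=3)   # 27 * C^2
single_7x7 = conv_params(kernel_size=7, channels=C, num_layers=1)  # 49 * C^2

print(stack_3x3, single_7x7)                # 1769472 3211264
print(f"{single_7x7 / stack_3x3 - 1:.0%}")  # ~81% more parameters for the single 7x7 layer
```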
Going Deeper with Convolutions
  • This paper by Szegedy et al. from Google in CVPR 2015 introduced the Inception (also known as GoogLeNet or InceptionNet) architecture which achieved state of the art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.
  • Ideas from the paper:
    • Increasing the depth (number of layers) is not the only way to make a model bigger. What about increasing both the depth and width of the network while keeping computation at a manageable level? This time the inspiration comes from the human visual system, wherein information is processed at multiple scales and then aggregated locally. How do you achieve this without a memory explosion? The answer is \(1 \times 1\) convolutions! Their main purpose is channel dimensionality reduction: \(1 \times 1\) convolutions are used to shrink the number of channels before the computationally expensive \(3 \times 3\) and \(5 \times 5\) convolutions. Inception uses convolutions of different kernel sizes (\(5 \times 5\), \(3 \times 3\), \(1 \times 1\)) to capture details at multiple scales (see the sketch after this list).
    • To enable concatenation of features convolved with different kernels, they pad the output so it has the same size as the input. For single-stride convolutions without dilation, the padding \(p\) that keeps \(out = in\) (i.e., input and output have the same spatial dimensions) for kernel size \(k\) is \(p = (k-1)/2\), since \(out = in + 2p - k + 1\).
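  • Here is a minimal Inception-style block in PyTorch illustrating both ideas above (1×1 reductions before the expensive convolutions, and same-padding so the branches can be concatenated); the channel counts are illustrative rather than the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 branches plus pooling, with 1x1 convolutions used as
    channel reductions before the expensive convolutions. Padding p = (k - 1) // 2 keeps
    the spatial size unchanged so the branch outputs can be concatenated."""
    def __init__(self, in_ch, b1=64, b3_red=96, b3=128, b5_red=16, b5=32, pool_proj=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, b1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3_red, kernel_size=1),          # channel reduction
            nn.Conv2d(b3_red, b3, kernel_size=3, padding=1),  # p = (3 - 1) // 2 = 1
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5_red, kernel_size=1),          # channel reduction
            nn.Conv2d(b5_red, b5, kernel_size=5, padding=2),  # p = (5 - 1) // 2 = 2
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along the channel dimension

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]) = 64 + 128 + 32 + 32 channels
```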
FaceNet: A Unified Embedding for Face Recognition and Clustering
  • This paper by Schroff et al. from Google in 2015 proposes FaceNet, a system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
  • Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, they use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of their approach is much greater representational efficiency: they achieve state-of-the-art face recognition performance using only 128 bytes per face.
  • Previous face recognition approaches based on deep networks use a classification layer trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network. In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Their triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin.
  • Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, they present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, they also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
  • The triplet loss minimizes the L2 distance between faces of the same identity and enforces a margin between the distances of faces of different identities, i.e., a relative distance constraint. Specifically, the triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Thus, the network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances (a minimal sketch of this loss is given at the end of this entry). Once this embedding has been produced, downstream tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
  • On the widely used Labeled Faces in the Wild (LFW) dataset, their system achieves a new record accuracy of 99.63% (and 95.12% on the YouTube Faces DB), cutting the error rate in comparison to the best published results on both datasets by 30%.
  • They explore two different deep convolutional network architectures that have been recently used to great success in the computer vision community. The first architecture is based on the Zeiler & Fergus model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses which reduces the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
  • They also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison.
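  • A minimal sketch of the triplet loss described above, assuming the embeddings are already L2-normalized (as in the paper); the margin value here is illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L(a, p, n) = max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 distance to the positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2 distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()

# Toy example with 128-D embeddings normalized onto the unit hypersphere.
a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))
```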
Distilling the Knowledge in a Neural Network
  • This paper by Hinton et al. from Google in NeurIPS 2014 starts from the observation that a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
  • Caruana et al. have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and the authors develop this approach further using a different compression technique. They achieve some surprising results on MNIST and show that they can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. They also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. This shows that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.
  • The results show that on MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, they have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
  • For really big neural networks, it can be infeasible even to train a full ensemble, but they show that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.
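  • A minimal sketch of the distillation objective: a temperature-softened KL term against the teacher's logits plus the usual cross-entropy on the hard labels (the temperature and weighting below are illustrative, not the paper's tuned values):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target term (scaled by T^2, as suggested in the paper) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(16, 10)
teacher = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(distillation_loss(student, teacher, labels))
```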

2016

Rethinking the Inception Architecture for Computer Vision
  • This paper by Szegedy et al. from Google in CVPR 2016 proposed Inception-v2 and Inception-v3, improving the Inception model based on the following principles:
    • Using the same principle as VGG, the authors factorized \(5 \times 5\) and \(7 \times 7\) (in Inception-v3) convolutions into two and three sequential \(3 \times 3\) convolutions respectively. This improves computational speed and uses far fewer parameters (see the sketch after this list).
    • Used spatially separable convolutions. Simply, a \(3 \times 3\) kernel is decomposed into two smaller ones: a \(1 \times 3\) and a \(3 \times 1\) kernel, which are applied sequentially.
    • Widened the Inception modules (more filters).
    • Distributed the computational budget in a balanced way between the depth and width of the network.
    • Added batch normalization.
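  • The two factorizations above can be written down directly; a hedged sketch comparing a \(5 \times 5\) convolution with two stacked \(3 \times 3\) convolutions, and a \(3 \times 3\) convolution with the spatially separable \(1 \times 3\) / \(3 \times 1\) pair (the channel count is illustrative):

```python
import torch.nn as nn

C = 64

# 5x5 factorized into two 3x3 convolutions (same 5x5 receptive field, fewer weights).
conv5x5 = nn.Conv2d(C, C, 5, padding=2)                       # 25 * C^2 weights
two_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, padding=1),
                        nn.Conv2d(C, C, 3, padding=1))        # 18 * C^2 weights

# 3x3 factorized into spatially separable 1x3 then 3x1 convolutions.
sep_3x3 = nn.Sequential(nn.Conv2d(C, C, (1, 3), padding=(0, 1)),
                        nn.Conv2d(C, C, (3, 1), padding=(1, 0)))  # 6 * C^2 weights

def weight_count(module):
    return sum(p.numel() for p in module.parameters() if p.dim() > 1)  # ignore biases

print(weight_count(conv5x5), weight_count(two_3x3), weight_count(sep_3x3))  # 102400 73728 24576
```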
Deep Residual Learning for Image Recognition
  • The ResNet paper by He et al. from Microsoft Research in CVPR 2016; one of the most cited papers across all of AI.
  • The issue of vanishing gradients when training a deep neural network was addressed with two tricks:
    • Batch normalization and,
    • Short skip connections
  • Instead of \(H(x) = F(x)\), the skip connection leads to \(H(x) = F(x) + x\), which implies that the model is learning the difference (i.e., residual), \(F(x) = H(x) - x\).
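  • A minimal residual-block sketch showing the \(H(x) = F(x) + x\) identity shortcut (a simplified basic block, not the exact bottleneck block used in ResNet-50 and deeper variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv + batch-norm layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # the skip connection: the block learns F(x) = H(x) - x

print(BasicResidualBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```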

2017

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
  • This paper by Szegedy et al. from Google in AAAI 2017 introduced two further versions of the Inception family – Inception-v4 and Inception-ResNet.
Photo-Realistic Single Image Super-Resolution using a GAN
  • This paper by Ledig et al. from Twitter in CVPR 2017 applied GANs to single image super-resolution (SISR), proposing SRGAN.

2018

From Recognition to Cognition: Visual Commonsense Reasoning
  • Visual understanding goes well beyond object recognition. With one glance at an image, humans can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world.
  • This paper by Zellers et al. from UWash in CVPR 2019 formalizes this task as Visual Commonsense Reasoning (VCR). Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
  • Next, they introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
  • To move towards cognition-level understanding, they present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and they provide analysis that suggests avenues for future work.
  • Website with models/datasets.
Focal Loss for Dense Object Detection
  • The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.
  • This paper by Lin et al. from Facebook AI Research in 2017 investigates why this is the case and introduces the focal loss. They discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples.
  • Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
  • Their novel focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of the loss, they design and train a simple dense detector called RetinaNet.
  • Their results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
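  • A minimal sketch of the focal loss for binary classification with logits; the \(\gamma = 2\) and \(\alpha = 0.25\) values are the paper's defaults, but the anchor-level normalization used in RetinaNet is omitted:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for binary targets in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # modulating factor decays for easy examples

logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()
print(focal_loss(logits, targets))
```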
Relational inductive biases, deep learning, and graph networks
  • Recent advances in AI, propelled by deep learning, have been transformative across many important domains. Despite this, a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.
  • This paper by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in a table in the paper.

  • They argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and advocate for marrying complementary approaches which draw on ideas from human cognition, traditional computer science, standard engineering practice, and modern deep learning. Just as biology uses nature and nurture cooperatively, they reject the false choice between “hand-engineering” and “end-to-end” learning, and instead advocate for an approach which benefits from their complementary strengths.
  • They investigate how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them.
  • They explore flexible learning-based approaches which implement strong relational inductive biases to capitalize on explicitly structured representations and computations, and present a new building block for the AI toolkit – the graph neural networks (GNNs).
  • GNNs generalize and extend various approaches for neural networks that operate on graphs, and provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. GNNs are designed to promote building complex architectures using customizable graph-to-graph building blocks, and their relational inductive biases support relational reasoning, combinatorial generalization, and improved sample efficiency over other standard machine learning building blocks. This would help lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
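  • A minimal message-passing sketch in the spirit of the graph-network building block: each node state is updated from its aggregated neighbor messages (this is a generic GNN layer, not the paper's full edge/node/global formulation):

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """h_i' = MLP([h_i, sum_{j in N(i)} h_j]): a generic message-passing update."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, node_feats, edge_index):
        # edge_index: (2, E) tensor of (source, target) index pairs.
        src, dst = edge_index
        messages = torch.zeros_like(node_feats)
        messages.index_add_(0, dst, node_feats[src])          # sum neighbor features into each target
        return self.update(torch.cat([node_feats, messages], dim=1))

nodes = torch.randn(5, 16)                                    # 5 nodes, 16-D features
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])            # 4 directed edges
print(SimpleGNNLayer(16)(nodes, edges).shape)                 # torch.Size([5, 16])
```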

2019

Objects as Points
  • This paper by Zhou et al. from UT Austin in 2019 proposes CenterNet, a center point-based object detection approach, which is end-to-end differentiable, simpler, faster, and more accurate than other competitive bounding box based detectors.
  • CenterNet is an anchorless object detection architecture. As such, this structure has an important advantage in that it replaces the classical NMS (Non Maximum Suppression) step during post-processing. This mechanism enables faster inference.
  • Where most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each, which is wasteful, inefficient, and requires additional post-processing, CenterNet models an object as a single point — the center point of its bounding box. CenterNet object detector builds on successful keypoint estimation networks and uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, depth and extent, and pose in a single forward pass. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection.
  • They compare CenterNet with other state-of-the-art detectors on the COCO test-dev set. With multi-scale evaluation, CenterNet with Hourglass104 achieves an AP of 45.1%, outperforming all existing one-stage detectors. Sophisticated two-stage detectors are more accurate, but also slower.
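  • The NMS-free decoding mentioned above boils down to keeping local maxima of the predicted center heatmap; a hedged sketch of this peak-extraction step (a 3×3 max-pool keeps a location only if it is the maximum of its neighborhood; the heatmap here is a random placeholder):

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=10):
    """Keep local maxima of a (C, H, W) center heatmap and return the top-k scores and flat indices."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (pooled == heatmap).float()     # zero out non-maximum locations
    scores, idx = peaks.flatten().topk(k)
    return scores, idx

heatmap = torch.rand(1, 128, 128)                     # placeholder for the predicted center heatmap
scores, idx = heatmap_peaks(heatmap)
print(scores.shape, idx.shape)                        # torch.Size([10]) torch.Size([10])
```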
RandAugment: Practical automated data augmentation with a reduced search space
  • Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models.
  • Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images.
  • An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models.
  • This paper by Cubuk et al. from Google Brain in 2019 demonstrates that previous learned-augmentation methods suffer from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size or the model size leads to sub-optimal performance. In previous work, scaling learned data augmentation to larger datasets and models has been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data; population based augmentation was not reported for large-scale problems.
  • They propose RandAugment, a simple parameterization for targeting augmentation to particular model and dataset sizes, which seeks to remove both of the aforementioned obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes.
  • RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet without a separate search for data augmentation policies.
  • The proposed method scales quite well to datasets such as ImageNet and COCO while incurring minimal computational cost: it exposes only two hyperparameters (the number of transformations applied per image and a single global magnitude), yet yields notable predictive performance gains (see the sketch at the end of this entry).
  • On the ImageNet dataset they achieve 85.0% accuracy, a 0.6% increase over the previous state of the art and a 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO.
  • Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
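  • The whole method reduces to the two hyperparameters mentioned above: the number of transformations \(N\) applied per image and a single global magnitude \(M\). A hedged sketch of the selection loop with a toy set of NumPy operations (the real implementation uses a richer set of PIL-based transforms):

```python
import numpy as np

def identity(img, m): return img
def flip_lr(img, m): return img[:, ::-1]
def rotate90(img, m): return np.rot90(img, k=1 + int(3 * m / 10))       # magnitude -> number of quarter-turns
def brightness(img, m): return np.clip(img * (1.0 + 0.1 * m), 0, 255)

OPS = [identity, flip_lr, rotate90, brightness]   # toy stand-ins for the paper's transform set

def rand_augment(img, n=2, m=9, seed=0):
    """Apply N randomly chosen transformations, each parameterized by the single global magnitude M."""
    rng = np.random.default_rng(seed)
    for idx in rng.integers(0, len(OPS), size=n):
        img = OPS[idx](img, m)
    return img

img = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)
print(rand_augment(img, n=2, m=9).shape)          # (32, 32, 3)
```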

2020

Designing Network Design Spaces
  • This paper by Radosavovic et al. from FAIR in CVPR 2020 presents a new network design paradigm. Their goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level.
  • Their methodology explores the structural aspect of network design and arrives at a low-dimensional design space consisting of simple, regular networks that they call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function (see the sketch at the end of this entry).
  • They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes.
  • Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
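  • A hedged sketch of my reading of the quantized linear width rule \(u_j = w_0 + w_a \cdot j\), with widths then snapped to \(w_0 \cdot w_m^{s_j}\) for integer \(s_j\) and rounded to multiples of 8; the constants below are illustrative, not a specific RegNet instance:

```python
import numpy as np

def regnet_widths(depth=13, w0=24, wa=36.0, wm=2.5):
    """Quantized linear widths: u_j = w0 + wa * j, snapped to the nearest w0 * wm^s_j."""
    j = np.arange(depth)
    u = w0 + wa * j                               # continuous linear widths
    s = np.round(np.log(u / w0) / np.log(wm))     # nearest integer power of wm for each width
    w = w0 * np.power(wm, s)                      # quantized widths
    return (np.round(w / 8) * 8).astype(int)      # round to multiples of 8, as is common practice

print(regnet_widths())                            # per-block widths; consecutive equal widths form a stage
```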
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
  • This paper by Dosovitskiy et al. from Google Brain in ICLR 2021 shows that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the proposed Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
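  • The "image patches as tokens" idea can be written as a single strided convolution; a minimal sketch of the patch-embedding step (class token and position embeddings are omitted; the 16-pixel patch size and 768-dim embedding match the ViT-Base defaults):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping P x P patches and linearly project each to a D-dim token."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # one "patchify" convolution

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) with N = (H/P) * (W/P) tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 196, 768]): 14 x 14 = 196 patch tokens
```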

Training data-efficient image transformers & distillation through attention
  • Compared to CNNs, vision transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
  • This paper by Touvron et al. from Facebook AI proposes DeiT, a competitive convolution-free transformer that does not require a very large amount of data to be trained, thanks to improved training and in particular a novel distillation procedure. DeiT is trained on ImageNet on a single computer in less than 3 days. Their reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
  • They introduce a teacher-student strategy specific to transformers. Using distillation can hamper the performance of neural networks: the student model pursues two objectives that may diverge, learning from a labeled dataset (strong supervision) and learning from the teacher. To alleviate this, they introduce a distillation token, a learned vector that flows through the network along with the transformed image data. The distillation token cues the model for its distillation output, which can differ from its class output. This new distillation method is specific to Transformers and further improves the image classification performance.
  • It relies on a distillation token ensuring that the student learns from the teacher through attention. They show the interest of this token-based distillation, especially when using a ConvNet as a teacher. This leads them to report results competitive with CNNs both on ImageNet (where they obtain up to 85.2% top-1 accuracy) and when transferring to other tasks.
  • Facebook AI post.
  • Github repo.

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • This paper by Mildenhall et al. from UC Berkeley, Google and UCSD in ECCV 2020 introduces NeRF, a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
  • Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location \((x, y, z)\) and viewing direction \((\theta, \phi)\)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
  • They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize their representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
  • Project page with videos and code.
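  • A minimal NumPy sketch of the classic volume-rendering step along a single ray, which is the differentiable piece that makes the whole pipeline trainable (the densities and colors would come from the MLP; here they are random placeholders):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                                    # opacity of each segment
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))      # accumulated transmittance
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0)                             # expected color along the ray

n_samples = 64
sigmas = np.random.rand(n_samples)            # volume densities from the network (placeholder)
colors = np.random.rand(n_samples, 3)         # view-dependent RGB from the network (placeholder)
deltas = np.full(n_samples, 1.0 / n_samples)  # distances between adjacent samples along the ray
print(render_ray(sigmas, colors, deltas))     # one rendered RGB value
```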

2021

Do Vision Transformers See Like Convolutional Neural Networks?
  • Given the central role of convolutional neural networks in computer vision breakthroughs (leading to them being the de-facto model for visual data), it is remarkable that Transformer architectures (almost identical to those used in language) are capable of similar performance. For instance, recent work has shown that the Vision Transformer (ViT) model can achieve comparable or even superior performance on image classification tasks. This raises fundamental questions on whether these architectures work in the same way as CNNs: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations?
  • This paper by Raghu et al. from Google Brain in 2021 analyzes the internal representation structure of ViTs and CNNs on image classification benchmarks, and finds striking differences in the features and internal structures between the two architectures, such as ViT having more uniform representations across all layers. They explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information (“earlier global features”), and ViT residual connections, which offer representation propagation of features from lower to higher layers, while also revealing that some CNN properties, e.g. local information aggregation at lower layers, are important to ViTs, being learned from scratch at scale.
  • They also examine the potential for ViTs to be used beyond classification through a study of spatial localization, discovering that ViTs with CLS tokens successfully preserve input spatial information, which is promising for future uses in object detection.
  • Finally, they investigate the effect of scale for transfer learning, finding larger ViT models develop significantly stronger intermediate representations through larger pretraining datasets. These results are also very pertinent to understanding recent architectures for vision such as the MLP-Mixer.
BEiT: BERT Pre-Training of Image Transformers
  • This paper by Bao et al. from Microsoft Research in 2021 introduces a self-supervised pre-trained representation model called BEiT, which stands for Bidirectional Encoder representations from Image Transformers. Following BERT from the natural language processing area, they propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views during pre-training: image patches (such as 16x16 pixels), whose embeddings are computed as linear projections of flattened patches, and visual tokens (i.e., discrete tokens). Before pre-training, they learn a discrete variational autoencoder (dVAE) which acts as an “image tokenizer”, learnt via autoencoding-style reconstruction: the input image is tokenized into discrete visual tokens given by the latent codes of the dVAE (the image tokenizer from DALL-E; Ramesh et al., 2021) according to the learned vocabulary.
  • They show that the proposed method is critical to make BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers. They also present the intriguing property of automatically acquired knowledge about semantic regions, without using any human-annotated data.
  • Similar to the masked language modeling pre-training task of BERT, BEiT randomly masks some image patches and feeds them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
  • After pre-training BEiT, they directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
  • Experimental results on image classification and semantic segmentation show that BEiT achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
  • Code and pretrained models are here.

2022

A ConvNet for the 2020s
  • This paper by FAIR and UC Berkeley seeks to refute the recent apparent superiority of Transformers by re-examining the design of ConvNets and testing their limitations. The proposed approach is based on gradually modifying a standard ResNet50, following design choices closely inspired by the Vision Transformer, to propose a new family of pure ConvNets called ConvNeXt, which can perform as well as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks.
  • The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
  • However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
  • In this paper, the authors reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
  • The authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. They implement a series of design decisions starting with a ResNet50 trained with up-to-date techniques (extending the number of epochs, using AdamW optimizer, Stochastic Depth, Label Smoothing, and so on):
    • Macro Design: The authors considered two aspects of Swin Transformers’ macro design. The first is the number of blocks in each stage (stage compute ratio), which was adjusted from (4, 4, 6, 3) to (3, 3, 9, 3), following the Swin Transformer ratio of (1:1:3:1). The second is the stem cell configuration, which in the original ResNet50 consisted of 7×7 convolutions with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer which utilizes 4×4 non-overlapping convolutions with stride 4. These modifications improved the accuracy to 79.5%.
    • ResNeXt: In this part, the authors adopt two design choices of the popular ResNeXt: depthwise convolutions, which are interestingly similar to self-attention as they work on a per-channel basis, and a higher number of channels (from 64 to 96). These modifications improved the accuracy to 80.5%.
    • Inverted Bottleneck: An essential configuration of Transformers is the expansion-compression rate in the MLP block (the hidden dimension is 4 times higher than the input and output dimension). This feature was reproduced by adding the inverted bottleneck design used in ConvNets (where the input is expanded using \(1 \times 1\) convolutions and then shrunk through depthwise convolution and \(1 \times 1\) convolutions). This modification slightly improved the accuracy to 80.6%.
    • Large kernel sizes: The gold standard in ConvNets since the advent of VGG has been \(3 \times 3\) kernels. Small kernels lead to the famous local receptive field, which, compared to global self-attention, has a more limited area of focus. Although Swin Transformers reintroduced the concept of local attention, their window size has always been at least \(7 \times 7\). To explore larger kernels, the first step is to move the depthwise convolution up, before the \(1 \times 1\) layers, so that the expensive large-kernel operation sees fewer channels. This first modification resulted in a temporary degradation to 79.9%, but, after experimenting with different sizes, a \(7 \times 7\) window (larger values brought no further gains) allowed the authors to reach an accuracy of 80.6% again.
    • Micro Design: Finally, some micro design choices were added: GELU instead of ReLU, a single activation per block (the Transformer MLP block has just one activation between its two linear layers), fewer normalization layers, Batch Normalization substituted by Layer Normalization, and a separate downsampling layer (see the block sketch at the end of this entry).
    • These modifications improved the accuracy to 82.0% and defined the final model, named ConvNeXt.
  • A comparison of this architecture with the Swin Transformer and ResNet can be found in the paper.

  • Based entirely on convolutions, this model competed on par with Transformer-based architectures, achieving a top-1 accuracy of 87.8% on ImageNet classification. Equally excellent results were obtained in other tasks, such as object detection and segmentation on COCO and semantic segmentation on ADE20K.
  • The idea of modernizing ConvNets, adding all the concepts introduced over the past decade to a single model, is payback for convolutions, which have lately been sidelined in favor of transformers. The authors suggest that ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. They believe the architecture choice should meet the needs of the task at hand while striving for simplicity and efficiency.
  • Github repo.
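  • Putting the list of design choices together, a hedged sketch of a ConvNeXt-style block: depthwise 7×7 convolution, LayerNorm, an inverted-bottleneck MLP with a single GELU, and a residual connection (layer scale and stochastic depth are omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project, plus a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise convolution
        self.norm = nn.LayerNorm(dim)                 # normalizes over the channel dimension (channels-last)
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # 1x1 convs implemented as Linear on (B, H, W, C)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                             # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)        # to channels-last (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)       # back to channels-first

print(ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```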
Natural Language Descriptions of Deep Visual Features
  • Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible?
  • This paper by Hernandez et al. from MIT, Northeastern and Allegheny College in 2022 proposes MILAN (mutual-information-guided linguistic annotation of neurons), which aims to generate open-ended, compositional, natural language descriptions of individual neurons in deep networks.
  • Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models.
  • They highlight three applications of natural language neuron descriptions.
    • First, they use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
    • Second, they use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
    • Finally, they use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
  • MarkTechPost [link](https://www.marktechpost.com/2022/02/06/mit-researchers-introduce-a-machine-learning-technique-that-can-automatically-describe-the-roles-of-individual-neurons-in-a-neural-network-with-natural-language/).
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
  • Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks.
  • This paper by Goyal et al. in 2022 from FAIR asks whether, using this ability, we can learn any salient and more representative information present in the diverse, unbounded set of images from across the globe. To do so, they train models on billions of random images without any data pre-processing or prior assumptions about what the model should learn. This is a very large-scale experiment in which a RegNet architecture scaled to a dense 10 billion parameters (to avoid underfitting on a large data size) is pre-trained using the SwAV self-supervised method on a large collection of 1 billion randomly selected public images from Instagram with a diversity of gender, ethnicity, cultures, and locations (all outside the EU because of GDPR).
  • They achieve state-of-the-art results on a majority of 50 transfer tasks, including fairness, robustness to distribution shift, geographical diversity, fine-grained classification, image copy detection and many image classification datasets. The resulting model not only captures semantic information well, it also captures information about artistic style and learns salient information such as geo-locations and multilingual word embeddings based on visual content only.
  • The key takeaway is that large-scale self-supervised pre-training yields more robust, fair, less harmful, and less biased results than supervised models or models trained on object centric datasets such as ImageNet.
Block-NeRF: Scalable Large Scene Neural View Synthesis
  • This paper by Tancik et al. from UC Berkeley, Waymo and Google Research in 2022 presents Block-NeRF, a variant of Neural Radiance Fields (NeRFs) that can reconstruct large-scale environments.
  • They demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs that can be optimized independently. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.
  • At such a scale, the data collected will necessarily have transient objects and variations in appearance, which they account for by modifying the underlying NeRF architecture to make NeRF robust to data captured over months under different environmental conditions. They add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined.
  • They demonstrate the method’s efficacy by building an entire neighborhood in San Francisco from 2.8M images using a grid of Block-NeRFs, forming the largest neural scene representation to date.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
  • Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation.
  • This paper from Bardes et al. from FAIR and NYU in ICLR 2022 introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
  • VICReg offers a simple approach to self-supervised learning based on a triple objective: learning invariance to different views with an invariance term, avoiding collapse of the representations with a variance preservation term, and maximizing the information content of the representation with a covariance regularization term.
  • VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks, but is not subject to the same limitations as most other methods, particularly because it does not require the embedding branches to be identical or even similar. In addition, they show that incorporating their new variance term into other methods helps stabilize the training and leads to performance improvements.
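  • A hedged sketch of the three VICReg terms on a pair of embedding batches; the loss weights, hinge target, and epsilon follow the paper's defaults as I recall them, so treat them as illustrative:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Invariance (MSE) + variance hinge (per-dimension std >= 1) + covariance (off-diagonal) terms."""
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)                                        # invariance term

    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()     # variance preservation term

    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        return (cov.pow(2).sum() - cov.pow(2).diag().sum()) / d       # penalize off-diagonal entries only

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)                       # covariance (decorrelation) term
    return sim_w * inv + var_w * var + cov_w * cov

z_a, z_b = torch.randn(256, 128), torch.randn(256, 128)               # embeddings of two views of a batch
print(vicreg_loss(z_a, z_b))
```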
Masked Autoencoders Are Scalable Vision Learners
  • Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning. In this study, the authors observe on ImageNet and in transfer learning that an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits. Self-supervised learning in vision may thus now be embarking on a similar trajectory as in NLP.
  • This paper by He et al. from Facebook AI in 2022 shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
  • Their MAE approach is simple: they mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs.
  • First, they develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
  • Second, they note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. The word (or subword) analog for images is pixels. But decomposing the image into patches (like ViT) reduces the quadratic computation cost of transformers compared to operating at the pixel level. However, ViT and its derived models are infamous for their data appetite and/or training slowness. Instead of attempting to remove objects, they remove random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task (a minimal sketch of the masking step appears at the end of this entry). Coupling these two designs enables them to train large models efficiently and effectively: they accelerate training (by 3x or more) and improve accuracy.
  • As with any autoencoder, after pre-training the decoder is thrown away and the encoder is fine-tuned for downstream tasks.
  • Their scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
  • Overall, they observe that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. They hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE.
  • HuggingFace docs
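  • A minimal sketch of the random 75% patch masking at the heart of MAE, using a shuffle-and-keep approach so the visible subset can be gathered in one shot (the encoder then only processes the roughly 25% visible tokens):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings. Returns the visible subset and the binary mask."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                    # a random permutation of patch indices
    ids_keep = ids_shuffle[:, :n_keep]                    # keep the first ~25%
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                       # 0 = visible, 1 = masked (to be reconstructed)
    return visible, mask

tokens = torch.randn(2, 196, 768)                         # 14 x 14 patches from a 224x224 image
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))                     # torch.Size([2, 49, 768]) tensor([147., 147.])
```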

The Effects of Regularization and Data Augmentation are Class Dependent
  • Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model’s complexity. Current Deep Networks heavily rely on regularizers such as data augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters.
  • This paper by Balestriero et al. from Facebook AI in 2022 demonstrates that regularization techniques such as DA or weight decay increase the average test performance at the cost of significant performance drops on some specific classes. In other words, regularization produces a model with a reduced complexity that is unfair across classes. By focusing on maximizing aggregate performance statistics, the field has produced learning mechanisms that can be potentially harmful, especially in transfer learning tasks. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes, e.g., on ImageNet with a ResNet50, the “barn spider” classification test accuracy falls from 68% to 46% just by introducing random-crop DA during training. Even more surprising, such performance drops also appear when introducing uninformative regularization techniques such as weight decay.
  • These results demonstrate that the search for ever-increasing generalization performance – averaged over all classes and samples – has left us with models and regularizers that silently sacrifice performance on some classes. In fact, they also observe that varying the amount of regularization employed during pre-training on a specific dataset impacts the per-class performance of that pre-trained model on different downstream tasks, e.g., an ImageNet pre-trained ResNet50 deployed on iNaturalist sees its performance fall from 70% to 30% on a particular class when introducing random-crop DA during the ImageNet pre-training phase. These results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
  • Here’s an intuitive explanation:
    • Some types of data augmentation and weight decay help some categories but hurt others.
    • Categories largely identifiable by color or texture (e.g., a yellow bird, a textured mushroom) are unaffected by aggressive cropping, while categories identifiable by shape (e.g., a corkscrew) see a performance degradation with aggressive cropping that only keeps part of the object.
    • Conversely, color jitter does not affect shape- or texture-based categories (e.g., zebra), but does affect color-based categories (e.g., basketball).
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
  • Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. Moreover, many graphics problems rely on task specific data structures to exploit the sparsity or smoothness of the problem at hand.
  • This paper by Muller et al. from Nvidia in 2022 proposes Instant-NGP, which reduces this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations. Instant-NGP offers near-instant training of neural graphics primitives on a single GPU for multiple tasks.
  • To this end, a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Multi-resolution hash encoding provides a practical learning-based alternative that automatically focuses on relevant detail, independent of task at hand. Its low overhead allows it to be used even in time-constrained settings like online training and inference.
  • For the gigapixel image task, they represent an image directly by a neural network. The SDF task learns a signed distance function in 3D space whose zero level-set represents a 2D surface. NeRF uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. Lastly, neural volume learns a denoised radiance and density field directly from a volumetric path tracer. In all tasks, their encoding and its efficient implementation provide clear benefits: instant training, high quality, and simplicity. The encoding is task-agnostic: they use the same implementation and hyperparameters across all tasks and only vary the hash table size, which trades off quality and performance (a rough sketch of the hash lookup appears at the end of this entry).
  • The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. In the context of neural network input encodings, it is a drop-in replacement, for example speeding up NeRF by several orders of magnitude and matching the performance of concurrent non-neural 3D reconstruction techniques.
  • They leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations.
  • While slow computational processes in any setting, from lightmap baking to the training of neural networks, can lead to frustrating workflows due to long iteration times, they achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080. They have demonstrated that single-GPU training times measured in seconds are within reach for many graphics applications, allowing neural approaches to be applied where previously they may have been discounted.
  • Github repo.
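  • A hedged sketch of the spatial-hash lookup at one resolution level, following my reading of the paper's XOR-of-primes hash; the trainable feature table and the trilinear interpolation of the eight surrounding corners are what the full method adds on top:

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)  # per-dimension primes as I recall them

def hash_grid_index(grid_coords, table_size=2**19):
    """Spatial hash of integer 3-D grid coordinates: XOR of coordinate * prime, modulo the table size."""
    coords = grid_coords.astype(np.uint64)
    h = coords[..., 0] * PRIMES[0]
    for i in range(1, coords.shape[-1]):
        h = np.bitwise_xor(h, coords[..., i] * PRIMES[i])
    return h % np.uint64(table_size)

# Index a trainable feature table (F features per entry) for the 8 corner coordinates of one voxel.
feature_table = np.random.randn(2**19, 2).astype(np.float32)    # T x F learnable features (placeholder)
corners = np.random.randint(0, 512, size=(8, 3))                 # 8 integer corners at one grid level
print(feature_table[hash_grid_index(corners)].shape)             # (8, 2)
```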

Text

2003

A Neural Probabilistic Language Model
  • This paper by Bengio et al. from the University of Montreal in 2003 revolutionized statistical language modeling by replacing “tables of conditional probabilities” with more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables.
  • The traditional technique of learning the joint probability function of sequences of words in a language was intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
  • They propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential/combinatorial number of semantically neighboring sentences, which forms the main reason for the spectacular improvements the proposed approach offers. The model learns simultaneously (i) a distributed representation for each word along with (ii) the probability function for word sequences, expressed in terms of these representations.
  • Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.
  • They report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
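  • A minimal sketch of the model's core idea, assuming a toy vocabulary: each of the previous \(n-1\) words is mapped to a learned embedding, the embeddings are concatenated, and an MLP predicts a distribution over the next word (the paper's optional direct connections from the embeddings to the output are omitted):

```python
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Bengio-style LM: concat(embeddings of the n-1 context words) -> tanh hidden layer -> softmax."""
    def __init__(self, vocab_size=1000, context=3, emb_dim=60, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                  # (B, n-1) word indices
        e = self.emb(context_ids).flatten(1)         # concatenate the context embeddings
        return self.out(torch.tanh(self.hidden(e)))  # logits over the next word

lm = NeuralProbabilisticLM()
logits = lm(torch.randint(0, 1000, (4, 3)))
print(logits.shape)                                  # torch.Size([4, 1000])
```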

2010

Recurrent neural network based language model
  • This paper by Mikolov et al. from Khudanpur’s lab at JHU in Interspeech 2010 was the first to propose a recurrent neural network-based language model (RNN LM), with applications to speech recognition.
  • The results indicate that it is possible to obtain around a 50% reduction in perplexity (PPL) by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around an 18% reduction in word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, around 5% on the much harder NIST RT05 task, and 12% even when the backoff model is trained on 5 times more data than the RNN model. For NIST RT05, they conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models trained on hundreds of times more data.
  • They provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, with the exception of their high computational (training) cost. Recurrent neural networks significantly outperformed state-of-the-art backoff models in all of the experiments, most notably even in the case when backoff models were trained on much more data than the RNN LMs.
  • The paper seeks to break the myth that language modeling is just about counting n-grams, and that the only reasonable way to improve results is by acquiring more training data.

2013

Efficient Estimation of Word Representations in Vector Space
  • “You shall know a word by the company it keeps” — J. R. Firth.
  • This paper by Mikolov et al. from Google in 2013 proposes word2vec, which comprises two novel model architectures for computing continuous vector representations of words from very large data sets. They studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks involving word similarity, and the results are compared to the previously best performing techniques based on different types of neural networks. They observe large improvements in accuracy at much lower computational cost, i.e., it takes less than a day to learn high quality word vectors from a 1.6 billion word data set.
  • They observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.
  • Furthermore, they show that these vectors provide state-of-the-art performance on their test set for measuring syntactic and semantic word similarities.
  • Word2vec popularized the “King – Man + Woman = Queen” analogy.
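  • For a quick, hedged illustration of the analogy behavior using the gensim library (assuming gensim ≥ 4 is installed; the toy corpus below only makes the snippet runnable, and meaningful vectors require a large corpus):

```python
# Hedged sketch using gensim's word2vec (skip-gram); corpus and hyperparameters are illustrative.
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "and", "a", "woman", "walk"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 selects skip-gram
# The well-known analogy query: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```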

2014

GloVe: Global Vectors for Word Representation
  • Word2vec relies only on local information of language. That is, the semantics learnt for a given word, is only affected by the surrounding words.
  • This paper by Pennington et al. from Stanford in EMNLP 2014 proposed Global Vectors (GloVe), an unsupervised learning algorithm which captures both global statistics and local statistics of a corpus, in order to train word vectors. Training is performed on aggregated global word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
  • Recently, considerable attention has been focused on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. Currently, prediction-based models garner substantial support; for example, Baroni et al. (2014) argue that these models perform better across a range of tasks. They argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous.
  • After Tomas Mikolov et al. released word2vec, there was a boom of papers about word vector representations. GloVe was one such proposal, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurrence matrices. Note that GloVe does not use neural networks while word2vec does.
  • They construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
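  • For reference, GloVe fits word vectors \(w_i\) and context vectors \(\tilde{w}_j\) (with biases) to the log co-occurrence counts via a weighted least-squares objective:

    \[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \]

  • Here \(X_{ij}\) is the number of times word \(j\) occurs in the context of word \(i\); the weighting function \(f\) caps the influence of very frequent co-occurrences (the paper uses \(\alpha = 3/4\) and \(x_{\max} = 100\)).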
Sequence to Sequence Learning with Neural Networks
  • This paper by Sutskever et al. from Google in 2014 introduced seq2seq encoder-decoder learning to map sequences to sequences, a task that simple Deep Neural Networks (DNNs) cannot be used to accomplish.
  • They present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Their method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. They show that a large deep LSTM with a limited vocabulary can outperform a standard statistical machine translation (SMT)-based system whose vocabulary is unlimited on a large-scale MT task. The success of their simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • Their main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When the LSTM was used to rerank the 1000 hypotheses produced by the aforementioned SMT system, the BLEU score increased to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice.
  • They also find that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
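  • A minimal sketch of the encoder-decoder setup (sizes are illustrative; the paper used deep 4-layer LSTMs with 1000 cells and reversed source sentences):

```python
# Minimal seq2seq sketch: an encoder LSTM compresses the source into a fixed-size state,
# and a decoder LSTM generates the target conditioned on that state. Sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))           # (h, c) summarizes the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)  # decode conditioned on that state
        return self.out(dec_out)                                 # per-position target logits

logits = Seq2Seq()(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 5)))
```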
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  • This paper by Cho et al. from Bengio’s lab in EMNLP 2014 introduced the seq2seq encoder-decoder model for neural machine translation. They propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN) that is together able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length. The encoder RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.
  • The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence.
  • The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
  • Along with the new architecture, they propose a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence. This unit later became known as the Gated Recurrent Unit (GRU); its gating equations are sketched after this entry.
  • They evaluated the proposed model on the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, they were able to show that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder–Decoder is able to propose well-formed target phrases.
  • The scores from the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. They also found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that performance can be improved further by using, for instance, the RNN Encoder–Decoder and a neural net language model together.
  • Qualitative analysis shows that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases at multiple levels, i.e., at the word level as well as the phrase level. This suggests that there may be more natural-language-related applications that could benefit from the proposed RNN Encoder–Decoder.
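  • The proposed gated hidden unit updates as follows (omitting bias terms, up to notational conventions):

    \[ r_t = \sigma(W_r x_t + U_r h_{t-1}), \quad z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad \tilde{h}_t = \tanh\big(W x_t + U (r_t \odot h_{t-1})\big), \quad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \]

  • The reset gate \(r_t\) decides how much of the previous state enters the candidate \(\tilde{h}_t\), and the update gate \(z_t\) interpolates between keeping the old state and adopting the candidate.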

2015

Neural Machine Translation by Jointly Learning to Align and Translate
  • This paper by Bahdanau et al. from Bengio’s lab in ICLR 2015 introduced the attention mechanism (borrowed from the field of information retrieval) within the context of NLP (commonly called Bahdanau attention or additive attention in the field).
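  • A minimal sketch of additive (Bahdanau-style) attention, where each source annotation is scored against the previous decoder state and the resulting weights form a context vector (dimensions are illustrative):

```python
# Minimal sketch of additive attention: score = v^T tanh(W s + U h), softmax over source positions.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim=512, enc_dim=512, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)   # projects the decoder state
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)   # projects each encoder annotation
        self.v = nn.Linear(attn_dim, 1, bias=False)         # scalar score per source position

    def forward(self, dec_state, enc_outputs):              # (B, dec_dim), (B, T, enc_dim)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) + self.U(enc_outputs)))
        alpha = torch.softmax(scores, dim=1)                 # alignment weights over the source
        context = (alpha * enc_outputs).sum(dim=1)           # (B, enc_dim) context vector
        return context, alpha.squeeze(-1)

ctx, weights = AdditiveAttention()(torch.randn(2, 512), torch.randn(2, 7, 512))
```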
Effective Approaches to Attention-based Neural Machine Translation
  • Neural Machine Translation by Jointly Learning to Align and Translate proposed an attention mechanism to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
  • This paper by Luong et al. in EMNLP 2015 from Manning’s group at Stanford explores useful architectures for attention-based NMT. It examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time (the alignment scoring functions they compare are summarized after this entry).
  • They demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, they achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout.
  • Their ensemble model using different attention architectures has established a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
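  • For reference, the alignment scoring functions compared in the paper, for the current target hidden state \(h_t\) and a source hidden state \(\bar{h}_s\), are:

    \[ \text{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{(dot)} \\ h_t^\top W_a \bar{h}_s & \text{(general)} \\ v_a^\top \tanh\big(W_a [h_t; \bar{h}_s]\big) & \text{(concat)} \end{cases} \]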

2016

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
  • Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential.
  • This paper by Wu et al. from Google in 2016 presents GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Their model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections.
  • To improve parallelism and therefore decrease training time, their attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, they employ low-precision arithmetic during inference computations.
  • To improve handling of rare words, they divide words into a limited set of common sub-word units (“wordpieces”) for both input and output (a toy segmentation sketch appears after this entry). This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system.
  • Their beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence.
  • On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
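  • The sketch below is an illustrative greedy sub-word segmenter in the spirit of wordpieces; it is a simplification, not GNMT's exact wordpiece model (which is learned from data), and the "##" continuation marker and toy vocabulary are assumptions made for the example.

```python
# Minimal sketch of greedy longest-match sub-word segmentation (illustrative only).
def wordpiece_segment(word, vocab, unk="<unk>"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                      # no piece matched: emit an unknown marker
            return [unk]
        start = end
    return pieces

vocab = {"un", "##believ", "##able", "believ", "##e"}
print(wordpiece_segment("unbelievable", vocab))   # ['un', '##believ', '##able']
```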

2017

Attention Is All You Need
  • This paper by Vaswani et al. from Google in NeurIPS 2017 introduced Transformers (that are based on scaled dot-product multi-headed attention) which are prevalent in most NLP and CV areas today.
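  • The core operation is scaled dot-product attention, \(\text{Attention}(Q, K, V) = \text{softmax}\big(QK^\top / \sqrt{d_k}\big)V\); a minimal sketch follows (shapes are illustrative).

```python
# Minimal sketch of scaled dot-product attention.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g., a causal mask for decoding
    weights = torch.softmax(scores, dim=-1)                    # attention distribution
    return weights @ v                                         # weighted sum of the values

out = scaled_dot_product_attention(torch.randn(2, 8, 10, 64),
                                   torch.randn(2, 8, 10, 64),
                                   torch.randn(2, 8, 10, 64))
```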

2018

Deep contextualized word representations
  • This paper by Peters et al. from Allen AI and UWash in NAACL 2018 introduced context-sensitive word embeddings using the LSTM-based Embeddings from Language Models (ELMo) architecture.
Improving Language Understanding by Generative Pre-Training
  • Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
  • This paper by Radford et al. from OpenAI in 2018 introduces a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning, and demonstrates that large gains on the aforementioned NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • In contrast to previous approaches, they make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
  • By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets and thus outperforming discriminatively trained models that use architectures specifically crafted for each task. For instance, they achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
  • Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Their work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach.

2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • This paper by Devlin et al. from Google in ACL 2019 proposed BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language representation model which proposed pre-training bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is pre-trained using two unsupervised tasks: (i) masked language modeling (MLM) and, (ii) next sentence prediction (NSP).
    • MLM is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM (a minimal sketch of the masking procedure appears after this entry).
    • NSP is needed because many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, they pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
  • Fine-tuning for the task at hand involves using an additional output layer, to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
  • BERT comes in two flavors: (i) BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; (ii) BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
  • BERT consumes a max of 512 input tokens. At its output, word embeddings for BERT (what is called BERT-base) have 768 dimensions.
  • BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
  • BERT demonstrated that unsupervised pretraining is an integral part of many language understanding systems and enables even low-resource tasks to benefit from them.
  • Google Blog’s article that discusses using BERT for improving search relevance and ranking.
  • Also, Fabio Chiusano has put together a brief timeline of NLP models, from Bag of Words to the Transformer family.
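  • As a concrete illustration of the MLM corruption mentioned above: 15% of token positions are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A minimal sketch:

```python
# Minimal sketch of BERT-style masked language modeling corruption.
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

corrupted, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"],
                                vocab=["the", "cat", "sat", "on", "mat", "dog"])
```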

RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, while hyperparameter choices have significant impact on the final results.
  • This paper by Liu et al. from University of Washington and Facebook AI in 2019 carefully evaluates a number of design decisions when pretraining BERT models.
  • They present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. They find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. They find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
  • Their improved pretraining procedure, which they call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results highlight the importance of previously overlooked design choices, and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
  • Note that RoBERTa uses only the masked language model objective (and does not train using the next sentence prediction objective), and achieves better results than the original BERT.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  • This paper by Lewis et al. from Facebook AI in 2019 presented BART, a denoising autoencoder for pretraining sequence-to-sequence models that learns to map corrupted documents to the original. BART is trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
  • They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token (a minimal sketch of this noising scheme appears after this entry).
  • Background: With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.

  • With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation. However words can only condition on leftward context, so it cannot learn bidirectional interactions.

  • BART applies noising schemes to an input document and thus corrupts it by replacing spans of text with mask symbols. The corrupted document is encoded with a bidirectional model, and then the likelihood of the original document is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and representations from the final hidden state of the decoder are used. The advantage of this scheme is that inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations.

  • BART is particularly effective when finetuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
  • BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining.
  • BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
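  • A minimal sketch of the text-infilling noising scheme mentioned above (span lengths are drawn from a Poisson distribution with \(\lambda = 3\) in the paper; the masking ratio and the greedy span placement below are illustrative simplifications):

```python
# Minimal sketch of BART-style text infilling: whole spans collapse to a single mask token,
# and zero-length spans insert a mask. Uses numpy only for the Poisson sampler.
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, lam=3, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)            # roughly how many tokens to corrupt
    while budget > 0 and len(tokens) > 1:
        span = int(rng.poisson(lam))
        start = int(rng.integers(0, len(tokens)))
        span = min(span, len(tokens) - start)
        tokens[start:start + span] = [mask_token]      # replace the span with one mask token
        budget -= max(span, 1)
    return tokens

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```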
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
  • This paper by Sanh et al. from Huggingface, presented at the Energy Efficient Machine Learning and Cognitive Computing workshop at NeurIPS 2019, introduced a language representation model, DistilBERT, which is a general-purpose pre-trained version of BERT. DistilBERT is 40% smaller, 60% faster, cheaper to pre-train, and retains 97% of BERT’s language understanding capabilities. It can be fine-tuned with good performance on a wide range of tasks much like its larger counterparts.
  • While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that DistilBERT is a compelling option for edge applications.
  • To leverage the inductive biases learned by larger models during pretraining, they introduce a triple loss combining language modeling, distillation, and cosine-distance losses (a minimal sketch follows this entry).
  • The paper also charts the parameter counts of several recently released pretrained language models to put DistilBERT’s size in context.
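  • A minimal sketch of the triple loss mentioned above: soft-target distillation (KL divergence with a temperature), the usual masked-LM cross-entropy, and a cosine term aligning student and teacher hidden states. Shapes, the temperature, and the equal weighting are illustrative, not the paper's exact settings.

```python
# Minimal sketch of a DistilBERT-style triple loss (illustrative weights and temperature).
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels, student_hidden, teacher_hidden, T=2.0):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T                    # distillation on soft targets
    mlm = F.cross_entropy(student_logits, labels)                      # masked language modeling
    cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()  # hidden-state alignment
    return soft + mlm + cos

loss = triple_loss(torch.randn(8, 30522), torch.randn(8, 30522),
                   torch.randint(0, 30522, (8,)),
                   torch.randn(8, 768), torch.randn(8, 768))
```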

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
  • This paper by Dai et al. from CMU and Google Brain in 2019 proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence.
  • Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
  • Transformer-XL not only enables capturing longer-term dependencies than RNNs and vanilla Transformers and resolves the context fragmentation problem, but also achieves substantial speedups during evaluation. As a result, Transformer-XL learns dependencies that are 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
  • They improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
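  • A minimal sketch of the segment-level recurrence idea: hidden states from the previous segment are cached with gradients stopped and prepended as extra context for the current segment (tensor shapes and the memory length below are illustrative; the real model also uses relative positional encodings).

```python
# Minimal sketch of Transformer-XL-style segment-level recurrence (per layer).
import torch

def concat_memory(prev_hidden, curr_hidden, mem_len=128):
    # prev_hidden, curr_hidden: (seq_len, batch, d_model)
    memory = prev_hidden.detach()                       # stop gradients into the previous segment
    extended = torch.cat([memory, curr_hidden], dim=0)  # keys/values attend over memory + current
    new_memory = extended[-mem_len:]                    # keep the most recent mem_len states
    return extended, new_memory

prev = torch.randn(128, 2, 512)
curr = torch.randn(128, 2, 512, requires_grad=True)
extended, new_mem = concat_memory(prev, curr)
```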
XLNet: Generalized Autoregressive Pretraining for Language Understanding
  • With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
  • This paper by Yang et al. from CMU and Google in 2019 proposes XLNet considering BERT’s aforementioned pros and cons, and offers a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (thereby proposing a new objective called Permutation Language Modeling), and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Put simply, XLNet is a generalized autoregressive pretraining method that uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoder methods.
  • Furthermore, the neural architecture of XLNet is developed to work seamlessly with the autoregressive objective, including integrating ideas from Transformer-XL, the state-of-the-art autoregressive model and the careful design of the two-stream attention mechanism. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
  • Github repo.
Adaptive Input Representations for Neural Language Modeling
  • This paper by Baevski and Auli from Facebook AI in 2019 introduces adaptive input representations by varying the size of input word embeddings for neural language modeling. Adaptive input embeddings can improve accuracy while drastically reducing the number of model parameters.
  • There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units.
  • They perform a systematic comparison of popular choices for a self-attentional architecture.
  • Their experiments show that models equipped with adaptive embeddings are more than twice as fast to train as the popular character-input CNN while having fewer parameters.
  • On the WIKITEXT-103 benchmark, they achieve a perplexity of 18.7, an improvement of 10.5 over the previously best published result, and on the BILLION WORD benchmark, they achieve a perplexity of 23.02.
Attention Interpretability Across NLP Tasks
  • This paper by Vashishth et al. from IISc and Google in 2019 seeks to empirically validate the hypothesis that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction.
  • Some works (Jain & Wallace, 2019; Vig & Belinkov, 2019) have demonstrated that attention weights are not interpretable, and altering them does not affect the model output, while several others have shown that attention captures several linguistic notions in the model. They extend the analysis of prior works to diverse NLP tasks and demonstrate that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction and cannot simply be reduced to a gating unit. The paper takes a balanced approach – rather than a black-and-white one – drawing on previous literature that raised issues with the claim that “attentions are indicative of model predictions” and showing when attention is interpretable and when it is not.
  • The attention layer in a neural network model provides insights into the reasoning behind the model’s prediction, a process that such models are usually criticized for keeping opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights. Amid such confusion arises the need to understand the attention mechanism more systematically. The paper attempts to fill this gap with a comprehensive explanation that justifies both kinds of observations (i.e., when attention is interpretable and when it is not). Through a series of experiments on diverse NLP tasks, they validate their observations and reinforce the claim of interpretability of attention through manual evaluation.
  • They find that in both single- and pair-sequence tasks, the attention weights in samples with original weights do make sense in general. However, in the former case, the attention mechanism learns to give higher weights to tokens relevant to both kinds of sentiment. They show that attention weights in single-sequence tasks do not provide a reason for the prediction, whereas in pairwise tasks attention does reflect the reasoning behind the model output.
  • Unrelated to the paper: To use attention visualization as a proxy for interpreting your predictions, use the BertViz library. The lib supports multiple views and supports a plethora of models (BERT, GPT-2, XLNet, RoBERTa, XLM, ALBERT, DistilBERT, BART etc.). The BertViz repo has some nice examples to get started.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • This paper by Selvaraju et al. from Parikh/Batra’s team at GATech in 2019 proposes a technique for producing ‘visual explanations’ for decisions from a large class of CNN-based models, making them more transparent and explainable.
  • Their approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
  • Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training.
  • They combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
  • In the context of image classification models, their visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias.
  • For image captioning and VQA, their visualizations show that even non-attention based models learn to localize discriminative regions of input image.
  • They devise a way to identify important neurons through GradCAM and combine it with neuron names to provide textual explanations for model decisions.
  • Finally, they design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
  • Github repo; CloudCV demo.
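  • A minimal sketch of the Grad-CAM recipe: back-propagate the class score to the last convolutional feature maps, global-average-pool those gradients into per-channel weights, take a weighted sum of the feature maps, and apply a ReLU. The choice of VGG16 and of its last feature block is illustrative; in practice you would load pretrained weights and a real preprocessed image.

```python
# Minimal Grad-CAM sketch (random weights and a random image keep it self-contained;
# use a pretrained model and a real image in practice).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(weights=None).eval()
activations, gradients = {}, {}

layer = model.features[-1]                                   # last block of the conv feature extractor
layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

img = torch.randn(1, 3, 224, 224)                            # stand-in for a preprocessed image
score = model(img)[0].max()                                  # score of the top predicted class
score.backward()

weights = gradients["g"].mean(dim=(2, 3), keepdim=True)      # global-average-pool the gradients
cam = F.relu((weights * activations["a"]).sum(dim=1))        # weighted sum of feature maps + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=img.shape[-2:], mode="bilinear")  # upsample to image size
```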

2020

Language Models are Few-Shot Learners
  • Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.
  • This paper by Brown et al. from OpenAI in 2020 introduces Generative Pretrained Transformer (GPT)-3 and shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
  • Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
  • GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
  • At the same time, they also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
  • They also present broader societal impacts of their findings and of GPT-3 in general.
Longformer: The Long-Document Transformer
  • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
  • This paper by Beltagy et al. from Allen AI in 2020 seeks to address this limitation, by introducing the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
  • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
  • Following prior work on long-sequence transformers, they evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8.
  • In contrast to most prior work, they also pretrain Longformer and finetune it on a variety of downstream tasks.
  • Their pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. They finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
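  • A minimal sketch of the attention pattern: each token attends within a local sliding window, and a few designated global tokens attend to (and are attended by) every position. The window size and global positions below are illustrative, and a dense boolean mask is only for exposition; the actual implementation uses custom banded kernels to get linear scaling.

```python
# Minimal sketch of a Longformer-style attention mask (True = attention allowed).
import torch

def longformer_mask(seq_len, window=2, global_positions=(0,)):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                  # local sliding-window attention
    for g in global_positions:
        mask[g, :] = True                      # a global token attends everywhere
        mask[:, g] = True                      # and every token attends to it
    return mask

print(longformer_mask(8).int())
```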
Big Bird: Transformers for Longer Sequences
  • The primary limitation of Transformer-based models is the quadratic complexity (mainly in terms of memory, but also computation) on the sequence length due to their full attention mechanism. BigBird by Zaheer et al. from Google, published in NeurIPS 2020, remedied this by proposing a sparse attention mechanism that reduces this quadratic complexity to linear.
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Although measuring held-out test-set accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Further, ML systems can run to completion without throwing any errors (indicating functional correctness) but can still produce incorrect outputs (indicating behavioral issues). Thus, it is important to test the behavioral aspects of your model to make sure it works as you expected.
  • This paper by Ribeiro et al. from Microsoft, UW and UCI in 2020 introduces CheckList, a model-agnostic and task-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList tests individual capabilities of the model using three different test types.
  • Checklist includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. They illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models.
  • Tests created with CheckList can be applied to any model, making it easy to incorporate in current benchmarks or evaluation pipelines. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model that has “solved” existing benchmarks on three different tasks. They incorporated three distinct types of tests:
    • Minimum Functionality Test (MFT): A Minimum Functionality Test (MFT) uses simple examples to make sure the model can perform a specific task well. For example, we might want to test the performance of a sentiment model when dealing with negations.
    • Invariance Test: Besides testing the functionality of a model, we might also want to test if the model prediction stays the same when trivial parts of inputs are slightly perturbed. These tests are called Invariance Tests (IV).
    • Directional Expectation Test: In the Invariance Test, we expect the outputs after the perturbation to be the same. However, sometimes we might expect the output after perturbation to change. That is when Directional Expectation Tests come in handy.
  • In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • Github repo.
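  • A minimal sketch of an invariance test written without the checklist library (predict_sentiment is a hypothetical stand-in for the model under test): perturbing a person's name should not change a sentiment prediction.

```python
# Minimal CheckList-style invariance (INV) test; the "model" here is a toy stand-in.
def predict_sentiment(text):
    return "negative" if "terrible" in text else "positive"

def invariance_test(template, names):
    predictions = [predict_sentiment(template.format(name=name)) for name in names]
    passed = len(set(predictions)) == 1            # prediction should not depend on the name
    return passed, predictions

passed, preds = invariance_test("The flight with {name} was terrible.",
                                ["Anna", "Rahul", "Mei", "Kwame"])
print("INV test passed:", passed, preds)
```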
The Curious Case of Neural Text Degeneration
  • Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
  • This paper by Holtzman et al. from Choi’s lab at UWash in ICLR 2020 provided a deep analysis into the properties of the most common decoding methods for open-ended language generation. It reveals surprising distributional differences between human text and machine text.
  • In addition, they find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. They show that likelihood-maximizing decoding causes repetition and overly generic language usage, while sampling methods without truncation risk sampling from the low-confidence tail of a model’s predicted distribution. Their findings motivate Nucleus (or top-p) Sampling, a simple but effective method that captures the region of confidence of language models effectively to draw the best out of neural generation.
  • By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
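  • A minimal sketch of nucleus (top-p) sampling: sample only from the smallest set of tokens whose cumulative probability exceeds \(p\), renormalized (the vocabulary size and the value of \(p\) below are illustrative).

```python
# Minimal sketch of nucleus (top-p) sampling over a vector of next-token logits.
import torch

def nucleus_sample(logits, p=0.9):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int((torch.cumsum(sorted_probs, dim=-1) < p).sum()) + 1   # smallest set exceeding p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()      # renormalize the nucleus
    return sorted_idx[torch.multinomial(nucleus, 1)]                   # sample a token id

next_token = nucleus_sample(torch.randn(50))                           # toy vocabulary of 50 tokens
```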

2021

Towards Zero-Label Language Learning
  • This paper by Wang et al. from Google in 2021 explores “zero-label” learning in NLP, whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. They show that language models (LMs) are also few-shot generators or example creators (rather than just few-shot learners as in the GPT-3 paper) in that they can be used to generate high-quality synthetic data in a fully unsupervised manner. In other words, they propose that labelled-data generation is easy with prompting, that LMs are great few-shot data generators, and that classic fine-tuning greatly outperforms zero/few-shot prompting.
  • At the core of their framework is a novel approach for better leveraging powerful pretrained LMs. Specifically, inspired by the recent success of few-shot inference with GPT-3, they present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations.
  • Their method enables zero-label learning as they train task-specific models solely on the synthetic data, yet they achieve better or comparable results relative to strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, their approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
  • The paper illustrates a promising direction for future transfer learning research in NLP.
  • Key takeaways:
    • Old idea (from OpenAI’s GPT3 paper):
      • Treat LMs as few-shot learners.
      • Create prompts with <sample, label> pair(s).
      • Ask the model to infer the label for a new sample.
      • The emphasis is on the inference.
    • New idea (from Google’s zero-label paper):
      • Treat LMs as few-shot generators (rather than few-shot learners).
      • Create prompts with <sample, label> pair(s).
      • Ask the model to generate more for the same label.
      • The emphasis is on the labelled data generation (rather than inference).
    • Learnings:
      • Old idea created a new wave of prompt programming, i.e. no need for conventional task specific fine-tuning.
      • However, prompting can solve only lower-order tasks, e.g., classification or NLI. Even with lower-order tasks it is not practical because you cannot build a human-in-the-loop system to continually improve the model.
      • The new idea is about generating more data and going with conventional route.
      • This paper confirms all of the above by introducing UDG using LMs, even for complex higher-order tasks, and empirically shows that classical fine-tuning with more data works better.
  • A diagram by Prithvi Da summarizes the proposed approach.

Improving Language Models by Retrieving from Trillions of Tokens
  • This paper from Borgeaud et al. from DeepMind in 2021 proposes Retrieval-Enhanced Transformer (RETRO) which enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. With a 2 trillion token database, RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters.
  • After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
  • On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. They also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.
  • RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline pre-trained transformer models can be rapidly fine-tuned (“RETROfit with retrieval”) to obtain nearly the same performance as if trained from scratch.
  • They demonstrate at an unprecedented scale that improving semi-parametric language models through explicit memory can provide an approach orthogonal to, and more efficient than, raw parameter scaling as we seek to build more powerful language models.
  • Related: The Illustrated Retrieval Transformer by Jay Alammar.
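  • A toy sketch of the retrieval step: split a corpus into fixed-size chunks, embed them, and fetch the nearest-neighbour chunks for each query chunk. The hash-based "embedding" below is a stand-in for RETRO's frozen BERT retriever, and the chunk size and similarity search are illustrative simplifications (the real system uses an approximate nearest-neighbour index over trillions of tokens).

```python
# Toy sketch of chunked retrieval with a stand-in embedding function.
import numpy as np

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))   # deterministic toy embedding
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def build_index(corpus_tokens, chunk_size=64):
    chunks = [" ".join(corpus_tokens[i:i + chunk_size])
              for i in range(0, len(corpus_tokens), chunk_size)]
    return chunks, np.stack([embed(c) for c in chunks])

def retrieve(query_chunk, chunks, chunk_embs, k=2):
    scores = chunk_embs @ embed(query_chunk)                  # cosine similarity (unit vectors)
    return [chunks[i] for i in np.argsort(-scores)[:k]]       # top-k neighbouring chunks

chunks, embs = build_index(("the quick brown fox " * 200).split())
print(retrieve("quick brown fox jumps", chunks, embs, k=1))
```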
WebGPT: Browser-assisted question-answering with human feedback
  • This paper by Nakano et al. from OpenAI in 2021 proposes WebGPT, which is a fine-tuned version of GPT-3 to more accurately answer open-ended questions using a text-based web browser. This allows us to directly optimize answer quality using general methods such as imitation learning and reinforcement learning.
  • Their prototype copies how humans research answers to questions online: it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy.
  • By setting up the task so that it can be performed by humans, they are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
  • They train and evaluate their models on ELI5, a dataset of questions asked by Reddit users. Their best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of their human demonstrators, and 69% of the time to the highest-voted answer from Reddit. While their best model outperforms humans on ELI5, it still struggles with out-of-distribution questions.

2022

Formal Mathematics Statement Curriculum Learning
  • This paper by Polu et al. from OpenAI in 2022 proposes a neural theorem prover using GPT-f that can successfully solve a curriculum of increasingly difficult problems out of a set of formal statements of sufficiently varied difficulty, including many high-school Math Olympiad problems. The prover uses a language model to find proofs of formal statements.
  • They explore the use of expert iteration in the context of language modeling applied to formal mathematics. They show that at the same compute budget, expert iteration, by which they mean proof search interleaved with learning, dramatically outperforms proof search only. They also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.
  • Finally, by applying this expert iteration to a manually curated set of problem statements, they achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
  • Their results suggest that the lack of self-play in the formal mathematics setup can be effectively compensated for by automatically as well as manually curated sets of formal statements, which are much cheaper to formalize than full proofs. The statement curriculum learning methodology presented in this work can help accelerate progress in automated reasoning, especially if scaled with automated generation and curation of formal statements in the future.
  • OpenAI link.
Survey of Hallucination in Natural Language Generation
  • While natural language generation (NLG) has improved exponentially in recent years thanks to the development of deep learning technologies such as Transformer-based language models, large language model (LLM)-based NLG often produces false statements that are disconnected from reality because such models are not grounded in reality. Such generation includes hallucinated text, which makes the performance of text generation fail to meet users’ expectations in many real-world scenarios, owing to the lack of commonsense built from experiencing the real world.
  • This paper by Ji et al. from Pascale Fung’s group at Hong Kong University of Science and Technology in 2022 reviews studies in evaluation and mitigation methods of hallucinations that have been presented in various tasks.
  • They provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation.
Transformer Quality in Linear Time
  • This paper by Hua et al. from Cornell University and Google Brain in 2022 revisits the design choices in Transformers and proposes methods to address their weaknesses in handling long sequences, presenting FLASH, a novel efficient modification of the Transformer architecture. This is achieved by designing a performant layer (gated linear unit) and by combining it with an accelerator-efficient approximation strategy (mixed chunk attention).
  • Existing efficient attention methods often cause significant quality drop compared to full self-attention. At the same time they might be difficult to implement to fully leverage hardware accelerators. The authors introduce GAU (gated attention unit; a generalization of GLU - gated linear unit) that allows for better and more efficient approximation of multi-head attention than many other efficient attention methods by using a weaker single-head attention with minimal quality loss.
  • Next, complementary to this new layer, they propose mixed chunk attention - an efficient linear approximation method that combines the benefits from partial and linear attention mechanisms, which is accelerator-friendly and highly competitive in quality. The method works on chunks of tokens and leverages local (within chunk) and global (between chunks) attention spans.
  • The resulting model, named FLASH, when deployed on bidirectional and auto-regressive language modeling tasks, outperforms three baselines: vanilla Transformer, Performer and Combiner in terms of quality and efficiency. FLASH matches the quality (perplexity) of fully-augmented Transformers over both short (512) and long (8K) context lengths, while being substantially faster to train than the state-of-the-art - achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling. The differences are particularly pronounced for larger context sizes (4096-8192).

DeepNet: Scaling Transformers to 1,000 Layers

  • This paper by Wang et al. from Microsoft Research in 2022 introduces DeepNet, a new method that allows training extremely deep Transformers with 1,000+ layers – an order of magnitude deeper than existing efforts – with theoretical justification.
  • DeepNet is fundamental, effective, and simple. It can be used in any Transformer architecture (encoder, decoder, encoder-decoder), which covers almost all tasks across AI areas (language, vision, speech, multimodal, and beyond). It is not only for 1,000+ layer Transformers, but is also important and effective for training existing large models (e.g., 24 to 100 layers). It combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making it a preferred alternative for training any Transformer model.
  • At the core of DeepNet is a newly proposed normalization function (called DeepNorm) which modifies the residual connection in Transformers. DeepNorm comes with a theoretical justification of bounding the model update by a constant, which makes stable training possible in a principled way. Only a few lines of code need to change to make it work in an existing Transformer implementation.
  • DeepNorm modifies the residual connection in the Transformer architecture by up-scaling it before performing layer normalization. It works alongside a dedicated initialization scheme based on Xavier initialization.
  • These two tricks lead to greater stability during the training which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1000 layers.
  • DeepNet’s 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points in a multilingual machine translation task with 7,482 translation directions.
  • Github repo.
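  • A minimal sketch of a DeepNorm residual block, \(x_{l+1} = \mathrm{LN}(\alpha \cdot x_l + G_l(x_l))\); the paper derives the constant \(\alpha\) (and an initialization scale \(\beta\)) from the number of layers per architecture, so the value used below is illustrative only.

```python
# Minimal sketch of a DeepNorm residual block: up-scale the residual stream by alpha
# before layer normalization. alpha here is illustrative; see the paper for its derivation.
import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    def __init__(self, d_model, sublayer, alpha=1.0):
        super().__init__()
        self.sublayer = sublayer                  # e.g., self-attention or a feed-forward block
        self.norm = nn.LayerNorm(d_model)
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormBlock(512, nn.Linear(512, 512), alpha=2.0)
y = block(torch.randn(2, 10, 512))
```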
Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as arithmetic reasoning, math word problems, symbolic manipulation, and commonsense reasoning.
  • This paper by Wei et al. from Google in 2022 explores the ability of language models to generate a coherent chain of thought – a series of short sentences that mimic the reasoning process a person might have when responding to a question.
  • They have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, they find that chain of thought processing is an emergent property of model scale that can be induced via prompting and can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
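  • An illustrative chain-of-thought prompt in the style of the paper's exemplars: the few-shot demonstration includes the worked reasoning, so the model is nudged to emit its own reasoning before the final answer (language_model.generate is a hypothetical stand-in for an LLM API call).

```python
# Illustrative chain-of-thought prompt; the demonstration includes intermediate reasoning steps.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3
tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many
apples do they have?
A:"""
# completion = language_model.generate(prompt)   # hypothetical call to a large language model
```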
PaLM: Scaling Language Modeling with Pathways
  • This paper by Chowdhery et al. from Google in 2022 introduces Pathways Language Model (PaLM), a single 540 billion parameter dense Transformer language model, trained on 780B tokens of high-quality, diverse text, that generalizes across domains and tasks while being highly efficient. PaLM pushes the boundaries of scale for few-shot language understanding and generation.
  • Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application.
  • To further their understanding of the impact of scale on few-shot learning, they trained a 540-billion parameter, densely activated Transformer language model, which they call the Pathways Language Model (PaLM). They trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. They demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
  • On a number of these tasks, PaLM 540B achieves breakthrough few-shot performance on language, reasoning, and code tasks, achieving state-of-the-art results on 28 out of the 29 most widely evaluated English NLP tasks when compared to the best finetuned per-task result from any previous large language model. Their evaluation suite consists of multi-step reasoning tasks, and comparisons to average human performance on the recently released BIG-bench benchmark.
  • Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference. Their few-shot results match or exceed the finetuned state of the art across a number of different arithmetic and commonsense reasoning tasks. The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting, where the model is explicitly prompted to generate a natural language logical inference chain before making its prediction. They present a number of intriguing examples where PaLM was able to write explicit logical inference chains to both explain jokes and answer complex questions about scenarios. On BIG-bench, a recently developed benchmark containing 150+ challenging new language tasks, PaLM 5-shot achieves higher performance than the average performance score of humans who were asked to complete the same tasks. Additional state-of-the-art performance is demonstrated on source code understanding/generation, multilingual NLP, and machine translation.
  • From these results, they draw a number of conclusions.
    • First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued. When they compare results from PaLM 540B to their own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that the apex of the scaling curve has not yet been reached. However, a number of BIG-bench tasks showed discontinuous improvements from model scale: the improvements from 8B to 62B are very modest, but then increase steeply when scaling to 540B. This suggests that certain capabilities of language models only emerge when trained at sufficient scale, and there are additional capabilities that could emerge from future generations of models.
    • Second, the breakthrough performance on reasoning tasks has critical implications. It is obvious that a model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction. However, these results go far beyond that, demonstrating that prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves. In other words, the model’s generation (rather than just understanding) capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
  • Finally, although they achieved their goal of further pushing the boundaries of scale for few-shot language modeling, there are still many open questions about the ideal network architecture and training scheme for future generations of models. PaLM is only the first step in their vision towards establishing Pathways as the future of ML scaling at Google and beyond. To that end, they chose to demonstrate this scaling capability on a well-studied, well-established recipe: a dense, decoder-only, full-attention Transformer model, which is trained to perform autoregressive language modeling. However, their wider goal is to explore a diverse array of novel architectural choices and training schemes, and combine the most promising systems with the extreme scaling capabilities of Pathways.
  • They believe that PaLM lays a strong foundation for their ultimate goal of developing a large-scale, modularized system that will have broad generalization capabilities across multiple modalities.
  • They additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale.
  • Finally, they discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
  • Google AI blog.
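  • As a quick illustration of chain-of-thought prompting, a few-shot prompt simply includes worked reasoning before each final answer; the exemplar below is adapted from the chain-of-thought literature and is illustrative rather than taken from the PaLM paper.
```python
# Illustrative 1-shot chain-of-thought prompt: the exemplar shows its reasoning before the
# answer, nudging the model to generate an explicit inference chain before its prediction.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model imitates the step-by-step format."""
    return EXEMPLAR + f"Q: {question}\nA:"

print(build_cot_prompt("A juggler can juggle 16 balls. Half of the balls are golf balls. "
                       "How many golf balls are there?"))
```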

Speech

2006

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
  • Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited.
  • This paper by Graves et al. from Schmidhuber’s lab presents a novel method for temporal classification with RNNs to label unsegmented sequences directly, thereby solving both aforementioned problems. Their method fits naturally into the existing framework of neural network classifiers, and is derived from the same probabilistic principles. It obviates the need for pre-segmented data, and allows the network to be trained directly for sequence labelling.
  • An experiment on a real-world temporal classification problem with the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN without requiring any task-specific knowledge.
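  • As a quick illustration (not from the paper, which predates modern toolkits), the CTC objective is available off the shelf in PyTorch; the sketch below assumes a toy acoustic model emitting per-frame log-probabilities over characters plus a blank symbol, with purely illustrative dimensions.
```python
import torch
import torch.nn as nn

# Toy setup: T time steps, batch of N utterances, C output symbols (index 0 = CTC blank).
T, N, C = 50, 4, 28
acoustic_scores = torch.randn(T, N, C, requires_grad=True)   # stand-in for RNN outputs
log_probs = acoustic_scores.log_softmax(dim=-1)

# Unsegmented label sequences: no frame-level alignment is provided.
targets = torch.randint(low=1, high=C, size=(N, 10))          # 10 labels per utterance
input_lengths = torch.full((N,), T, dtype=torch.long)         # all frames are valid
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all alignments between the frame sequence and the label sequence.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow into the acoustic model; no pre-segmentation needed
```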

2010

Front-end factor analysis for speaker verification
  • This paper by Dehak et al. from JHU in IEEE/ACM Transactions on Audio, Speech, and Language Processing 2010 proposes a non-deep-learning method that uses Joint Factor Analysis (JFA) as a feature extractor to learn a low-dimensional speaker representation for speaker verification, which is also used to model session and channel effects/variabilities.
  • In this new space, a given speech utterance is represented by a new vector named total factors (called the identity-vector or the “i-vector”). The i-vector is thus a feature that represents the characteristics of the frame-level features’ distributive pattern. i-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It’s extracted in a similar manner to the eigenvoice adaptation scheme or the JFA technique, but is extracted per sentence (or input speech sample).
  • Two speaker verification systems are proposed which use this new representation. The first system is a Support-Vector-Machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. In this scoring, they removed the SVM from the decision process. One important characteristic of this approach is that there is no speaker enrollment, unlike in other approaches like SVM and JFA, which makes the decision process faster and less complex.
  • They achieved an EER of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. They also obtained 4% absolute EER improvement for both-gender trials on the 10sec-10sec condition compared to the classical joint factor analysis scoring.
  • Up until d-vectors, the state-of-the-art speaker verification systems were based on the concept of i-vectors (which use Probabilistic Linear Discriminant Analysis (PLDA) as a classifier to make the final decision).

2014

Towards End-To-End Speech Recognition with Recurrent Neural Networks
  • This paper by Graves and Jaitly at ICML 2014 (PMLR) presents a character-level speech recognition system that directly transcribes audio data to text using a recurrent neural network with minimal preprocessing, without requiring an intermediate phonetic representation.
  • The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and a modified Connectionist Temporal Classification (CTC) objective function that allows a direct optimization of the word error rate, even in the absence of a lexicon or language model. Further, they show how to integrate the network outputs with a language model during decoding.
  • The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7% and achieves state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition.
Deep neural networks for small footprint text-dependent speaker verification
  • This paper by Variani et al. from JHU, Google, and Biometric Recognition Group in 2014 investigates the use of deep neural networks (DNNs) to train speaker embeddings for a small footprint text-dependent speaker verification task. The DNN architecture is shown in the figure below.
  • During model training, the DNN takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and generates the one-hot speaker label (or the speaker probability) to classify speakers at the frame-level.
  • During speaker enrollment, the trained DNN is used to extract speaker-specific features/embeddings by averaging the activations from the last hidden layer (called deep-vectors or “d-vectors” for short), which is taken as the speaker model.
  • During speaker evaluation, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision by calculating the cosine distance between the test d-vector and the claimed speaker’s d-vector, similar to the i-vector framework. A verification decision is made by comparing the distance to a threshold.
  • Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the d-vectors are more robust to additive noise and outperform i-vectors at low False Rejection operating points. The combined (d+i)-vector system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions, respectively.
  • Note that unlike the i-vector framework, this doesn’t have any assumptions about the feature’s distribution (the i-vector framework assumes that the i-vector has a Gaussian distribution).
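  • The enrollment and verification flow described above can be sketched as follows; the embedding dimension, number of enrollment utterances, and decision threshold are illustrative rather than the paper’s settings.
```python
import numpy as np

def utterance_dvector(frame_activations):
    """Average (and L2-normalize) last-hidden-layer activations over frames."""
    d = frame_activations.mean(axis=0)
    return d / np.linalg.norm(d)

def enroll_speaker(enrollment_utterances):
    """Speaker model = average of the enrollment utterances' d-vectors."""
    model = np.mean([utterance_dvector(u) for u in enrollment_utterances], axis=0)
    return model / np.linalg.norm(model)

def verify(test_utterance, speaker_model, threshold=0.7):
    """Accept the identity claim if the cosine similarity exceeds a tuned threshold."""
    score = float(np.dot(utterance_dvector(test_utterance), speaker_model))
    return score >= threshold

# Toy usage with random "activations" of shape (num_frames, hidden_dim).
rng = np.random.default_rng(0)
enrollment = [rng.standard_normal((200, 256)) for _ in range(3)]
test = rng.standard_normal((180, 256))
print(verify(test, enroll_speaker(enrollment)))
```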

2015

Listen, Attend and Spell
  • This paper by Chan et al. from CMU and Google in 2015 presents Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.
  • LAS is based on the sequence-to-sequence framework, is trained end-to-end and has two main components: a listener (encoder) and a speller (decoder). The listener is a pyramidal RNN encoder that accepts filter bank spectra as inputs, transforms the input sequence into a high level feature representation and reduces the number of timesteps that the decoder has to attend to. The speller is an attention-based RNN decoder that attends to the high level features and spells out the transcript one character at a time.
  • The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. They bypass the conditional independence assumptions of CTC, and show how they can learn an implicit language model that can generate multiple spelling variants given the same acoustics. In other words, producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models.
  • To further improve the results, they used samples from the softmax classifier in the decoder as inputs to the next step prediction during training. Finally, they show how a language model trained on additional text can be used to rerank their top hypotheses.
  • On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
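  • A minimal sketch of the listener’s pyramidal time reduction (concatenating adjacent frames before each BLSTM layer); layer sizes are illustrative rather than the paper’s configuration.
```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenate pairs of adjacent frames, halving the time resolution."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, input_dim)
        b, t, d = x.shape
        x = x[:, : t - (t % 2)]                  # drop a trailing frame if time is odd
        x = x.reshape(b, t // 2, d * 2)          # stack adjacent frames -> half as many steps
        out, _ = self.blstm(x)
        return out                               # (batch, time // 2, 2 * hidden_dim)

# Three stacked pBLSTM layers reduce e.g. 1000 filter-bank frames to 125 attention targets.
feats = torch.randn(2, 1000, 40)
listener = nn.Sequential(PyramidalBLSTMLayer(40, 256),
                         PyramidalBLSTMLayer(512, 256),
                         PyramidalBLSTMLayer(512, 256))
print(listener(feats).shape)  # torch.Size([2, 125, 512])
```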

2017

CNN Architectures for Large-Scale Audio Classification
  • This paper by Hershey et al. from Google in ICASSP 2017 presents VGGish, applying various state-of-the-art image-classification CNN architectures to audio and showing that they are capable of excellent results on audio classification when compared to a simple fully connected network or earlier image classification architectures.
  • They examine a fully connected baseline alongside CNN architectures such as AlexNet, VGG, Inception, and ResNet. The input audio is divided into non-overlapping 960 ms frames, which are decomposed by applying the Fourier transform, resulting in a spectrogram. The spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed. This gives log-mel spectrogram patches that are passed as input to all classifiers (see the frontend sketch below). They explore the effects of training with different sized subsets of the 70M training videos (5.24 million hours) with 30,871 labels.
  • While their dataset contains video-level labels, they are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet. They find that a model for AED with embeddings learned from these classifiers does much better than raw features on the Audio Set AED classification task.
  • They find that derivatives of image classification networks do well on the audio classification task, that increasing the number of labels we train on provides some improved performance over subsets of labels, that performance of models improves as we increase training set size, and that a model using embeddings learned from the video-level task do much better than a baseline on the AudioSet classification task.
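  • A minimal sketch of this log-mel frontend, assuming librosa and VGGish-style frame/hop parameters (the paper’s exact settings may differ slightly).
```python
import numpy as np
import librosa

def log_mel_patches(audio, sr=16000, n_mels=64, patch_frames=96):
    """Compute non-overlapping log-mel spectrogram patches (~960 ms each) as classifier inputs."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms hop
        n_mels=n_mels)
    log_mel = np.log(mel + 1e-2)      # log-transform the magnitudes
    n_patches = log_mel.shape[1] // patch_frames
    return np.stack([log_mel[:, i * patch_frames:(i + 1) * patch_frames]
                     for i in range(n_patches)])

patches = log_mel_patches(np.random.randn(16000 * 10).astype(np.float32))
print(patches.shape)  # (num_patches, 64, 96)
```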

2018

X-Vectors: Robust DNN Embeddings for Speaker Recognition
  • This paper by Snyder et al. from JHU in ICASSP 2018 uses data augmentation to improve the performance of deep neural network (DNN) embeddings for speaker recognition.
  • The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings called x-vectors.
  • While prior studies have found that embeddings leverage large-scale training datasets better than i-vectors, it can be challenging to collect substantial quantities of labeled data for training. They use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness.
  • Their data augmentation strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. They use the simulated RIRs described by Ko et al., and the reverberation itself is performed with the multicondition training tools in the Kaldi ASpIRE recipe. For additive noise, they use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres, and 60 hours of speech from twelve languages (a minimal sketch of these operations follows this entry).
  • A PLDA classifier is used in the x-vector framework to make the final decision, similar to i-vector systems.
  • The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese where they achieve superior performance on the evaluation datasets.
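  • A minimal sketch of the two augmentation operations (additive noise at a random SNR and reverberation via RIR convolution); the actual recipe uses MUSAN noises, simulated RIRs, and Kaldi tooling rather than the toy signals below.
```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise recording into speech at a target signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Simulate reverberation by convolving speech with a room impulse response."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = add_noise(clean, rng.standard_normal(16000), snr_db=rng.uniform(0, 15))
reverberant = add_reverb(clean, rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800))
```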
WaveGlow: A Flow-based Generative Network for Speech Synthesis
  • This paper by Prenger et al. from NVIDIA in 2018 proposes WaveGlow, a flow-based network capable of generating high quality speech from mel-spectrograms.
  • WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
  • Their PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
  • This paper by Shen et al. from Google in 2018 describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
  • Their model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
  • To validate their design choices, they present ablation studies of key components of their system and evaluate the impact of using mel-spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features.
  • They further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
  • PyTorch hub

2019

wav2vec: Unsupervised Pre-training for Speech Recognition
  • Reducing the need for manually annotated data is important for developing systems that understand non-English languages, particularly those with limited existing training sets of transcribed speech.
  • This paper by Schneider et al. from Facebook AI in 2019 introduces wav2vec, the first application of unsupervised pre-training to speech recognition using a fully convolutional model that learns representations of raw, unlabeled audio.
  • Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. They pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.
  • Wav2vec trains models to learn the difference between original speech examples and modified versions, often repeating this task hundreds of times for each second of audio, and predicting the correct audio milliseconds into the future.
  • This self-supervised approach beats traditional ASR systems that rely solely on transcribed audio. Their experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Their approach achieves 2.43% WER on the nov92 test set, outperforming Deep Speech 2 (Amodei et al., 2016), the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
  • They show that more data for pre-training improves performance and that this approach not only improves resource-poor setups, but also settings where all WSJ training data is used.
  • Facebook AI article.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
  • This paper by Park et al. from Google in 2019 presents SpecAugment, a simple data augmentation method for speech recognition.
  • SpecAugment greatly improves the performance of ASR networks. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. They apply SpecAugment on Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.
  • They achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple handcrafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and they are able to gain performance by using bigger networks and training longer. On LibriSpeech, they achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
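  • A minimal NumPy sketch of the frequency- and time-masking policies (time warping is omitted for brevity); mask widths and counts are illustrative, whereas the paper defines per-dataset policies.
```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_mask_param=10,
                 num_time_masks=2, time_mask_param=40):
    """Mask random blocks of frequency channels and time steps of a (n_mels, T) log-mel spectrogram."""
    aug = log_mel.copy()
    n_mels, n_frames = aug.shape
    fill = aug.mean()                                  # replace masked regions with the mean value
    for _ in range(num_freq_masks):
        f = np.random.randint(0, freq_mask_param + 1)  # mask width in mel channels
        f0 = np.random.randint(0, max(1, n_mels - f))
        aug[f0:f0 + f, :] = fill
    for _ in range(num_time_masks):
        t = np.random.randint(0, time_mask_param + 1)  # mask width in frames
        t0 = np.random.randint(0, max(1, n_frames - t))
        aug[:, t0:t0 + t] = fill
    return aug

augmented = spec_augment(np.random.randn(80, 300))
```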

2020

Conformer: Convolution-augmented Transformer for Speech Recognition
  • Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.
  • This paper by Gulati et al. from Google in Interspeech 2020 achieves the best of both worlds by integrating components from both CNNs and Transformers for end-to-end speech recognition to model both local and global dependencies of an audio sequence in a parameter-efficient way.
  • To this end, they propose the convolution-augmented Transformer for speech recognition, named Conformer (a simplified sketch of the block structure follows this entry). Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, the Conformer model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. They also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
  • They studied the importance of each component, and demonstrated that the inclusion of convolution modules is critical to the performance of the Conformer model.
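  • A simplified sketch of the macaron-style Conformer block; unlike the paper, it omits relative positional encodings and the gated linear unit in the convolution module.
```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Macaron Conformer block: 1/2 FFN -> self-attention -> conv module -> 1/2 FFN -> LayerNorm."""

    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=1),                                     # pointwise
            nn.Conv1d(d, d, kernel_size=kernel, padding=kernel // 2, groups=d), # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, kernel_size=1))                                     # pointwise
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                                  # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)                         # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global, content-based interactions
        c = self.conv_norm(x).transpose(1, 2)              # (batch, d, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)               # local feature modeling
        x = x + 0.5 * self.ffn2(x)                         # second half-step feed-forward
        return self.final_norm(x)

out = ConformerBlockSketch()(torch.randn(2, 100, 256))
```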
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
  • This paper by Baevski et al. from Facebook AI in NeurIPS 2020 shows for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
  • Wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
  • Compared to wav2vec, wav2vec 2.0 learns basic speech units used to tackle a self-supervised task. The model is trained to predict the correct speech unit for masked parts of the audio, while at the same time learning what the speech units should be.
  • Experiments using all labeled data of LibriSpeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. With just 10 minutes of transcribed speech and 53K hours of unlabeled speech, wav2vec 2.0 enables speech recognition models at a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
  • This opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy.
  • They have also developed a cross-lingual approach, dubbed XLSR, that can learn speech units common to several languages. This approach helps when we have even small amounts of unlabeled speech, since languages for which we have little data can benefit from languages for which more data is available.
  • Github repo; Facebook AI article.
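  • A didactic sketch of the masked contrastive (InfoNCE-style) objective described above, with distractors drawn from other masked time steps of the same utterance; the actual fairseq implementation additionally uses a codebook diversity loss and operates on batches.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, num_distractors=100, temperature=0.1):
    """Identify the true quantized latent for each masked time step among sampled distractors.

    context:    (T, D) Transformer outputs; quantized: (T, D) quantized latent targets;
    masked_idx: 1-D tensor of masked time indices.
    """
    losses = []
    for t in masked_idx.tolist():
        others = masked_idx[masked_idx != t]                                  # distractor positions
        distractors = others[torch.randint(len(others), (num_distractors,))]
        candidates = torch.cat([quantized[t:t + 1], quantized[distractors]])  # true target first
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

T, D = 200, 256
ctx, q = torch.randn(T, D, requires_grad=True), torch.randn(T, D)
mask = torch.randperm(T)[:120]                    # indices of masked time steps
contrastive_loss(ctx, q, mask).backward()
```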
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
  • Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion.
  • This paper by Su et al. from Princeton and Adobe Research in 2020 introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio.
  • They use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. HiFi-GAN relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
  • The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models.
  • This paper by Kong et al. from Kakao Enterprise in NeurIPS 2020 proposes HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, they demonstrate that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality (see the reshaping sketch at the end of this entry).
  • HiFi-GAN outperforms the best performing publicly available models in terms of synthesis quality, even comparable to human level. Moreover, it shows a significant improvement in terms of synthesis speed. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that the proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.
  • They took inspiration from the characteristic of speech audio that consists of patterns with various periods and applied it to neural networks, and verified that the existence of the proposed discriminator greatly influences the quality of speech synthesis through the ablation study.
  • HiFi-GAN shows the ability to generalize to the mel-spectrogram inversion of unseen speakers and to synthesize speech audio comparable to human quality from noisy inputs in an end-to-end setting. In addition, their small footprint model demonstrates comparable sample quality to the best publicly available autoregressive counterpart, while generating samples an order of magnitude faster than real-time on CPU. This shows progress towards on-device natural speech synthesis, which requires low latency and a small memory footprint.
  • Finally, their experiments show that the generators of various configurations can be trained with the same discriminators and learning mechanism, which indicates the possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators.
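  • A minimal sketch of the period-based reshaping used by the paper’s multi-period discriminator, which views the 1-D waveform as a 2-D array so that equally spaced (periodic) samples fall in the same column.
```python
import torch
import torch.nn.functional as F

def reshape_by_period(waveform, period):
    """Reshape a waveform of shape (batch, 1, T) into (batch, 1, T // period, period)."""
    b, c, t = waveform.shape
    if t % period != 0:
        waveform = F.pad(waveform, (0, period - t % period), mode="reflect")
        t = waveform.shape[-1]
    return waveform.view(b, c, t // period, period)

audio = torch.randn(4, 1, 22050)
views = {p: reshape_by_period(audio, p) for p in (2, 3, 5, 7, 11)}   # the paper's prime periods
print({p: tuple(v.shape) for p, v in views.items()})
```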
GAN-based Data Generation for Speech Emotion Recognition
  • This paper by Eskimez et al. from Microsoft in Interspeech 2020 proposes a GAN-based method to generate synthetic data in the form of speech emotion spectrograms, which can be used for training speech emotion recognition networks. Specifically, they investigate the usage of GANs for capturing the data manifold when the data is eyes-off, i.e., where networks can be trained using the data but the data cannot be copied from the clients.
  • They propose a CNN-based GAN with spectral normalization on both the generator and discriminator, both of which are pre-trained on large unlabeled speech corpora. They show that their method provides better speech emotion recognition performance than a strong baseline.
  • They proposed to use GANs for modeling imbalanced and highly skewed data among clients for future use, even after the original data is removed.
  • Furthermore, they show that even after the data on the client is lost, their model can generate similar data that can be used for model bootstrapping in the future. Although they evaluated their method for speech emotion recognition, it can be applied to other tasks.
Unsupervised Cross-lingual Representation Learning at Scale
  • This paper by Conneau et al. from Facebook AI in ACL 2020 shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
  • They train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.
  • Their model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER.
  • XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models.
  • They also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale.
  • Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks.
  • Facebook AI post.

2021

Generative Spoken Language Modeling from Raw Audio
  • This paper by Lakhotia et al. from Facebook AI in 2021 introduces Generative Spoken Language Modeling which learns speech representations from CPC, Wav2Vec2.0, and HuBERT for synthesizing speech.
  • They introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. They set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), they find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
  • Facebook AI post.
Text-Free Prosody-Aware Generative Spoken Language Modeling
  • Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored.
  • This paper by Kharitonov et al. from Facebook AI in 2021 builds upon Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) which addresses the generative aspects of speech pre-training, by replacing text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech.
  • In this work, they present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
  • They devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
  • Facebook AI post.
  • Github repo
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  • This paper by Polyak et al. from Facebook AI in Interspeech 2021 proposes using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, they separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner.
  • They analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, they evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation.
  • Lastly, they demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, they can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
  • Facebook AI post.
  • Github repo

2022

Direct speech-to-speech translation with discrete units
  • This paper by Lee et al. from Facebook AI in 2022 presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
  • They tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech.
  • When target text transcripts are available, they design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
  • Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, S2ST’s performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of their system for translation between unwritten languages.
  • Audio samples
Textless Speech Emotion Conversion using Discrete and Decomposed Representations
  • Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity.
  • This paper by Kreuk et al. from Facebook AI in 2021 casts the problem of emotion conversion as a spoken language translation task. They use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
  • First, they modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units.
  • Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows them to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc.
  • They demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. They rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
  • Facebook AI post
  • Github repo
Generative Spoken Dialogue Language Modeling
  • This paper by Nguyen et al. from Facebook AI in 2022 introduces dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels.
  • It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking.
  • Facebook AI post
  • Github repo
textless-lib: a Library for Textless Spoken Language Processing
  • Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources.
  • This paper by Kharitonov et al. from Facebook AI in 2022 introduces textless-lib, a PyTorch-based library aimed at facilitating research in this area. They describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation.
  • They believe that textless-lib substantially simplifies research in the textless setting and will be useful not only for speech researchers but also for the NLP community at large.
  • Facebook AI post
  • Github repo

Multimodal

2016

“Why Should I Trust You?” Explaining the Predictions of Any Classifier
  • Trust is crucial for effective human interaction with machine learning systems, and explaining individual predictions is important in assessing trust.
  • This paper by Ribeiro et al. from Guestrin’s lab at UWash in 2016 proposes LIME, a novel model-agnostic, modular, and extensible explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction. They further introduce SP-LIME, a method to explain models by selecting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem and providing a global view of the model to users.
  • They demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects.
  • Their explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, getting insights into predictions, and detecting why a classifier should not be trusted.
  • LIME - Local Interpretable Model-Agnostic Explanations blog post.
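  • A minimal usage sketch with the lime package on a scikit-learn tabular classifier; the dataset and model choices are illustrative.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification")

# Explain one prediction: LIME perturbs the instance and fits a sparse linear model locally.
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # top features and their local weights
```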

2017

A Unified Approach to Interpreting Model Predictions
  • While various methods have recently been proposed to help users interpret the predictions of complex models, it is often unclear how these methods are related and when one method is preferable over another.
  • This paper by Lundberg and Lee from UWash in NeurIPS 2017 seeks to address this problem and presents a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations).
  • SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties.
  • The new class unifies six existing methods, which is notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, they present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
  • Github repo.
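  • A minimal usage sketch with the shap package on a scikit-learn tree ensemble; the dataset and model choices are illustrative.
```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

# TreeExplainer computes Shapley values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Each prediction decomposes into a baseline (expected value) plus per-feature contributions.
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```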

2021

Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
  • All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
  • This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 compares two data-centric AI methods for improving performance on golden utterances: improving the annotation quality of existing training utterances, and augmenting the training data with varying amounts of synthetic data.
  • Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.
Learning Transferable Visual Models From Natural Language Supervision
  • This paper by Radford et al. from OpenAI introduces CLIP, a pre-training task which efficiently learns visual concepts from natural language supervision. CLIP uses separate vision and language encoders and a contrastive loss to bring similar image-text pairs closer while pulling apart dissimilar pairs as part of pretraining.
  • CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
  • CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in its dataset. This behavior can then be used to turn CLIP into a zero-shot classifier: all of a dataset’s classes are converted into captions such as “a photo of a dog”, and CLIP predicts the class of the caption it estimates best pairs with a given image (see the sketch below).
  • It can rival the generalization of ImageNet SoTA models (since it was pretrained on 400M image and noisy text pairs) and is thus typically used for zero-shot image classification and zero-shot cross-modal searches.
  • OpenAI article.
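  • A minimal zero-shot classification sketch, assuming the Hugging Face transformers CLIP wrappers and an arbitrary local image file.
```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # any image of your choosing
captions = [f"a photo of a {c}" for c in ["dog", "cat", "car"]]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)                               # encodes both modalities
probs = outputs.logits_per_image.softmax(dim=-1)        # similarities -> class probabilities
print(dict(zip(captions, probs[0].tolist())))
```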
Zero-Shot Text-to-Image Generation
  • Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training.
  • This paper by Ramesh et al. from OpenAI introduces DALL-E which offers a simple approach for text-to-image generation based on an autoregressive transformer which models the text and image tokens as a single stream of data. DALL-E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively.
  • They find that sufficient data and scale can lead to improved generalization, both in terms of zero-shot performance relative to previous domain-specific approaches, and in terms of the range of capabilities that emerge from a single generative model. Their findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task.
  • OpenAI article.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
  • This paper by Kim et al. from NAVER AI and Kakao in 2021 introduces Vision-and-Language Transformer (ViLT) that seeks to improve performance on various joint vision-and-language downstream tasks using Vision-and-Language Pre-training (VLP).
  • CLIP and Hugging Face’s VisionEncoderDecoder keep the image and language encoders as separate towers and align/glue them using either (i) a cross-entropy loss that utilizes cross-attention (in the case of VisionEncoderDecoder) or (ii) a contrastive loss (in the case of CLIP). This is shown in the figure below from Prithvi Da which summarizes the approaches.

  • The downside of the above approach is poor image-text alignment, a huge data appetite, and longer training time. This approach is useful for creating a downstream generative model to tackle applications such as cross-modal retrieval, say OCR, image captioning, content-based image retrieval (CBIR), or even text2image (using DALL-E or CLIPDraw). However, there are derived/advanced multimodal tasks involving vision and language, such as Natural Language for Visual Reasoning (NLVR), Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and Visual Navigation, that are much more complicated in nature than the aforementioned tasks. The diagram below from Prithvi Da summarizes the hierarchy of image-based tasks.

  • In order to tackle derived tasks in a similar way, we need to train on image and language data jointly (rather than in isolation) in a “mixed-modal” fashion with a combination of an image-level loss, a language-level loss, and an alignment loss. This is the underlying idea behind VLP. The diagram below from Prithvi Da compares the two approaches: aligning/gluing independently trained vision and language encoders (with either a cross-entropy loss or a contrastive loss) vs. training both encoders jointly.

  • Current approaches to VLP heavily rely on image feature extraction processes using convolutional visual embedding networks (e.g., Faster R-CNN and ResNets), which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). This is problematic in terms of both efficiency/speed, in that extracting input features requires much more computation than the multimodal interaction steps; and expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary.
  • ViLT seeks to remedy the above two issues by presenting a minimal VLP model, which is monolithic in that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. In other words, the unique selling point of ViLT is that while most VLP models rely on object detectors, CNNs or transformers for feature extraction (for e.g., UNiTER, LXMERT and VisualBERT need Faster-RCNN for object detection), ViLT stands out of the crowd by removing the need for object detectors. ViLT accomplishes this by avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer projection and achieves similar results with reduced complexity, as shown in the diagram below:

  • Self-supervision is accomplished using (i) an Image Text Matching (ITM) loss and (ii) a Masked Language Modeling (MLM) loss. ITM is an alignment loss that encompasses cross-modality interaction between image and text, and it requires positive and negative pairs (see the sketch below). For text, ViLT simply reuses the Masked Language Modeling (MLM) objective used in BERT.
  • ViLT is pre-trained on four datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions. They evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, they use VQAv2 and NLVR2; for retrieval, they use MSCOCO and Flickr30K (F30K).
  • Finally, they show that ViLT is over 10x faster than previous VLP models, yet with competitive or better downstream task performance.
  • The key takeaway in this paper is that VLP needs to focus more on the multi-modality interactions aspect inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders. ViLT-B/32 is a proof of concept that efficient VLP models free of convolution and region supervision can still be competent.
  • Github repo with code and pre-trained weights; HuggingFace docs; ViLT tutorials/notebooks.
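  • A didactic sketch of the ITM objective referenced above (the MLM part follows BERT); the pooled embedding, head, and labels below are stand-ins rather than ViLT’s actual implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def itm_loss(pooled_pair_embeddings, itm_head, is_matched):
    """Image-Text Matching: binary classification on the pooled multimodal embedding.

    Negative pairs are typically built by pairing an image with the text of another example.
    """
    logits = itm_head(pooled_pair_embeddings)             # (batch, 2)
    return F.cross_entropy(logits, is_matched)

batch, d = 8, 768
pooled = torch.randn(batch, d, requires_grad=True)        # stand-in for the transformer's pooled output
head = nn.Linear(d, 2)
labels = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])            # 1 = matched pair, 0 = mismatched
loss = itm_loss(pooled, head, labels)                      # added to the MLM loss during pre-training
loss.backward()
```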
MLIM: Vision-and-language Model Pre-training With Masked Language and Image Modeling
  • Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks are used for Masked Image-Region Modeling (MIRM). Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models.
  • This paper by Arici et al. from Amazon in 2021 presents Masked Language and Image Modeling (MLIM) for VLP. MLIM is pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: a Masked Language Modeling (MLM) loss (as in BERT) for text and an image reconstruction (RECON) loss for images, coupled with Modality Aware Masking (MAM). MAM determines the masking probability and applies masking to both word and image embeddings. The MLM task follows BERT: predict the masked words from the available words and image regions using a two-layer MLP MLM head that outputs logits over the vocabulary, with the MLM loss being the negative log-likelihood of the masked words. The RECON loss is an average of the pixel-wise sum of squared errors (SSE). Both image and word masking are realized by replacing an embedding with the embedding of [MASK]. This way, the transformer layers recognize [MASK]’s embedding as a special embedding that needs to be “filled in”, independent of the modality, by attending to the other vectors in the layer inputs.
  • Note that unlike other architectures (LXMERT, UNiTER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, MLIM uses a shallow CNN as an image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking friendly. The MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
  • MLIM uses no specific alignment loss, but instead proposes Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, they present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
  • Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-modality by employing Modality Dropout (MDO). MDO improves fine-tuning by randomly dropping one of the modalities. Similar to MAM, MDO operates in one of three modes on a micro-batch: text-only, image-only, and image-text mode.
  • The authors also tried using the ITM loss proposed in ViLT. However, RECON instead of ITM loss offers better PR AUC. Similarly, using the ITM loss together with MLM and RECON does not change the performance.
  • The key takeaways from this paper are that MLIM is a simplified VLP method using MLM and RECON losses and MAM. They simplify loss function design, propose a shallow CNN-based image embedder to avoid heavyweight object detectors, and present an image decoder to enable the RECON loss. They believe VLP datasets (e.g., e-commerce datasets) are large enough to enable learning built-in image embedders during pre-training. While alignment-based loss functions are promising and help in learning contrastive features, finding good image-text pairs (especially negative pairs) becomes an issue and makes pre-training rely on pairing techniques. On the other hand, finer-grained objectives such as alignment and MIRM do not have ground truth: Masked Image-Region Modeling (MIRM) relies on RoI features and classes predicted by the object detector, and MIRM tasks aim to “fill in” masked regions. In contrast, the proposed RECON task aims to reconstruct the whole image and is designed to get the best cross-modality interaction inside the transformer.
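  • A didactic sketch of Modality Aware Masking with a shared [MASK] embedding and a pixel-wise SSE RECON loss; the masking probabilities and shapes are illustrative, not the paper’s schedule.
```python
import torch

def modality_aware_masking(text_emb, image_emb, mask_emb, p_text=0.15, p_image=0.5):
    """Pick a masking mode per (micro-)batch -- text-only, image-only, or both -- and replace
    the selected word/patch embeddings with a shared [MASK] embedding."""
    mode = int(torch.randint(0, 3, (1,)))                    # 0: text-only, 1: image-only, 2: both
    text_mask = torch.rand(text_emb.shape[:2]) < (p_text if mode != 1 else 0.0)
    image_mask = torch.rand(image_emb.shape[:2]) < (p_image if mode != 0 else 0.0)
    masked_text = torch.where(text_mask.unsqueeze(-1), mask_emb, text_emb)
    masked_image = torch.where(image_mask.unsqueeze(-1), mask_emb, image_emb)
    return masked_text, masked_image, text_mask, image_mask

def recon_loss(reconstructed, original):
    """Average over the batch of the pixel-wise sum of squared errors (SSE)."""
    return ((reconstructed - original) ** 2).flatten(1).sum(dim=1).mean()

d = 768
text, image = torch.randn(4, 32, d), torch.randn(4, 196, d)   # word and image-patch embeddings
masked_text, masked_image, _, _ = modality_aware_masking(text, image, torch.randn(d))
print(masked_text.shape, masked_image.shape)
print(recon_loss(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)))
```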

2022

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
  • While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
  • This paper by Baevski et al. from Facebook in 2022 helps get us closer to general self-supervised learning by presenting data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self distillation setup using a standard Transformer architecture.
  • Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
  • Today’s self-supervised learning research almost always focuses on a single modality. As a result, researchers specializing in one modality often adopt a totally different strategy than those specializing in another. In the case of text, researchers train algorithms to fill in blanks in sentences. Speech models, on the other hand, must learn an inventory of basic speech sounds, for example by forecasting missing sounds. In computer vision, models are frequently taught to assign comparable representations to a color image of a cow and the same image flipped upside down, allowing them to correlate the two far more closely than they would with an unrelated image like a duck. data2vec represents a new paradigm of holistic self-supervised learning, in which further research can improve multiple modalities at once rather than just one.
  • For each modality, algorithms anticipate distinct units: pixels or visual tokens for images, words for text, and learned sound inventories for speech. Because a collection of pixels differs significantly from an audio waveform or a passage of text, algorithm development has been tied to a specific modality, meaning that algorithms in each modality continue to work differently. data2vec simplifies this by teaching models to predict their own representations of the input data, regardless of modality. By focusing on these representations (the layers of a neural network) instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This eliminates the learning task’s reliance on modality-specific targets. It also doesn’t use contrastive learning or reconstructed input examples.
  • To directly predict representations across modalities, it was necessary to define a robust normalization of the target features that would be reliable for different modalities. The method starts by computing target representations from an image, a piece of text, or a speech utterance using a teacher network. After that, a portion of the input is masked and passed through a student network, which predicts the teacher’s latent representations. Even though it only has a partial view of the data, the student model must predict accurate representations of the full input. The teacher network is identical to the student network, except with somewhat out-of-date weights (see the sketch below).
  • The method was tested on the primary ImageNet computer vision benchmark, where it outperformed existing approaches for a variety of model sizes. On speech, it surpassed wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised speech algorithms. For text, it was evaluated on the popular GLUE benchmark suite and came out on par with RoBERTa, a robustly optimized reimplementation of BERT.
  • Facebook AI link; Github; Marktechpost article.
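  • A didactic sketch of the teacher-student setup described above: an EMA teacher provides contextualized targets for masked positions and the student regresses them; the real model averages the top-K teacher layers and uses a Transformer rather than the linear stand-in below.
```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Teacher weights track the student as an exponential moving average."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1 - decay)

def data2vec_step(student, teacher, inputs, mask):
    """Student sees a masked view and regresses the teacher's latent targets at masked positions."""
    with torch.no_grad():
        targets = teacher(inputs)                 # teacher encodes the full, unmasked view
    masked_inputs = inputs.clone()
    masked_inputs[mask] = 0.0                     # stand-in for a learned [MASK] embedding
    preds = student(masked_inputs)
    return F.smooth_l1_loss(preds[mask], targets[mask])

# Toy usage with a linear "encoder" standing in for a Transformer.
student = torch.nn.Linear(64, 64)
teacher = copy.deepcopy(student)
x = torch.randn(8, 100, 64)
mask = torch.rand(8, 100) < 0.15
data2vec_step(student, teacher, x, mask).backward()
ema_update(teacher, student)
```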

Hierarchical Text-Conditional Image Generation with CLIP Latents
  • In January 2021, OpenAI introduced DALL-E. A year later, their newest system, DALL-E 2, generates more realistic and accurate images with 4x greater resolution, better caption matching and photorealism.
  • Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
  • This paper by Ramesh et al. from OpenAI in 2022 proposes DALL-E 2, which leverages these representations for image generation via a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder (dubbed “unCLIP”) that generates an image conditioned on the image embedding.
  • They show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
  • Their decoder, which is conditioned on image representations, can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation.
  • They use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
  • OpenAI article.
AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
  • Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives.
  • This paper by Zhang et al. from Google in 2022 addresses these problems by proposing AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. They use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target hardware. Experiments on TPUv4i find seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
  • On downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) in terms of average GLUE score. On SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
A Generalist Agent
  • This paper by Reed et al. from DeepMind in 2022 proposes Gato, a single generalist agent beyond the realm of text outputs, inspired by progress in large-scale language modeling.
  • Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
  • The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data from different tasks and modalities, it is serialized into a flat sequence of tokens. In this representation, Gato can be trained and sampled from akin to a standard large-scale language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.
  • Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196.
  • Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. The authors envision that in the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.
  • DeepMind page.
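  • To make the serialization concrete, here is a small, hedged sketch of how continuous observations and actions could be turned into discrete tokens and interleaved into one flat sequence with a loss mask over target outputs. The exact constants (mu-law parameters, 1024 bins, an offset placing continuous tokens after the text vocabulary) follow the paper’s description but should be treated as assumptions here.

```python
import numpy as np

def mu_law_encode(x, mu=100.0, m=256.0):
    """Squash continuous values toward [-1, 1] before uniform binning."""
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(x, n_bins=1024, vocab_offset=32_000):
    """Map continuous observations/actions to token ids in a range disjoint
    from the text vocabulary."""
    x = np.clip(mu_law_encode(x), -1.0, 1.0)
    bins = np.floor((x + 1.0) / 2.0 * (n_bins - 1)).astype(np.int64)
    return bins + vocab_offset

# One training example: interleave observation and action tokens into a flat
# sequence; only the action (target) positions contribute to the loss.
obs_tokens = tokenize_continuous(np.random.randn(8))   # e.g., proprioception
act_tokens = tokenize_continuous(np.random.randn(2))   # e.g., joint torques
sequence = np.concatenate([obs_tokens, act_tokens])
loss_mask = np.concatenate([np.zeros(8, dtype=bool), np.ones(2, dtype=bool)])
```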
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
  • Recent text-to-image generation methods provide a simple yet exciting conversion capability between the text and image domains. While these methods have incrementally improved the generated image fidelity/quality and text relevancy (i.e., the adherence of generated images to the text), several pivotal gaps remain open, limiting applicability and quality.
  • This paper by Gafni et al. from Meta AI in 2022 proposes a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case.
  • While some methods propose image editing techniques, progress is not often directed towards enabling new forms of human creativity and experiences. They attempt to progress text-to-image generation towards a more interactive experience, where people can perceive more control over the generated outputs, thus enabling real-world applications such as storytelling.
  • In addition to improving the general image quality, they focus on improving key image aspects that are significant in human perception, such as faces and salient objects, resulting in higher favorability of their method in human evaluations and objective metrics.
  • Their model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512 × 512 pixels, significantly improving visual quality. Through scene controllability, they introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story they wrote.
i-Code: An Integrative and Composable Multimodal Learning Framework
  • Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities.
  • This paper by Yang et al. from Microsoft in 2022 presents i-Code, a self-supervised pretraining framework which jointly learns representations for vision, language and speech into a unified, shared and general-purpose vector representation.
  • In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning.
  • They show that pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to. i-Code can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space.
  • Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
  • The figure below from the paper shows the overall model architecture of i-Code. Shown on the right is the attention and feed-forward operation in a fusion network layer with (a) merge-attention layers and (b) co-attention layers. To facilitate more effective cross-modality understanding and design the best fusion architecture, they explore two variations of the traditional attention mechanism: mechanisms that merge and cross the attention scores of different modalities, namely merge-attention (based on self-attention) and co-attention (based on self- and cross-attention) respectively. Note that for simplicity, only the residual connection of the language modality is drawn, but all three modalities use residual connections.

RecSys

2015

Collaborative Deep Learning for Recommender Systems
  • Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse.
  • This paper by Wang et al. from HKU addresses this problem by generalizing recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and proposing a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

2016

Wide & Deep Learning for Recommender Systems
  • Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. However, memorization and generalization are both important for recommender systems. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.
  • This paper by Cheng et al. from Google in 2016 introduced Wide & Deep learning – jointly trained wide linear models and deep neural networks – to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. In other words, the fusion of wide and deep models combines the strengths of memorization and generalization and provides better recommendation systems. The two components are trained jointly with the same loss function.
  • They productionized and evaluated the system on Google Play Store, a massive-scale commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.
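  • A minimal PyTorch sketch of the joint architecture, under assumed feature sizes (not the production system): the wide part is a linear model over cross-product features, the deep part is an MLP over learned embeddings, and their logits are summed and trained with a single loss.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_wide_features, n_sparse_ids, emb_dim=32, hidden=(256, 128)):
        super().__init__()
        # Wide part: a linear model over (hashed) cross-product features.
        self.wide = nn.Linear(n_wide_features, 1)
        # Deep part: embeddings for sparse ids, followed by an MLP.
        self.embedding = nn.Embedding(n_sparse_ids, emb_dim)
        layers, in_dim = [], emb_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.deep = nn.Sequential(*layers)

    def forward(self, wide_x, sparse_ids):
        # sparse_ids: (batch, n_fields) -> average embeddings across fields.
        deep_in = self.embedding(sparse_ids).mean(dim=1)
        logit = self.wide(wide_x) + self.deep(deep_in)   # joint training, shared loss
        return torch.sigmoid(logit)
```

  • In the paper, the wide part is optimized with FTRL and the deep part with AdaGrad, but a single optimizer over the summed logit is enough to illustrate the idea.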

2017

DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
  • Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering.
  • This paper by Guo et al. from Harbin Institute of Technology and Huawei in 2017 proposes DeepFM, an end-to-end learning model that emphasizes both low- and high-order feature interactions. DeepFM is a Factorization-Machine (FM) based neural network for CTR prediction, designed to overcome the shortcomings of the state-of-the-art models and to achieve better performance. DeepFM trains an FM component and a deep component jointly, modeling low-order feature interactions through the FM and high-order feature interactions through the DNN. Unlike Google’s Wide & Deep model, DeepFM can be trained end-to-end with a shared input to its “wide” and “deep” parts, with no need for feature engineering besides raw features.
  • DeepFM gains performance improvement from these advantages: 1) it does not need any pre-training; 2) it learns both high- and low-order feature interactions; 3) it introduces a sharing strategy of feature embedding to avoid feature engineering.
  • DeepFM thus combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.
  • Extensive experiments were conducted on two real-world datasets (the Criteo dataset and a commercial App Store dataset) to compare the effectiveness and efficiency of DeepFM and the state-of-the-art models. The results demonstrate that 1) DeepFM outperforms the state-of-the-art models in terms of AUC and Logloss on both datasets; 2) the efficiency of DeepFM is comparable to the most efficient deep model among the state-of-the-art.
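  • The sketch below (an illustrative PyTorch reimplementation, not the authors’ code) shows the key design point: the FM and deep components read from the same embedding table, so no separate feature engineering is needed for the “wide” side.

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    """Sketch of DeepFM: the FM and deep parts share one embedding table."""
    def __init__(self, field_sizes, emb_dim=16, hidden=(128, 64)):
        super().__init__()
        n_ids = sum(field_sizes)
        self.first_order = nn.Embedding(n_ids, 1)      # FM first-order weights
        self.embedding = nn.Embedding(n_ids, emb_dim)  # shared latent vectors
        layers, in_dim = [], len(field_sizes) * emb_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.deep = nn.Sequential(*layers, nn.Linear(in_dim, 1))

    def forward(self, feat_ids):                       # feat_ids: (batch, n_fields)
        emb = self.embedding(feat_ids)                 # (batch, n_fields, emb_dim)
        # FM: first-order term + pairwise interactions via the "square of sum" trick.
        fm_1st = self.first_order(feat_ids).sum(dim=1)
        square_of_sum = emb.sum(dim=1).pow(2)
        sum_of_square = emb.pow(2).sum(dim=1)
        fm_2nd = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        # Deep: MLP over the concatenated field embeddings.
        deep_out = self.deep(emb.flatten(start_dim=1))
        return torch.sigmoid(fm_1st + fm_2nd + deep_out)
```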

2019

Deep Learning Recommendation Model for Personalization and Recommendation Systems
  • With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood.
  • This paper by Naumov et al. from Facebook in 2019 proposes a state-of-the-art deep learning recommendation model (DLRM). They open-source its implementation in both PyTorch and Caffe2 frameworks.
  • The DLRM model handles continuous (dense) and categorical (sparse) features that describe users and products. DLRM exercises a wide range of hardware and system components, such as memory capacity and bandwidth, as well as communication and compute resources as shown in the figure below.

  • Furthermore, they design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers.
  • Compared to other DL-based approaches to recommendation, DLRM differs in two ways. First, it computes the feature interactions explicitly while limiting the order of interaction to pairwise interactions. Second, DLRM treats each embedded feature vector (corresponding to categorical features) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit that should yield different cross terms. These design choices help reduce computational/memory cost while maintaining competitive accuracy.
  • They compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation, system co-design, and benchmarking.
  • Facebook AI post.
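  • A small PyTorch sketch of the explicit pairwise interaction described above (illustrative, not the open-source DLRM code): dot products are taken between the bottom-MLP output for dense features and each categorical embedding, treating each embedding vector as a single unit.

```python
import torch

def dlrm_interact(dense_feat, sparse_embs):
    """Pairwise (second-order) interactions, DLRM-style.

    dense_feat:  (batch, d)      output of the bottom MLP over dense features
    sparse_embs: (batch, n, d)   one d-dim embedding per categorical feature
    Returns the input to the top MLP: dense vector + all pairwise dot products.
    """
    # Treat the processed dense vector as one more "feature" of dimension d.
    feats = torch.cat([dense_feat.unsqueeze(1), sparse_embs], dim=1)  # (batch, n+1, d)
    # All pairwise dot products between feature vectors.
    dots = torch.bmm(feats, feats.transpose(1, 2))                    # (batch, n+1, n+1)
    # Keep only the strictly lower triangle (each unordered pair once).
    i, j = torch.tril_indices(dots.size(1), dots.size(2), offset=-1)
    pairwise = dots[:, i, j]                                          # (batch, (n+1)n/2)
    return torch.cat([dense_feat, pairwise], dim=1)
```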

Core ML

1991

What Every Computer Scientist Should Know About Floating-Point Arithmetic
  • This gem by Goldberg et al. from Oracle in the 1991 issue of ACM Computing Surveys helps demystify common misconceptions about computer arithmetic and enables you to write more careful numerical code.

2007

What Every Programmer Should Know About Memory
  • This must-read paper by Drepper from Red Hat in 2007 offers a detailed treatment on how system memory works.

2011

SMOTE: Synthetic Minority Over-sampling Technique
  • This paper by Chawla et al. from University of South Florida introduces an approach to the construction of classifiers from imbalanced datasets.
  • A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class.
  • This paper shows that a combination of their method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. It also shows that this combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes.
  • Their method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier.
  • The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
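  • A minimal NumPy/scikit-learn sketch of the synthetic-example step (illustrative; the imbalanced-learn package provides a full SMOTE implementation): each synthetic point is a random interpolation between a minority example and one of its k nearest minority-class neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority_X, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between each
    minority example and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    _, idx = nn.kneighbors(minority_X)          # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority_X))       # pick a random minority sample
        j = idx[i, rng.integers(1, k + 1)]      # and one of its k neighbors
        gap = rng.random()                      # interpolate a random fraction
        synthetic.append(minority_X[i] + gap * (minority_X[j] - minority_X[i]))
    return np.asarray(synthetic)
```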

2012

Improving neural networks by preventing co-adaptation of feature detectors
  • This paper by Hinton et al. in 2012 introduced Dropout as a way to avoid overfitting.
  • When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This overfitting is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.
  • Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
  • Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
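  • A tiny NumPy sketch of the mechanism, using the now-common “inverted” formulation (the original paper instead scales weights at test time): units are randomly zeroed during training and the survivors are rescaled so expected activations match inference.

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so that expected activations match those at test time."""
    if not training or p_drop == 0.0:
        return x                        # at test time the full network is used
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)
```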

2014

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
  • This paper by Srivastava et al. from Hinton’s lab in JMLR 2014 introduced Dropout, which (just like Batchnorm) is now part of the standard recipe for regularizing deep neural nets.

2015

ADAM: A Method for Stochastic Optimization
  • This paper by Kingma and Ba in ICLR 2015 introduces Adam (derived from adaptive moment estimation), an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.
  • It is a fusion of RMSProp with momentum and involves calculating the exponentially weighted moving averages of the first and second moments of the gradient (which are governed by the hyperparameters \(\beta_1\) and \(\beta_2\) respectively).
  • The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. They also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, they discuss AdaMax, a variant of Adam based on the infinity norm.
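  • A minimal NumPy sketch of the update rule with bias correction (per-parameter state; hyperparameter defaults follow the paper):

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient (first moment) and squared
    gradient (second moment), with bias correction for the zero-initialized EMAs."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# State initialization for a parameter vector w:
# state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
```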
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • This paper by Ioffe and Szegedy from Google in ICML 2015 introduced BatchNorm, which is now commonly implemented to accelerate training of deep neural nets.
  • Also, check out this in-depth article on BatchNorm here.

2016

XGBoost: A Scalable Tree Boosting System
  • This paper by Chen and Guestrin from UWash in 2016 proposes eXtreme Gradient Boost (XGBoost), a scalable end-to-end tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems.
  • They propose a novel sparsity aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate tree learning.
  • Their experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well.
  • By combining these insights, XGBoost is able to solve real-world-scale problems using far fewer resources than existing systems.
Layer Normalization
  • Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks.
  • This paper by Ba et al. from Hinton’s lab in 2016 transposes batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, they also give each neuron its own adaptive bias and gain, which are applied after the normalization but before the non-linearity.
  • Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step.
  • Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, they show that layer normalization can substantially reduce the training time compared with previously published techniques.
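  • A minimal NumPy sketch of the computation described above, normalizing each training case over its feature dimension (no batch statistics) and then applying a per-neuron gain and bias:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer normalization for x of shape (batch, features);
    gain and bias have shape (features,)."""
    mean = x.mean(axis=-1, keepdims=True)     # statistics per training case
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias
```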

2017

On Calibration of Modern Neural Networks
  • Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced.
  • This paper by Guo et al. from Cornell University in ICML 2017 discovers that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, they observe that model capacity (in terms of depth and width), weight decay (regularization), and Batch Normalization are important factors that affect calibration even as they improve accuracy.
  • They evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets.
  • They suggest that simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling – a single-parameter variant of Platt Scaling – is the simplest, fastest, and most straightforward of the methods at calibrating predictions.
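  • A short PyTorch sketch of temperature scaling (illustrative, not the authors’ code): a single temperature T is fit on held-out validation logits by minimizing NLL; dividing logits by T does not change the argmax, so accuracy is unaffected.

```python
import torch

def fit_temperature(logits, labels, max_iter=50):
    """Learn a single temperature T > 0 on validation logits so that
    softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)       # parameterize T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)     # NLL of temperature-scaled logits
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```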

2018

Model Cards for Model Reporting
  • Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, the authors recommend that released models be accompanied by documentation detailing their performance characteristics.
  • This paper by Mitchell et al. from Google and UofT proposes a framework called model cards to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
  • While they focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, they provide cards for two supervised models: one trained to detect smiling faces in images, and one trained to detect toxic comments in text. They propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works.

2021

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
  • Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval, and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, and resources required to train have all increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. Training and deploying models involves making either implicit or explicit decisions about efficiency.
  • This paper by Menghani from Google Research in 2021 motivates the problem of efficiency in deep learning, followed by a thorough survey of the seminal work in core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware). They lay out a mental model for readers to wrap their heads around the multiple focus areas of model efficiency and optimization, thereby offering the reader an opportunity to understand the state of the art, apply these techniques in the modeling process, and/or use them as a starting point for exploration.
  • They also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. They believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support.
  • Finally, they present a section of explicit and actionable insights supplemented by code, for a practitioner to use as a guide in this space. This section will hopefully give concrete and actionable takeaways, as well as tradeoffs to think about when optimizing a model for training and deployment.

2022

Pathways: Asynchronous Distributed Dataflow for ML
  • This paper by Barham et al. from Google in MLSys 2022 presents the design of Pathways, a new large scale orchestration layer for accelerators. Pathways is explicitly designed to enable exploration of new systems and ML research ideas, while matching state-of-the-art multi-controller performance on current ML models which are single-tenant SPMD.
  • Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. Pathways upends the execution model of JAX programs, pulling user code back into a single-controller model, and interposing a centralized resource management and scheduling framework between client and accelerators. The single-controller programming model allows users simple access to much richer computation patterns. The resource management and scheduling layer permits the reintroduction of cluster management policies including multi-tenant sharing, virtualization and elasticity, all tailored to the requirements of ML workloads and accelerators.
  • Their micro-benchmarks show interleaving of concurrent client workloads and efficient pipelined execution, convincingly demonstrating that the system mechanisms they have built are fast and flexible, and form a solid basis for research into novel policies to make use of them. They demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
  • They have shown that careful system design and engineering lets them “get the best of both worlds”, matching performance on today’s ML models while delivering the features needed to write the models of tomorrow.
PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
  • Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets.
  • This paper by Leng et al. in ICLR 2022 proposes a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions, motivated by how functions can be approximated via Taylor expansion. Under polynomial expansion, focal loss is a horizontal shift of the polynomial coefficients compared to the cross-entropy loss. Motivated by this new insight, they explore an alternative dimension, i.e., vertically modifying the polynomial coefficients.
  • PolyLoss provides flexible ways of changing the loss function shape by adjusting the polynomial coefficients depending on the target tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases.
  • Extensive experimental results show that the optimal choice within the PolyLoss is indeed dependent on the task and dataset.
  • By simply adjusting the leading polynomial coefficient with just one extra hyperparameter and adding one line of code, the Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.
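  • That one-line change is the Poly-1 loss, sketched below in PyTorch: the extra hyperparameter \(\epsilon\) weights the leading \((1 - p_t)\) term, and its best value is task-dependent.

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, labels, epsilon=1.0):
    """Poly-1 loss: cross-entropy plus an extra weight on the leading
    polynomial term (1 - p_t), where p_t is the probability of the true class."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()
```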

RL

2022

Transdreamer: Reinforcement Learning With Transformer World Models
  • The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so.
  • This paper by Chen et al. from Rutgers and KAIST in 2022 proposes TransDreamer, a transformer-based MBRL agent. They first introduce the Transformer State-Space Model (TSSM), the first transformer-based stochastic world model that leverages a transformer for dynamics predictions. Then, they share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent.
  • TransDreamer shows comparable performance with Dreamer on DMC and Atari tasks that do not require long-term memory. However, when the proposed model is applied to Hidden Order Discovery, involving both 2D visual RL and 3D first-person visual RL, which require long-range memory access for memory-based reasoning (i.e., long-term complex memory interactions), the proposed model outperforms Dreamer on these complex tasks.
  • They also show that the image generation and reward prediction of TSSM are better than Dreamer’s, both qualitatively and quantitatively.

Selected Papers

Vision

2015

Learning Deep Features for Discriminative Localization
  • This paper by Zhou et al. from MIT CSAIL in 2015 is an explainable-AI method that seeks to answer what vision models “see” in images. They propose Class Activation Maps (CAM) which is a nifty visualization technique originally introduced for CNNs where the predicted class score is mapped back to the previous convolutional layer to generate the CAM. The CAM highlights the class-specific discriminative regions used by CNN to identify the category or class.
  • They revisit the global average pooling layer proposed earlier, and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. This enables classification-trained CNNs to learn to perform object localization, without using any bounding box annotations. While this technique was previously proposed as a means for regularizing training, they find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
  • Furthermore, they demonstrate that the CAM localization technique generalizes to other visual recognition tasks, i.e., it produces generic localizable deep features that can aid other researchers in understanding the basis of discrimination used by CNNs for their tasks.
  • Later, there were several variants of similar explainable-AI methods (such as GradCAM, Saliency Maps and Integrated Gradients) that were introduced.
  • Despite the apparent simplicity of global average pooling, they are able to achieve 37.1% top-5 error for weakly supervised object localization on the ILSVRC 2014 benchmark, demonstrating that global average pooling CNNs can perform accurate object localization. Note that this is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach.
  • They demonstrate that their network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
  • Unrelated to the paper but a similar approach for vision transformers was recently proposed. CNN uses pixel arrays, whereas ViT splits the images into patches, i.e., visual tokens. The visual transformer divides an image into fixed-size patches, correctly embeds each of them, and includes positional embedding as an input to the transformer encoder. So CAM will indicate what regions of the image the [CLS] token will use to discriminate between classes. Usage example.
  • The figure below from Prithvi Da summarizes the approach using ViT; the same approach is applicable to other vision-based transformers such as DeiT, BEiT, etc.
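  • A minimal NumPy sketch of CAM itself (illustrative): the feature maps of the final convolutional layer are weighted by the classifier weights of the chosen class (the weights learned on top of global average pooling) and summed over channels.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM for one image.

    feature_maps: (C, H, W)       activations of the final conv layer
    fc_weights:   (n_classes, C)  weights of the linear layer after global average pooling
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)                      # keep class-positive evidence
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1] for display
```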

2016

Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
  • This paper by Luo et al. from UofT in NeurIPS 2016 studied the characteristics of the receptive field of units in CNNs and introduced the concept of effective receptive field.

2017

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  • This paper by Carreira and Zisserman from Google in CVPR 2017 introduced a new two-stream Inflated 3D ConvNet (I3D) architecture that incorporated both optical flow and RGB paths by inflating filters and pooling kernels of very deep image classification ConvNets from 2D to 3D, making it possible to learn seamless spatio-temporal feature extractors from video.
Densely Connected Convolutional Networks
  • This paper by Huang et al. from Cornell, Tsinghua, and FAIR in CVPR 2017 proposed DenseNet, which skip-connects all layers; the main difference from ResNets is that it performs concatenation-based skip connections instead of addition-based skip connections (as in ResNet).
  • The core idea behind DenseNet is feature reuse, which leads to very compact models. As a result, it requires fewer parameters than other CNNs, as there are no repeated feature maps.
  • They work around two concerns:
    • The feature maps have to be of the same size.
    • The concatenation with all the previous feature maps may result in memory explosion.
  • To address the first issue they propose two solutions:
    • Use conv layers with appropriate padding that maintain spatial dimensions (as in InceptionNet) or
    • Use dense skip connectivity only inside blocks called dense blocks.
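  • A compact PyTorch sketch of a dense block following the above (illustrative; the growth rate and layer count are assumptions): padding keeps spatial dimensions fixed so feature maps can be concatenated, and concatenation stays confined to the block.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenation of all previous feature maps
    and contributes `growth_rate` new channels."""
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),  # padding keeps H, W fixed
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # concatenation, not addition
            features.append(out)
        return torch.cat(features, dim=1)
```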

2018

Neural Discrete Representation Learning
  • This paper by Oord et al. from DeepMind in NeurIPS 2018 proposed the Vector Quantised-Variational AutoEncoder (VQ-VAE), a simple yet powerful generative model that combines VAEs with vector quantisation (VQ) to obtain discrete latent representations.
  • VQ-VAE differs from VAEs in two key ways: (i) the encoder network outputs discrete, rather than continuous, codes, and (ii) the prior is learnt rather than static.
  • In order to learn a discrete latent representation, they incorporate ideas from VQ. Using the VQ method allows the model to circumvent issues of “posterior collapse” – where the latents are ignored when they are paired with a powerful autoregressive decoder – typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
  • They show that VQ-VAEs are capable of modelling very long-term dependencies through their compressed discrete latent space, which they demonstrate by generating 128 x 128 colour images, sampling action-conditional video sequences, and finally using audio, where even an unconditional model can generate surprisingly meaningful chunks of speech and perform speaker conversion. All these experiments demonstrate that the discrete latent space learnt by VQ-VAEs captures important features of the data in a completely unsupervised manner.
  • Moreover, VQ-VAEs achieve likelihoods that are almost as good as their continuous latent variable counterparts on CIFAR10 data. They believe that this is the first discrete latent variable model that can successfully model long-range sequences and, in a fully unsupervised manner, learn high-level speech descriptors that are closely related to phonemes.
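  • A minimal PyTorch sketch of the quantization step (illustrative, not the authors’ code): encoder outputs are snapped to their nearest codebook entries, the codebook and commitment losses pull codes and encoder outputs toward each other, and a straight-through estimator copies gradients from the decoder input back to the encoder.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbor lookup into a learned codebook with a straight-through
    gradient estimator.

    z_e:      (batch, d)   continuous encoder outputs (flattened spatially)
    codebook: (K, d)       embedding vectors
    """
    distances = torch.cdist(z_e, codebook)          # squared-ish distances to all codes
    indices = distances.argmin(dim=1)
    z_q = codebook[indices]                         # quantized latents

    codebook_loss = F.mse_loss(z_q, z_e.detach())   # move codes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())     # keep encoder close to chosen codes
    vq_loss = codebook_loss + beta * commit_loss

    # Straight-through: copy gradients from the decoder input z_q back to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, vq_loss
```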

2019

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
  • This paper by Tan and Le from Google in ICML 2019 introduced EfficientNet which is all about engineering and scale. It proves that if you carefully design your architecture you can achieve top results with reasonable parameters. It’s incredible that EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152 with better accuracy!
  • Ideas from the paper:
    • With more layers (depth), one can capture richer and more complex features, but such models are hard to train (due to vanishing gradients).
    • Wider networks are much easier to train. They tend to be able to capture more fine-grained features but saturate quickly.
    • By training with higher resolution images, CNNs are able to capture more fine-grained details. Again, the accuracy gain diminishes for quite high resolutions.
    • Instead of finding the best architecture, the authors proposed to start with a relatively small baseline model and gradually scale up network depth (more layers), width (more channels per layer), resolution (input image) simultaneously using a technique called compound scaling that they propose.
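  • A tiny sketch of compound scaling; the constants \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\) are the values the paper reports from its grid search on the base network, and the code itself is illustrative.

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Grow depth, width, and resolution together with one coefficient phi.
    The constants satisfy alpha * beta^2 * gamma^2 ~ 2, so each increment of
    phi roughly doubles the FLOPs."""
    depth_mult = alpha ** phi       # more layers
    width_mult = beta ** phi        # more channels per layer
    resolution_mult = gamma ** phi  # larger input images
    return depth_mult, width_mult, resolution_mult

# e.g., one step up from the baseline network:
print(compound_scaling(1))   # (1.2, 1.1, 1.15)
```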

2020

Taming Transformers for High-Resolution Image Synthesis
  • Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. The authors demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
  • This paper by Esser et al. from the Heidelberg Collaboratory for Image Processing in 2020 proposed VQGAN which addresses the fundamental challenges that previously confined transformers to low-resolution images. VQGAN shows how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images.
  • VQGAN represents images as a composition of perceptually rich image constituents and thereby overcomes the infeasible quadratic complexity when modeling images directly in pixel space. Their approach uses a convolutional generator to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method introduces the efficiency of convolutional approaches to transformer based high resolution image synthesis.
  • Modeling constituents with a CNN architecture and their compositions with a transformer architecture taps into the full potential of their complementary strengths, allowing VQGAN to present the first results on high-resolution image synthesis with a transformer-based architecture.
  • VQGAN is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image.
  • VQGAN demonstrates the efficiency of convolutional inductive biases and the expressivity of transformers by performing semantically-guided synthesis of megapixel images and outperforming state-of-the-art convolutional approaches and autoregressive models on class-conditional ImageNet.
  • Code and pretrained models can be found here.

Self-training with Noisy Student improves ImageNet classification
  • This paper by Xie et al. from Google and CMU in CVPR 2020 introduced teacher-student training. The paper proposed an iterative semi-supervised method using 300M unlabeled images called “noisy student training”, which can be described in 4 steps (a toy sketch follows the list):
    • Train a teacher model on labeled images.
    • Use the teacher to generate labels on 300M unlabeled images (pseudo-labels).
    • Train a student model on the combination of labeled images and pseudo labeled images.
    • Iterate from step 1, by treating the student as a teacher. Re-infer the unlabeled data and train a new student from scratch.
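  • The toy, runnable sketch below uses logistic regression in place of the paper’s large CNNs and input jitter in place of dropout/RandAugment noise; it only illustrates the loop structure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def noisy_student(X_lab, y_lab, X_unlab, n_iterations=3, noise_std=0.1, seed=0):
    """Toy stand-in for noisy student training with a simple classifier."""
    rng = np.random.default_rng(seed)
    teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)     # step 1
    for _ in range(n_iterations):
        pseudo_y = teacher.predict(X_unlab)                           # step 2: pseudo-labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo_y])
        X_noisy = X_all + noise_std * rng.standard_normal(X_all.shape)  # noise on the student
        student = LogisticRegression(max_iter=1000).fit(X_noisy, y_all)  # step 3
        teacher = student                                             # step 4: iterate
    return teacher
```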
Big Transfer (BiT): General Visual Representation Learning
  • This paper by Kolesnikov et al. from Google in ECCV 2020 introduced BiT which is a scalable ResNet-based model for efficient image pre-training.
  • They develop 3 BiT models (small, medium, and large) based on ResNet-152. For the large variation of BiT they used ResNet152x4, which means that each layer has 4 times more channels. They pretrained that model using far larger datasets than ImageNet. Specifically, the largest model was trained on the insanely large JFT dataset, which consists of 300M labeled images.
  • The major contribution in the architecture is the choice of normalization layers – the authors replace batch normalization with group normalization and weight standardization.
Multi-modal Dense Video Captioning
  • This paper by Iashin and Rahtu from Tampere University in CVPR Workshops 2020 introduced multi-modal dense video captioning.
Efficient Saliency Maps for Explainable AI
  • This paper by Mundhenk et al. from Lawrence Livermore National Lab and UC Berkeley in 2020 describes an explainable AI saliency map method for use with deep CNNs that is much more efficient than popular fine-resolution gradient methods. It is also quantitatively similar or better in accuracy.
  • Their technique works by measuring information at the end of each network scale which is then combined into a single saliency map. They describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. They visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods.
  • Using their method instead of Guided Backprop, coarse-resolution class activation methods such as Grad-CAM and GradCAM++ seem to yield demonstrably superior results without sacrificing speed. This will make fine-resolution saliency methods feasible on resource limited platforms such as robots, cell phones, low-cost industrial devices, astronomy and satellite imagery.

2021

Finetuning Pretrained Transformers into RNNs
  • This paper by Kasai et al. from UWash, Microsoft, DeepMind, and Allen AI in 2021 presented an idea of converting pre-trained transformers into RNNs, lowering memory cost while retaining high accuracy.
  • SyncedReview’s article.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
  • This paper by Akbari et al. from Google, Columbia, and Cornell in 2021 explored learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Furthermore, they also study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities.
Self-supervised learning for fast and scalable time series hyper-parameter tuning
  • This paper by Zhang et al. from Facebook in 2021 proposed a new self-supervised learning framework for model selection and hyperparameter tuning, which provides accurate forecasts with less computational time and resources.
Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More
  • This paper by Daghaghi et al. from Rice University in MLSys 2021 presented a CPU algorithm using locality sensitive hashing that trains deep neural networks up to 15 times faster than top GPU trainers.
Emerging Properties in Self-Supervised Vision Transformers
  • This paper by Caron et al. from Facebook in 2021 proposed DINO, a self-supervised vision transformer-based model that can discover and segment objects in an image or a video with absolutely no supervision and without being given a segmentation-targeted objective.
  • DINO works by interpreting self-supervision as a special case of self-distillation, where no labels are used at all. DINO is trained as a student network by simply matching the output of a teacher network over different views of the same image. By discovering object parts and shared characteristics across images, DINO learns a feature space that organizes itself in an interpretable way, with similar categories landing near one another. This suggests that DINO managed to connect categories based on visual properties, a bit like humans do.
  • TechCrunch’s article and Facebook AI article.
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples
  • This paper by Assran et al. from Facebook in 2021 proposed PAWS, which combines some of the ideas of semi-supervised learning with the more traditional supervised method, essentially giving the training a boost by letting it learn from both labeled and unlabeled data.
  • PAWS is a method for semi-supervised learning that builds on the principles of self-supervised distance-metric learning. PAWS pre-trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled image are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and labeled representations is used to provide a weighting over class labels, which they interpret as a soft pseudo-label. By non-parametrically incorporating labeled samples in this way, PAWS extended the distance-metric loss used in self-supervised methods such as BYOL and SwAV to the semi-supervised setting.
Enhancing Photorealism Enhancement
  • This paper by Richter et al. from Intel Labs in 2021 proposed an approach to enhancing the realism of synthetic images using a convolutional network that leverages intermediate representations produced by conventional rendering pipelines. The network is trained via a novel adversarial objective, which provides strong supervision at multiple perceptual levels.
  • The authors analyzed scene layout distributions in commonly used datasets and find that they differ in important ways. They hypothesize that this is one of the causes of strong artifacts that can be observed in the results of many prior methods. To address this, they propose a new strategy for sampling image patches during training.
  • Intel Lab’s article with sample A/B results and videos from the paper. Also, The Verge’s article on the idea.
FNet: Mixing Tokens with Fourier Transforms
  • This paper by Lee-Thorp et al. from Google in 2021 proposed replacing the self-attention sublayers with simple linear transformations that “mix” input tokens to significantly speed up the transformer encoder with limited accuracy cost.
  • More surprisingly, the team discovers that replacing the self-attention sublayer with a standard, unparameterized Fourier Transform achieves 92 percent of the accuracy of BERT on the GLUE benchmark, with training times that are seven times faster on GPUs and twice as fast on TPUs.
  • SyncedReview’s article on the idea.
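  • The parameter-free mixing sublayer is small enough to show in full; a minimal PyTorch sketch (illustrative) applies a 2D discrete Fourier transform over the sequence and hidden dimensions and keeps the real part.

```python
import torch

def fourier_mixing(x):
    """FNet-style token mixing: replace self-attention with a 2D DFT over the
    sequence and hidden dimensions, keeping only the real part. No learned parameters.

    x: (batch, seq_len, hidden)
    """
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
```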
Are Convolutional Neural Networks or Transformers more like human vision?
  • This paper by Tuli et al. from Princeton University, DeepMind, and UC Berkeley explored the extent to which different vision models correlate with human vision from an error consistency point-of-view. They conclude that the recently proposed Vision Transformer (ViT) networks not only outperform CNNs on accuracy for image classification tasks, but also have higher shape bias and are largely more consistent with human errors.
RegNet: Self-Regulated Network for Image Classification
  • The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradients flow through building blocks, the simple shortcut connection mechanism limits the ability to re-explore new, potentially complementary features due to the additive function.
  • This paper by Xu et al. in 2021 from Harbin Institute of Technology, University of Electronic Science and Technology of China, Singapore Management University, and Sichuan University addresses this issue by proposing a regulator module as a memory mechanism to extract complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., Convolutional LSTMs or Convolutional GRUs), which are shown to be good at extracting spatio-temporal information. They named the new regulated networks as RegNet.
  • The regulator module can be easily implemented and appended to any ResNet architecture. They also apply the regulator module to improve the Squeeze-and-Excitation ResNet, showing the generalization ability of their method. Experimental results on three image classification datasets demonstrate the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.
Multiscale Vision Transformers
  • This paper by Fan et al. from Facebook AI and UC Berkeley presents Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
  • Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity/feature complexity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
  • They evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms single-scale vision transformers for video and image recognition that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters. They further remove the temporal dimension and apply the model to image classification, where it outperforms prior work on vision transformers. In empirical evaluation, MViT shows a fundamental advantage over single-scale vision transformers for video and image recognition.
  • Github repo.

Text

2015

Effective Approaches to Attention-based Neural Machine Translation
  • This paper by Luong et al. from Manning’s lab in EMNLP 2015 described a few more attention models that offer improvements and simplifications compared to Bahdanau attention.
  • They describe a few “global attention” models, the distinction between them being the way the attention scores are calculated.
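  • A minimal PyTorch sketch of global attention with two of the paper’s scoring functions (“dot” and “general”); the shapes and the bilinear matrix W are illustrative assumptions.

```python
import torch

def luong_global_attention(decoder_state, encoder_states, score="dot", W=None):
    """Global attention over all encoder states.

    decoder_state:  (batch, d)
    encoder_states: (batch, src_len, d)
    W:              (d, d) bilinear matrix, only used by the "general" score
    """
    if score == "dot":
        scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)
    elif score == "general":
        scores = torch.bmm(encoder_states, (decoder_state @ W).unsqueeze(2)).squeeze(2)
    else:
        raise ValueError("unsupported scoring function in this sketch")
    weights = torch.softmax(scores, dim=1)                      # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states)   # (batch, 1, d)
    return context.squeeze(1), weights
```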

2018

Generating Wikipedia by Summarizing Long Sequences
  • This paper by Liu et al. from Google Brain in ICLR 2018 shows that generating English Wikipedia articles can be approached as a multi-document summarization problem with a large, parallel dataset, and demonstrated a two-stage extractive-abstractive framework for carrying it out. They perform coarse extraction by using extractive summarization to identify salient information in the first stage and a neural decoder-only sequence transduction model for the abstractive stage, capable of handling very long input-output examples.
  • For the abstractive model, they introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction, allowing them to condition on many reference documents and to generate fluent and coherent multi-sentence paragraphs and even whole Wikipedia articles.
  • When given reference documents, they show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.

2019

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  • While BERT and RoBERTa have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS), they require that both sentences be fed into the network, which causes a massive computational overhead. Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.
  • This paper by Reimers and Gurevych from Technische Universität Darmstadt in 2019 presented Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
  • They showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. In fact, the performance for seven STS tasks was below the performance of average GloVe embeddings.
  • SBERT fine-tunes BERT in a siamese/triplet network architecture. They evaluated the quality on various common benchmarks, where it achieves a significant improvement over state-of-the-art sentence embedding methods. Replacing BERT with RoBERTa did not yield a significant improvement in their experiments.
  • They evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods while being computationally efficient: on a GPU, SBERT is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks which are computationally infeasible to model with BERT, such as clustering of 10,000 sentences with hierarchical clustering (BERT needs 65 hours, while SBERT needs about 5 seconds).
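  • For illustration, the snippet below shows how SBERT-style embeddings are typically used for similarity search via the sentence-transformers library; the specific model checkpoint is an assumption, not the one from the paper.
```python
# Each sentence is mapped independently to a fixed-size vector, so pairwise
# similarity search reduces to cheap cosine comparisons.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint for illustration

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings, embeddings)  # (3, 3) similarity matrix
print(cosine_scores)
```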

2020

Efficient Transformers: A Survey
  • Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency.
  • This paper by Tay et al. from Google in 2020 characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Towards a Human-like Open-Domain Chatbot
  • This paper by Adiwardana et al. from Google in 2020 presented Meena, which is an end-to-end, neural conversational model that learns to respond sensibly to a given conversational context. The training objective is to minimize perplexity, the uncertainty of predicting the next token (in this context, the next word in a conversation).
  • Google AI’s article.

2021

Pretrained Transformers As Universal Computation Engines
  • This paper by Lu et al. from UC Berkeley, FAIR, and Google Brain in 2021 investigated the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning – in particular, without finetuning the self-attention and feedforward layers of the residual blocks – and applied this model to numerical computation, vision, and protein fold prediction.
  • In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, the authors showed that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks. In particular, the authors found that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
  • BAIR’s article; VentureBeat’s article; Yannic Kilcher’s video.
SimCSE: Simple Contrastive Learning of Sentence Embeddings
  • This paper by Gao et al. from Princeton University and Tsinghua University in 2021 presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings on semantic textual similarity tasks.
  • They first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. They find that dropout acts as minimal data augmentation and that removing it leads to a representation collapse (a minimal sketch of this unsupervised objective follows this entry).
  • Next, they propose a supervised approach utilizing NLI datasets, which incorporates annotated pairs from natural language inference datasets into their contrastive learning framework, by using “entailment” pairs as positives and “contradiction” pairs as hard negatives.
  • They evaluate SimCSE on standard semantic textual similarity (STS) tasks, and their unsupervised and supervised models using BERTbase achieve an average of 76.3% and 81.6% Spearman’s correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. They also justify the inner workings of their approach both theoretically and empirically by analyzing the alignment and uniformity of SimCSE, demonstrating that their contrastive learning objective regularizes pre-trained embeddings’ anisotropic space to be more uniform, and that it better aligns positive pairs when supervised signals are available.
  • The key takeaway is that their contrastive objective, especially the unsupervised one, may have a broader application in NLP. It provides a new perspective on data augmentation with text input, and can be extended to other continuous representations and integrated in language model pre-training.
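  • A minimal sketch of the unsupervised SimCSE objective, assuming a Hugging Face BERT encoder with [CLS] pooling (the pooling choice and hyperparameters are assumptions): the same batch is encoded twice so that dropout provides two views of each sentence, and an InfoNCE loss over in-batch negatives pulls the two views together.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active; it is the only "augmentation"

sentences = ["The cat sat on the mat.", "Dogs love to play fetch.", "It is raining today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

def embed(batch):
    # Use the [CLS] representation as the sentence embedding (one common choice).
    return encoder(**batch).last_hidden_state[:, 0]

z1, z2 = embed(batch), embed(batch)      # two stochastic forward passes
temperature = 0.05
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
labels = torch.arange(len(sentences))    # positives sit on the diagonal
loss = F.cross_entropy(sim, labels)      # InfoNCE over in-batch negatives
loss.backward()
```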
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
  • Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant.
  • This paper by Giorgi et al. from UofT in 2021 presents DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Similar to SimCSE, DeCLUTR learns high-quality sentence embeddings in a self-supervised fashion, the quality of which is equal to or better than that obtained in a supervised setting.
  • Inspired by recent advances in deep metric learning (DML), they design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, their approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Their experiments suggest that the quality of the learned embeddings scale with both the number of trainable parameters and the amount of unlabelled training data.
  • They demonstrated the effectiveness of their objective by evaluating the learned embeddings on the SentEval benchmark, which contains a total of 28 tasks designed to evaluate the transferability and linguistic properties of sentence representations.
  • Their experiments suggest that the learned embeddings’ quality can be further improved by increasing the model and train set size. Together, their results demonstrate the effectiveness and feasibility of replacing hand-labelled data with carefully designed self-supervised objectives for learning universal sentence embeddings.
  • Github repo with code and pretrained models.

2022

A Causal Lens for Controllable Text Generation
  • This paper by Hu and Li from UCSD and Amazon introduces a novel approach to conditional text generation that leverages causal inference principles to mitigate the effects of spurious correlations.
  • Conditional text modeling is hard. Natural language documents tend to contain large amounts of complex unstructured information, most of which is implicit.
  • Controllable text generation concerns two fundamental tasks of wide application, namely generating text with given attributes (i.e., attribute-conditional generation), and minimally editing existing text to possess desired attributes (i.e., text attribute transfer). Historically, attribute-conditional text generation and attribute transfer were perceived as two independent tasks and approached individually, with different conditional models which, however, are prone to producing biased text (e.g., various gender stereotypes). The authors propose a unifying causal framework that formulates controllable text generation from a principled causal perspective, modeling the two tasks within a single framework for generation and transfer, based on structural causal models (SCMs).
  • A direct advantage of the causal formulation is the use of rich causality tools to mitigate generation biases and improve control. They treat the two tasks as interventional and counterfactual causal inference based on a structural causal model, respectively. They propose to model attribute-conditional text generation as intervention, using Judea Pearl’s \(do\) operator. Hence, the attribute-conditional distribution becomes \(P(x\|do(a))\) rather than the purely association-based \(P(x\|a)\), where \(x\) is a text and \(a\) is an attribute (an intervention). Two more variables are used in the paper: \(z\), a multidimensional latent (unobserved) confounder, and \(c\), an observed proxy of \(z\).
  • Text attribute transfer is modeled as a counterfactual prediction, trying to answer the question: “what would the text have been if the attribute had been different?”
  • Training consists of four objectives: VAE objective to learn the causal model and three counterfactual objectives.
  • They apply the framework to the challenging practical setting where confounding factors (that induce spurious correlations) are observable only on a small subset (1%-5%) of training data with confounding labels for \(c\).
  • Results show that the proposed model achieves significantly better results than conventional conditional models in terms of control accuracy and reduced bias. This is true for both types of tasks: attribute-conditional generation and attribute transfer.
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples
  • This paper by Wang et al. from the National University of Defense Technology, SenseTime, and The University of Hong Kong in 2022 proposes a new contrastive sentence embedding framework called SNCSE.
  • Applying contrastive learning techniques to sentence embeddings has proven to be a great way to improve their semantic and classification properties. For a given sentence, current models utilize diverse data augmentation methods to generate positive samples, while treating other, independent sentences as negative samples. They then adopt the InfoNCE loss to pull the embeddings of positive pairs together and push those of negative pairs apart.
  • Although these models have made great progress on sentence embedding, the authors argue that contrastive losses are not sensitive enough to distinguish and decouple textual and semantic similarity. As a result, methods deploying traditional contrastive losses overestimate the semantic similarity of any pair with similar textual content, regardless of the actual semantic difference between them. This is because positive pairs in unsupervised contrastive learning come with similar and even the same textual meaning through data augmentation.
  • Consider negation: adding a simple “not” to a sentence does not change its textual properties much, but can drastically change its semantics. The authors argue that the traditional contrastive loss leads to feature suppression, making models fail to decouple the textual and semantic aspects of a sentence. To address this issue, the authors propose contrastive learning for unsupervised sentence embedding with soft negative samples (SNCSE) - samples with different semantic content (hence “negative”) and very high textual similarity (hence “soft”).
  • Moreover, the authors propose an additional loss component - bidirectional margin loss (BML) - to model semantic differences between positive and soft negative samples, while retaining InfoNCE as the loss for regular positive-negative pairs. BML helps introduce soft negative examples into the traditional contrastive learning framework (a hedged sketch follows this entry).
  • To obtain these soft negative samples, the authors construct negations of the positive examples using a rule-based system.
  • SNCSE achieves state-of-the-art performance on the semantic textual similarity (STS) task with an average Spearman’s correlation coefficient of 78.97% on BERTbase and 79.23% on RoBERTabase, an improvement compared to other contrastive methods (e.g., SimCSE). Finally, they adopt a rank-based error analysis method to detect the weaknesses of SNCSE.
  • Github repo.
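  • A hedged sketch of one plausible form of the bidirectional margin loss: the margin values, sign conventions, and exact formulation used in SNCSE may differ; the intent is only to show how the similarity gap between a positive and a soft negative can be constrained to a band rather than simply pushed apart.
```python
import torch
import torch.nn.functional as F

def bidirectional_margin_loss(anchor, positive, soft_negative, lower=0.0, upper=0.4):
    # anchor, positive, soft_negative: (batch, dim) sentence embeddings.
    # Margins "lower" and "upper" are assumed values for illustration only.
    delta = (F.cosine_similarity(anchor, soft_negative)
             - F.cosine_similarity(anchor, positive))   # expected to be negative
    # Penalize delta when it leaves the band [-upper, -lower]: the soft negative
    # should be less similar than the positive, but not pushed arbitrarily far.
    return (F.relu(delta + lower) + F.relu(-delta - upper)).mean()
```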
LaMDA: Language Models for Dialog Applications
  • This paper by Thoppilan et al. from Google in 2022 proposes safe, grounded, and high-quality dialog models for open-ended applications.
  • Language models are becoming more capable than ever before and are helpful in a variety of tasks — translating one language into another, summarizing a long document into a brief highlight, or answering information-seeking questions. Among these, open-domain dialog, where a model needs to be able to converse about any topic, is probably one of the most difficult, with a wide range of potential applications and open challenges. In addition to producing responses that humans judge as sensible, interesting, and specific to the context, dialog models should adhere to Responsible AI practices, and avoid making factual statements that are not supported by external information sources.
  • Defining objectives and metrics is critical to guide training dialog models. LaMDA has three key objectives — Quality, Safety, and Groundedness — each of which they measure using carefully designed metrics as follows.
    • Quality: They decompose Quality into three dimensions, Sensibleness, Specificity, and Interestingness (SSI), which are evaluated by human raters. Sensibleness refers to whether the model produces responses that make sense in the dialog context (e.g., no common sense mistakes, no absurd responses, and no contradictions with earlier responses). Specificity is measured by judging whether the system’s response is specific to the preceding dialog context, and not a generic response that could apply to most contexts (e.g., “ok” or “I don’t know”). Finally, Interestingness measures whether the model produces responses that are also insightful, unexpected or witty, and are therefore more likely to create better dialog.
    • Safety: Safety is essential for responsible AI. Their Safety metric is composed of an illustrative set of safety objectives that captures the behavior that the model should exhibit in a dialog. These objectives attempt to constrain the model’s output to avoid any unintended results that create risks of harm for the user, and to avoid reinforcing unfair bias. For example, these objectives train the model to avoid producing outputs that contain violent or gory content, promote slurs or hateful stereotypes towards groups of people, or contain profanity. Their research towards developing a practical Safety metric represents very early work, and there is still a great deal of progress to be made in this area.
    • Groundedness: The current generation of language models often generate statements that seem plausible, but actually contradict facts established in known external sources. This motivates their study of groundedness in LaMDA. Groundedness is defined as the percentage of responses with claims about the external world that can be supported by authoritative external sources, as a share of all responses containing claims about the external world. A related metric, Informativeness, is defined as the percentage of responses with information about the external world that can be supported by known sources, as a share of all responses. Therefore, casual responses that do not carry any real world information (e.g., “That’s a great idea”), affect Informativeness but not Groundedness. While grounding LaMDA generated responses in known sources does not in itself guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source.
  • With the objectives and metrics defined, they describe LaMDA’s two-stage training: pre-training and fine-tuning. In the fine-tuning stage, they train LaMDA to perform a mix of generative tasks to generate natural-language responses to given contexts, and classification tasks on whether a response is safe and high-quality, resulting in a single multi-task model that can do both. The LaMDA generator is trained to predict the next token on a dialog dataset restricted to back-and-forth dialog between two authors, while the LaMDA classifiers are trained to predict the Safety and Quality (SSI) ratings for the response in context using annotated data. During a dialog, the LaMDA generator first generates several candidate responses given the current multi-turn dialog context, and the LaMDA classifiers predict the SSI and Safety scores for every response candidate. Candidate responses with low Safety scores are first filtered out. Remaining candidates are re-ranked by their SSI scores, and the top result is selected as the response.
  • They observe that LaMDA significantly outperforms the pre-trained model in every dimension and across all model sizes. A sketch of the candidate filtering and re-ranking step described above follows this entry.
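  • The following sketch illustrates the generate-filter-rerank response selection described above; the generator, classifiers, threshold, and fallback response are hypothetical stand-ins rather than LaMDA’s actual components.
```python
from typing import Callable, List

def select_response(
    context: str,
    generate_candidates: Callable[[str], List[str]],   # hypothetical generator
    safety_score: Callable[[str, str], float],          # hypothetical safety classifier
    ssi_score: Callable[[str, str], float],             # hypothetical SSI (quality) classifier
    safety_threshold: float = 0.8,                      # assumed threshold
) -> str:
    candidates = generate_candidates(context)
    # Filter out candidates the safety classifier scores below the threshold.
    safe = [c for c in candidates if safety_score(context, c) >= safety_threshold]
    if not safe:
        return "I'm not sure how to respond to that."   # hypothetical fallback
    # Re-rank the remaining candidates by their SSI score and return the top one.
    return max(safe, key=lambda c: ssi_score(context, c))
```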
Causal Inference Principles for Reasoning about Commonsense Causality
  • Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies wholeheartedly on deep language models and is potentially susceptible to confounding co-occurrences.
  • This paper by Zhang et al. from UPenn in 2022 articulates CCR from a completely new perspective using classical causal principles. Their contributions include (i) a novel commonsense causality framework; (ii) mitigating confounding co-occurrences by matching temporal propensities; (iii) a modular pipeline for zero-shot CCR with demonstrated effectiveness.
  • They propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities on various datasets.

Speech

2017

On Evaluating and Comparing Conversational Agents
  • This paper by Venkatesh et al. from Amazon in 2017 proposes a comprehensive evaluation strategy using multiple metrics which correlate well with human judgement and are thus designed to reduce subjectivity, for non goal-oriented conversations. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. They show that these metrics can be used as a reasonable proxy for human judgment.
  • They propose the following evaluation metrics:
    • Conversational User Experience: Measure of the overall interaction experience. Conversations with a socialbot can be significantly different from those with humans because of expectations, behavior or sentiment, trust and visual cues.
    • Engagement: To enable an open-ended, multi-turn conversation, engagement is critical. Engagement is a measure of interestingness in a conversation. Other models also term this as interestingness.
    • Coherence: A coherent response indicates a comprehensible and relevant response to a user’s request. Other models also term this as specificity.
    • Conversational Depth: Coherence is usually measured at turn level. However, in a multi-turn conversation, context may be carried over multiple turns. While evaluating conversational agents, it is important to detect the context and the depth of the conversations.
    • Topical Diversity: A good conversational agent is capable of: (i) identifying the topics and keywords in a given utterance, (ii) holding conversations around the same topics, (iii) sharing related concepts, and (iv) identifying the appropriate intent.
    • Domain Coverage: An agent which is able to interact on multiple domains can be considered more consistent with human expectations.
  • They provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout Amazon’s Alexa Prize competition.
  • To date, this study offers the largest setting for evaluating agents, with millions of conversations and hundreds of thousands of ratings from users. They believe that this work is a step towards an automatic evaluation process for conversational AIs.

2018

Attention-Based Models for Text-Dependent Speaker Verification
  • This paper by Chowdhury et al. from Washington State and Google in 2018 proposes using attention-based models for a keyword-based text-dependent speaker verification (SV) system. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. Examples of such TD-SV phrases could be trigger keywords for voice assistants, such as “Hey Siri”, “Alexa”, or “OK Google”. In this study, they focus on “OK Google” and “Hey Google”.
  • A challenge in prior architectures is that silence and background noise are not well handled. Even though the SV system runs on short, sub-second windows segmented by a keyword detector, the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built only using the frames corresponding to phonemes. To remedy this, they propose to use an attention layer as a soft mechanism to emphasize the most relevant elements of the input sequence (a minimal sketch of such attention pooling appears after this entry).
  • Their training dataset is a collection of anonymized user voice queries, which is a mixture of “OK Google” and “Hey Google”. It has around 150M utterances from around 630K speakers.
  • Attention helps summarize relevant information that occurs through the entire length of an input sequence. This paper also experiments with different attention mechanisms apart from the basic attention: cross-layer attention, and divided-layer attention. For cross-layer attention, the scores and weights are not computed using the outputs of the last LSTM layer but the outputs of the second-to-last layer. However, the d-vector is still the weighted average of the last layer output.
  • For divided-layer attention, they double the dimension of the last layer LSTM output, and equally divide its dimension into two parts. They use one part to build the d-vector, while using the other to learn the scores.
  • From their experimental results, the best practices are to: (i) use a shared-parameter non-linear scoring function; (ii) use a divided-layer attention connection to the last layer output of the LSTM; and (iii) apply sliding-window max-pooling on the attention weights. After combining all these best practices, they improved the EER of the baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement.
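  • A minimal sketch of basic attention pooling over LSTM frame outputs to form a d-vector, in the spirit of the approach above; the scoring network and layer sizes are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentivePooling(nn.Module):
    def __init__(self, hidden_size: int = 128):
        super().__init__()
        # Small scoring network producing one scalar score per frame (an assumed form).
        self.scorer = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                    nn.Linear(hidden_size, 1))

    def forward(self, lstm_out):
        # lstm_out: (batch, frames, hidden) outputs of the last LSTM layer
        scores = self.scorer(lstm_out)                 # (batch, frames, 1)
        weights = torch.softmax(scores, dim=1)         # emphasize phoneme frames over silence
        d_vector = (weights * lstm_out).sum(dim=1)     # weighted average over frames
        return F.normalize(d_vector, dim=-1)           # length-normalized speaker embedding
```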
Efficient Voice Trigger Detection for Low Resource Hardware
  • This paper by Sigtia et al. from Apple in Interspeech 2018 describes the architecture of an always-on DNN-HMM system for on-device keyword spotting (KWS) in low-resource conditions, i.e., for battery-powered mobile devices.
  • An always-available voice assistant needs a carefully designed voice keyword detector to satisfy the power and computational constraints of battery powered devices. They employ a multi-stage system that uses a low-power primary stage to decide when to run a more accurate (but more power-hungry) secondary detector. They describe a straightforward primary detector and explore variations that result in very useful reductions in computation (or increased accuracy for the same computation). By reducing the set of target labels from three to one per phone, and reducing the rate at which the acoustic model is operated, the compute rate can be reduced by a factor of six while maintaining the same accuracy.
  • When the device is battery powered like the iPhone or the Apple Watch, it is imperative that the voice trigger detector consume as little power as possible while still maintaining sufficient accuracy. In recent iPhone designs, this is achieved by running a primary detector on a low-power processor that runs even when the main processor is asleep. This primary detector can decide to wake the main processor, where further checks are done (on the same waveform) before the main recognizer is applied and the identity of the speaker is confirmed. This paper focuses only on the primary detector which runs continuously on a low-power, low resource, always-on processor where computation and memory are the limiting factors.
  • It has been demonstrated that LVCSR systems trained to predict whole-phone labels (a single label per phone) can achieve accuracies similar to conventional systems with 3 labels per phone. However, naively implementing this approach for the voice trigger task yields a significant loss in accuracy. The authors hypothesize that the reason is the need for a minimum duration constraint for each phone. To test this hypothesis, they replicate each state in the trigger phrase HMM by a constant factor while still using the whole-phone DNN, which is equivalent to imposing a minimum duration on each of the labels. This yields accuracy similar to the baseline.
  • An alternative way to impose longer minimum durations for each state is to run the detector at a rate lower than 100 FPS. This results in longer intervals between predictions, which effectively increases the minimum duration of the HMM states. For on-device KWS, operating the detectors at a lower frame rate is an attractive route for limiting the computation performed by the system.
  • Their results demonstrate that for a voice trigger detection problem, it is not necessary to divide phone labels into 3 states for the beginning, middle, and end of each phone. They achieve results similar to the baseline with a single label per phone and minimum duration constraints. This principle has been previously demonstrated for LVCSR with LSTM AMs, but their results demonstrate that the same holds true for DNN AMs with large input windows. As a practical consequence, they are able to run the detectors at frame rates as low as 16.6 FPS without any loss in accuracy compared to the baseline. This represents a factor of 6 reduction in computation, which is significant when the system is deployed on low-resource hardware. Alternatively, they can run a detector 6 times as large as the baseline without any extra computation.

2020

Automatic Speaker Recognition with Limited Data
  • This paper by Li et al. from UCLA, Tongji, and Amazon in WSDM 2020 proposes an adversarial few-shot learning-based speaker identification method that needs only a limited number of training instances.
  • They employ metric learning-based few-shot learning to learn speaker acoustic representations using a support module and a query module, where the limited instances are comprehensively utilized to improve identification performance. To that end, they first randomly sample a set of speakers from the training set to construct the support module. For each speaker in the support module, they further randomly sample pieces of his/her audio instances and derive the corresponding MFCCs. These MFCCs are fed into an embedding layer so that each audio instance is represented by a fixed-length vector. In the support module, they then derive a representative embedding for each speaker, which summarizes the acoustic biometrics of that speaker. This is done using an attention layer that learns importance weights over each audio embedding of a particular speaker.
  • In the query module, they randomly select a piece of audio from a speaker, which is one of the speakers in the support module. They feed it into the embedding layer to derive the audio embedding.
  • They then compare the distances between the query embedding and all the representative embeddings in the support module. These distances are used to compute a distribution over all speakers in the support module. The model is optimized by such iterative comparisons and reasoning between the support and query modules (see the sketch after this entry).
  • Furthermore, adversarial learning is applied to further enhance the generalization and robustness of speaker identification with adversarial examples. The goal of employing adversarial training is to allow the identification system to not only be optimized on the instances in the training data, but also to be robust to unseen adversarial perturbations. To enhance robustness, they force the model to perform consistently well even when adversarial perturbations are present. To achieve this goal, they further optimize the model to minimize the objective function with the perturbed parameters.
  • Experiments conducted on the publicly available, large-scale LibriSpeech dataset demonstrate that the proposed method, AFEASI, significantly outperforms eleven baseline methods.
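  • A hedged sketch of the metric-learning step described above: attention-weighted speaker prototypes are built from support embeddings, and a query is classified by a distance-based distribution over speakers. The dimensions and the use of squared Euclidean distance are assumptions.
```python
import torch
import torch.nn.functional as F

def speaker_prototypes(support, attn_scores):
    # support: (speakers, shots, dim) audio embeddings per speaker
    # attn_scores: (speakers, shots) unnormalized importance weights from an attention layer
    weights = torch.softmax(attn_scores, dim=1).unsqueeze(-1)   # (speakers, shots, 1)
    return (weights * support).sum(dim=1)                       # (speakers, dim) representative embeddings

def classify_query(query, prototypes):
    # query: (dim,) embedding of a query utterance
    dists = ((prototypes - query) ** 2).sum(dim=-1)             # squared Euclidean distances (assumed metric)
    return F.softmax(-dists, dim=0)                             # distribution over support speakers
```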
Speaker Identification for Household Scenarios with Self-attention and Adversarial Training
  • This paper by Li et al. from Amazon, UCLA, and the University of Notre Dame in Interspeech 2020 proposes leveraging the self-attention mechanism to enhance long-span modeling of speaker characteristics, since self-attention allows the model to fully utilize dependencies over all frames in an utterance, resulting in informative global acoustic embedding representations. In contrast, CNNs by design are biased toward modeling features over nearby frames and frequencies, and RNNs are hard to train for retention of relevant information over long time intervals. These types of neural networks thus potentially face problems capturing dependencies and characteristics expressed over long time spans within an utterance.
  • Further, they utilize adversarial training as a tool to enhance the robustness and generalization of trained models, rather than as a defense against attacks.
  • To learn the self-attentive utterance representations, the utterance spectrograms are fed as input to the self-attention layer to learn transformed frame representations of speaker-relevant acoustic features, in two steps. First, they aim at mining correlations across frames in an utterance by having each transformed frame embedding be the weighted sum of the frame embedding of itself and other related frames, where each weight gauges the similarity between one frame and another. Second, they aggregate the frame embeddings, including their correlational information, by averaging them over the time dimension into one embedding vector and further L2-normalizing it into a fixed-length embedding vector that expresses the speaker-relevant information in the utterance. This yields a summarized global acoustic representation of an utterance (see the sketch after this entry).
  • Experiments conducted on the VCTK dataset show that the proposed model significantly outperforms four state-of-the-art baselines in identifying both known and new speakers in terms of EER.
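  • A rough sketch of the self-attentive utterance embedding described above (self-attention over frames, temporal averaging, then L2 normalization); the single attention layer and head count are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentiveUtteranceEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) spectrogram frames of an utterance
        attended, _ = self.self_attn(frames, frames, frames)  # each frame attends to all frames
        utterance = attended.mean(dim=1)                      # average over the time dimension
        return F.normalize(utterance, dim=-1)                 # fixed-length, L2-normalized embedding
```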
Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
  • This paper by Higuchi et al. from Apple in 2020 proposes a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device. Therefore, having small memory and compute cost is crucial for a voice trigger detection system.
  • Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. SVDFs approximate a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, the authors propose S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection.
  • An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. This is similar to the idea of depth-wise separable convolutions, where \(K\) filters (where \(K\) is the number of channels in the input) are applied to each channel of the input (leading to a depth-wise convolution) yielding the same number of channels as the input, followed by a point-wise convolution which uses a \(1 \times 1 \times K\) kernel, leading to an output shape that has a single channel. Applying as many point-wise convolution filters as the desired number of output channels yields the final output with far fewer multiplications than a standard convolution and fewer parameters than the baseline. As such, compared to a standard 2D CNN filter, the S1DCNN can be regarded as a factorization of a 2D CNN filter: an \(F \times K\) filter of the 2D CNN layer is factorized into an \(F \times 1\) filter of the first 1D CNN layer and a \(1 \times K\) filter of the second 1D CNN layer. This factorization reduces the number of parameters from \(O(F \times K)\) to \(O(F + K)\) (see the parameter-count comparison after this entry).
  • They show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By increasing the length of the future context (which leads to longer time delays), the S1DCNN further improves the FRR by up to 12.2% relative.
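  • The parameter-count comparison below illustrates the factorization; the channel count and kernel sizes are illustrative assumptions.
```python
import torch.nn as nn

C, F, K = 32, 5, 9  # illustrative channel count and kernel sizes

full_2d = nn.Conv2d(C, C, kernel_size=(F, K))            # standard F x K 2D filters
stage_1 = nn.Conv2d(C, C, kernel_size=(F, 1))            # 1D conv stage: F x 1 filters
stage_2 = nn.Conv2d(C, C, kernel_size=(1, K), groups=C)  # depth-wise 1D conv stage: 1 x K filters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_2d))                   # 46,112 parameters (O(F * K) scaling)
print(count(stage_1) + count(stage_2))  # 5,472 parameters (O(F + K) scaling)
```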
Optimize What Matters: Training DNN-HMM Keyword Spotting Model Using End Metric
  • In DNN-HMM based KWS models, the DNN computes the observation probabilities and outputs a probability distribution over as many classes as the HMM states for each speech frame using a softmax layer. The DNN is typically trained to minimize the average (over all frames) cross-entropy loss between the predicted and the ground-truth distributions. The HMM decoder computes the word detection score using the observation, the state transition, and the prior probabilities. This training ignores the HMM transition and prior probabilities which are learned independently using training data statistics.
  • Such an independently trained DNN model relies on the accuracy of the ground-truth phoneme labels as well as the HMM model. This model also assumes that the set of keyword states are optimal and each state is equally important for the keyword detection task. The DNN spends all of its capacity focusing equally on all of the states, without considering its impact on the final metric of the detection score, resulting in a loss-metric mismatch.
  • This paper by Shrivastava et al. from Apple in 2021 seeks to address this loss-metric mismatch by training the DNN model by directly optimizing the keyword detection score instead of optimizing for the state probabilities.
  • This end-metric based training uses only the start and the end of the keyword instead of requiring all of the speech frames to be annotated, leading to substantial savings in annotation cost. Their method changes only the training algorithm without changing the inference pipeline; therefore, there is no overhead in runtime memory or compute, since only the model parameters need to be updated.
  • They use a hinge loss on the detection score, which ignores samples during optimization if their scores are already beyond a margin.
  • Further, they propose IOU-based sampling and design an optimization procedure that maximizes the detection score for a speech segment that “tightly” contains the keyword (positive samples) and minimizes the detection score for speech that does not contain the keyword (negative samples). They also sample additional hard negatives that contain partial keywords, because the model should not trigger on partial phrases. To formalize the concept of “tightly” containing the keyword, they use the concept of intersection-over-union (IOU) borrowed from computer vision. They sample positive and negative windows from speech utterances such that the positive windows have high IOU and negative windows have low IOU with the ground-truth keyword window (a sketch of the IOU-based sampling and hinge loss follows this entry).
  • The proposed approach works significantly better (> 70% relative reduction in FRR) than the conventional DNN-HMM training and is more interpretable, accurate in localization, and data-efficient compared to the CNN-based end-to-end models.
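  • A hedged sketch of the IOU-based window sampling and a hinge-style loss on the detection score; the thresholds, margin, and window representation are illustrative assumptions, not the paper’s exact values.
```python
import torch
import torch.nn.functional as F

def iou(window_a, window_b):
    # Windows are (start, end) pairs in frames.
    inter = max(0.0, min(window_a[1], window_b[1]) - max(window_a[0], window_b[0]))
    union = (window_a[1] - window_a[0]) + (window_b[1] - window_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, keyword_window, pos_iou=0.7, neg_iou=0.3):
    # Windows tightly containing the keyword are positives; windows with little
    # overlap (including partial-keyword hard negatives) are negatives.
    overlap = iou(window, keyword_window)
    if overlap >= pos_iou:
        return 1.0
    if overlap <= neg_iou:
        return -1.0
    return 0.0  # ambiguous windows are skipped

def hinge_loss(detection_scores, labels, margin=1.0):
    # Samples whose scores are already beyond the margin contribute zero loss.
    labels = labels.to(detection_scores.dtype)
    return F.relu(margin - labels * detection_scores)[labels != 0].mean()
```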

2021

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
  • This paper by Garg et al. from Apple in 2021 presented a unified and hardware-efficient architecture for two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can get falsely activated by audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post-trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices, which are computationally expensive to obtain on device.
  • They proposed a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features.
Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching
  • This paper by Punjabi et al. from Amazon in 2021 proposes joint ASR-LID architectures based on RNN-Ts as an efficient, on-device-suitable alternative to conventional dynamic language switching solutions. Two primary joint modeling paradigms are explored: coupled training, where ASR and LID vocabularies share the RNN-T output space, and multi-task learning, where ASR and LID losses are modeled using dedicated parameters but minimized jointly.
  • The corpus used for RNN-T training consists of in-house, far-field, de-identified voice-assistant recordings amounting to about 3.8k and 12.5k hours of spoken Hindi and Indian English data, respectively. The acoustic LID classifier (used for baseline LID and for providing language representations to RNN-T) is trained using 2k hours of balanced English-Hindi data.
  • Experiments with Indian English and spoken Hindi show that: (a) code-switched utterances are inherently difficult to recognize and classify, (b) multi-task learning provides superior ASR performance whereas coupled training offers better LID accuracy, and (c) multi-task models with a dedicated LID feed-forward network offer the best performance overall.
  • The proposed joint ASR-LID architectures are language agnostic and, in principle, can be scaled to more than two languages.
Robust Self-Supervised Audio-Visual Speech Recognition
  • Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available.
  • This paper by Shi et al. from FB Research in 2022 introduces a self-supervised AVSR framework based on Audio-Visual Hidden-unit BERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, to tackle the problem of audio-based automatic speech recognition (ASR) degrading significantly in noisy environments and being particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
  • On the largest available AVSR benchmark dataset LRS3, AV-HuBERT approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
  • Facebook AI link.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
  • This paper by Hsu et al. from FB Research in 2021 introduces Hidden-unit BERT (HuBERT), a self-supervised speech representation learning approach for speech recognition, generation, and compression.
  • It is based on the masked prediction problem of predicting K-means cluster assignments of masked segments of continuous input.
  • On both the Librispeech 960 hours and the 60,000-hour Libri-light pre-training setups, HuBERT matches or outperforms SOTA systems over all fine-tuning subsets of 10 mins, 1h, 10h, 100h, and 960h. Furthermore, the learned representation quality improves dramatically when the K-means cluster assignments are iteratively refined using latent representations learned in a previous iteration. HuBERT scales well to a 1B-parameter transformer model, showing a relative WER reduction of up to 13% on the test-other subset.
  • Facebook AI link.
Deep Spoken Keyword Spotting: An Overview
  • This paper by Lopez-Espejo et al. from Aalborg University, UT Dallas and Oticon in 2021 conducts a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS.
BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization for a Variable Number of Speakers
  • End-to-end neural diarization (EEND) with self-attention is one of the approaches that aim to model the joint speech activity of multiple speakers. It integrates voice activity and overlap detection with speaker tracking in end-to-end fashion. Moreover, it directly minimizes diarization errors and has demonstrated excellent diarization accuracy on two-speaker telephone conversations. However, EEND as originally formulated is limited to a fixed number of speakers because the output dimension of the neural network needs to be prespecified. Several methods have been proposed recently to overcome the limitations of EEND. One approach uses a speaker-wise chain rule to decode a speaker-specific speech activity iteratively conditioned on previously estimated speech activities. Another approach proposes an encoder/decoder-based attractor calculation. The embeddings of multiple speakers are accumulated over the time course of the audio input, and then disentangled one-by-one, for speaker identity assignment by speech frame. However, all these state-of-the-art EEND methods only work in an offline manner, which means that the complete recording must be available before diarization output is generated. This makes their application impractical for settings where potentially long multi-speaker recordings need to be processed incrementally (in streaming fashion).
  • This paper by Han et al. from Amazon in 2021 proposes a novel method to perform EEND in a blockwise online fashion so that speaker identities are tracked with low latency soon after new audio arrives, without much degradation in accuracy compared to the offline system. They utilize the incremental Transformer encoder, which attends only to its left contexts and ignores its right contexts, thus enabling blockwise online processing. Furthermore, the incremental Transformer encoder uses block-level recurrence in the hidden states to carry over information block by block, reducing computation time while attending to previous blocks. To their knowledge, theirs is the first method that uses the incremental Transformer encoder with block-level recurrence to enable online speaker diarization.
  • They present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. They propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, they show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with a context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, they show accuracy comparable to the offline clustering-based system.
Attentive Contextual Carryover For Multi-turn End-to-end Spoken Language Understanding
  • This paper by Wei et al. from Amazon in ASRU 2021 proposes a novel E2E SLU approach where a multi-head gated attention mechanism is introduced to effectively incorporate the dialogue history in a multi-turn E2E SLU system.
  • They propose a multi-head gated attention mechanism as a context combiner which combines the context encodings consisting of dialogue acts and previous utterances to create the final context vectors that are fed into the model. They explore different ways to combine the context encodings into the model: (i) averaged contextual carryover, (ii) attentive contextual carryover, and (iii) gated attentive contextual carryover. Gated attentive contextual carryover performed better than traditional multi-head attention and a simple average.
  • The attention-based context can be integrated at different layers of a neural E2E SLU model such as the speech encoder stage and the ASR-NLU hidden interface, and the shared context ingestion which integrates context into both acoustic embeddings and the ASR-NLU interface. The shared context ingestion approach gave the biggest improvement compared to the other schemes.
  • They built contextual E2E SLU models based on the Recurrent Neural Network Transducer (RNN-T) as well as the Transformer Transducer (T-T). E2E SLU models share an audio encoder network that encodes log-filterbank energy (LFBE) features, a prediction network that encodes a sequence of predicted wordpieces, a joint network that combines the encoder and the prediction network, and an NLU tagger that predicts intents and slots. The intent tagger contains two feedforward layers before projecting into the number of intents, and the slot tagger directly takes the output embeddings from the NLU tagger and projects them into the slot size. The audio encoder in the E2E T-T SLU and E2E RNN-T SLU are Transformer layers (with 4 attention heads) and LSTM layers, respectively.
  • The models are trained and evaluated on an internal industrial voice assistant (IVA) dataset and a synthetic and publicly available multi-turn E2E SLU (Syn-Multi) dataset. They utilize SpecAugment to augment audio feature inputs.
  • The proposed approach significantly improves E2E SLU accuracy on the internal industrial voice assistant and publicly available datasets compared to the non-contextual E2E SLU models.
SmallER: Scaling Neural Entity Resolution for Edge Devices
  • This paper by McGowan et al. from Amazon in Interspeech 2021 introduces SmallER, a scalable neural entity resolution system capable of running directly on edge devices.
  • SmallER addresses constraints imposed by the on-device setting, such as bounded memory consumption for both model and catalog storage, limited compute resources, and the related latency challenges introduced by those restrictions. Their model offers a small-footprint neural architecture capable of learning syntactic and semantic information simultaneously using distinct modules, and is trained to handle multiple domains within one compact architecture (i.e., one model for all domains).
  • They use compressed tries to reduce the space required to store catalogs on device. They also propose a novel implementation of spatial partitioning trees which at inference time strikes a balance between reducing runtime latency (by reducing the search space) and preserving recall relative to a full/exhaustive catalog search.
  • They utilize Quantization Aware Training (QAT) to train SmallER. The final model consumes only 3MB of memory at inference time with classification accuracy surpassing that of previously established, domain-specific baseline models on live customer utterances. Furthermore, catalog entries are compressed overall by a factor of 2.5x.
  • For the largest catalogs they consider (300 or more entries), their proxy metric for runtime latency is reduced by more than 90%.
Leveraging Multilingual Neural Language Models for On-Device Natural Language Understanding
  • This paper by Tu et al. from Amazon in the 2021 Web Conference Workshop on Multilingual Search investigates learning multi-lingual/cross-lingual representations as an approach to increase the accuracy of on-device multilingual models without increasing their footprint relative to monolingual models, appropriate for deployment on edge devices.
  • They show that cross-lingual representations can help improve NLU performance in both monolingual and multilingual settings. In particular, they show that the performance improvements for non-English monolingual NLU models are higher when they are seeded with cross-lingual representations, as compared to seeding with monolingual representations. Further, multilingual experiments suggest that the scarcer the available data-resources, the more beneficial it is to use cross-lingual representations.
Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
  • All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
  • This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 compares two data-centric AI methods to improve performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data.
  • Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.

2022

Adaptive Global-Local Context Fusion for Multi-Turn Spoken Language Understanding
  • This paper by Tran et al. from Amazon in AAAI 2022 tackles the problem of multi-turn Spoken Language Understanding (SLU), where dialogue contexts are used to guide intent classification and slot filling. They propose a novel contextual SLU model for multi-turn intent classification and slot filling tasks that aims to selectively incorporate dialogue contexts, such as previous utterances and dialogue acts for multi-turn SLU.
  • They introduce an adaptive global-local context fusion mechanism to selectively integrate dialogue contexts into their model. The local context fusion aligns each dialogue context using multi-head attention, while the global context fusion measures overall context contribution to intent classification and slot filling tasks.
  • The models are trained and evaluated on the publicly-available Sim-R and Sim-M datasets and an internal in-house dataset.
  • Experiments show that on two benchmark datasets, their model achieves absolute F1 score improvements of 2.73% and 2.57% for the slot filling task on Sim-R and Sim-M datasets, respectively.
  • Ablation studies indicate that dialogue history contexts play a crucial role in improving SLU task in the multi-turn dialogue setting.

Multimodal

2017

Axiomatic Attribution for Deep Networks
  • This paper by Sundararajan et al. from Google in ICML 2017 studies the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works.
  • They identify two fundamental axioms — Sensitivity and Implementation Invariance — that attribution methods ought to satisfy. They show that these are not satisfied by most known attribution methods, which they consider to be a fundamental weakness of those methods.
  • They use the axioms to guide the design of a new attribution method called Integrated Gradients.
  • Their method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator (see the sketch after this entry).
  • Since this method is modality-agnostic, they apply it to a couple of image models, a couple of text models, and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
  • Since integrated gradients add up to the final prediction score, the magnitudes can be used to account for the contributions of each feature. For instance, for an example molecule in the paper, atom-pairs that have a bond between them cumulatively contribute 46% of the prediction score, while all other pairs cumulatively contribute only −3%.
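  • A minimal sketch of Integrated Gradients via a Riemann-sum approximation of the path integral from a baseline to the input; the zero baseline, step count, and the assumption that the model returns class logits are illustrative choices.
```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    # x: a single input tensor, e.g. (features,) or (C, H, W); target: class index
    baseline = torch.zeros_like(x) if baseline is None else baseline
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        logits = model(point.unsqueeze(0))       # add a batch dimension
        logits[0, target].backward()             # gradient of the target class score
        grads.append(point.grad.detach())
    avg_grad = torch.stack(grads).mean(dim=0)    # Riemann approximation of the path integral
    # Attributions sum (approximately) to model(x)[target] - model(baseline)[target].
    return (x - baseline) * avg_grad
```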

2021

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
  • In typical unsupervised training of controllable generative models, there exists a training-inference mismatch: during training, the same sample is used as both the content input and the style input, whereas during inference, the content and style inputs come from different samples, i.e., the reference style sample contains different content than the target content. This mismatch leads to incorrect content generation during inference.
  • This paper by Chang et al. from Apple in 2021 presented a simple but effective technique, style equalization, to deal with this training-inference mismatch when controllable auto-regressive models are learned in an unsupervised manner. Style equalization takes unpaired samples as input during both training and inference and transforms the style of sample B to that of A by estimating their style difference.
  • The model is trained using tuples \((x_i, c_i)\), where \(x_i\) is the style sample and \(c_i\) is the content sample.
  • If a generative model learns to utilize the content information in the style example, during inference the generative model will generate wrong content. This phenomenon is called content leakage.
  • Instead of directly using sample B as style (in which case there is no ground truth), they jointly learn a style transformation function (using CNNs + Multihead attention), which estimates the style difference between A and B and transforms the style of sample B to the style of A. The generative model then takes content A and the transformation output (that contains the style of A) to reconstruct sample A. The proposed method enables us to use sample A as the ground truth while learning in the non-parallel setting. During inference given arbitrary content A and reference sample B, they turn off the style transformation (since by construction, the style difference is zero), and thus the output sample contains content A and style of B.
  • The proposed method is general and can be applied to different sequence signals. They apply the proposed method on two signal domains, speech and online handwriting, and evaluate the performance carefully via quantitative evaluation (by computing content error rates) and conducting qualitative user studies. Their results show that the proposed method outperforms state-of-the-art methods, including those having access to additional style supervision like speaker labels. Both quantitative and qualitative results show that their model achieves near-real content accuracy and style reproduction.
  • Note that for style equalization to be successful, the style transformation function \(M(\cdot)\) should not transfer any content-related information (e.g., copy the entire sequence) from \(x\) but only its style information, so that the decoder will utilize the transferred style and rely on the provided content input to generate the output. Therefore, the design of \(M\) is critical.
  • They evaluate the proposed method on two multi-speaker speech datasets. VCTK dataset (Yamagishi et al., 2019) contains 110 speakers and 44 hours of speech, and LibriTTS dataset (Zen et al., 2019) contains 2,311 speakers and 555 hours of speech in the training set.

2022

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
  • This paper by Shi et al. from FB Research in 2022 introduces Audio-Visual Hidden Unit BERT (AV-HuBERT), which exploits the fact that video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip movements and the produced sound.
  • AV-HuBERT is a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. It learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
  • On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using their audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%).
  • Code and models are available here.
  • Facebook AI article.

Core ML

2018

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
  • This article by Raschka from UW-Madison in 2018 reviews different techniques that can be used for model evaluation, model selection, and algorithm selection.
  • Each technique is discussed and its pros and cons are weighed with supporting examples. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning.
  • Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and \(k\)-fold cross-validation are reviewed, the bias-variance trade-off for choosing \(k\) is discussed, and practical tips for the optimal choice of \(k\) are given based on empirical evidence.
  • Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed.
  • Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.
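  • As an illustration of nested cross-validation for algorithm and hyperparameter selection, the scikit-learn sketch below uses an inner loop for tuning and an outer loop for an unbiased performance estimate; the dataset, estimator, and parameter grid are illustrative choices.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimation

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# The outer loop scores a model whose hyperparameters are tuned only on inner folds,
# so the reported estimate is not biased by the selection procedure.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
```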