Papers List

  • A curated set of papers I’ve reviewed for the latest scoop in AI.

Seminal Papers / Need-to-know

Computer Vision


Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
  • This paper by Gutmann and Hyvarinen in AISTATS 2010 introduced the concept of negative sampling that formed the basis of contrastive learning.
  • They propose a new estimation principle for parameterized statistical models, noise-contrastive estimation, which discriminates between observed data and artificially generated noise. This is accomplished by performing nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. They show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance.
  • In particular, the method is shown to directly work for unnormalized models, i.e., models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter.
  • For a tractable ICA model, they compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling.
  • Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency.
  • They apply the method to the modeling of natural images and show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
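The core objective can be sketched in a few lines of numpy on a toy 1-D Gaussian whose mean and log-normalizer are the parameters, with a standard normal as noise (an illustrative setup, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized 1-D Gaussian model: ln p_m(u; mu, c) = -(u - mu)^2 / 2 + c,
# where the log-normalizer c is treated as just another parameter to estimate.
def log_model(u, mu, c):
    return -0.5 * (u - mu) ** 2 + c

def log_noise(u):
    # Known noise density: standard normal.
    return -0.5 * u ** 2 - 0.5 * np.log(2.0 * np.pi)

def nce_objective(x, y, mu, c):
    """Average log-likelihood of the nonlinear logistic regression that labels
    data x as 1 and noise y as 0, using G(u) = ln p_m(u) - ln p_n(u) as the
    logit -- the model log-density appears inside the regression nonlinearity."""
    G_x = log_model(x, mu, c) - log_noise(x)
    G_y = log_model(y, mu, c) - log_noise(y)
    h_x = 1.0 / (1.0 + np.exp(-G_x))   # sigmoid
    h_y = 1.0 / (1.0 + np.exp(-G_y))
    return np.mean(np.log(h_x)) + np.mean(np.log(1.0 - h_y))

x = rng.normal(2.0, 1.0, 5000)   # observed data from N(2, 1)
y = rng.normal(0.0, 1.0, 5000)   # artificially generated noise

# The objective prefers the true parameters (mu = 2 and the correct
# log-normalizer of a unit-variance Gaussian, -0.5 * ln(2 pi)) over wrong ones:
j_true = nce_objective(x, y, mu=2.0, c=-0.5 * np.log(2.0 * np.pi))
j_wrong = nce_objective(x, y, mu=0.0, c=0.0)
```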


ImageNet Classification with Deep Convolutional Neural Networks
  • The original AlexNet paper by Krizhevsky et al. from NeurIPS 2012 that started it all. This trail-blazer demonstrated that large-scale deep supervised learning could dominate image classification.
  • They trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes.
  • On the test data, they achieved top-1 and top-5 error rates of 37.5% and 17.0%, which was considerably better than the previous state-of-the-art results.
  • The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
  • To make training faster, they used non-saturating neurons (ReLUs) and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the fully-connected layers, they employed a then-new regularization method, dropout, that proved to be very effective.
  • The following figure from the paper shows an illustration of the architecture of their CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers.
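The regularization method in question is dropout. A minimal numpy sketch of the modern "inverted" variant is below; note that AlexNet used plain dropout with test-time scaling, whereas inverted dropout (shown here) scales during training so test time needs no correction:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: during training each unit is zeroed with probability
    p_drop and survivors are scaled by 1/(1 - p_drop), so activations keep the
    same expected value and no rescaling is needed at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((10000,))
h_train = dropout(h)            # roughly half zeroed, survivors scaled to 2.0
h_test = dropout(h, train=False)  # identity at test time
```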

3D Convolutional Neural Networks for Human Action Recognition
  • This paper by Ji et al. from ASU and NEC Labs in IEEE PAMI 2012 introduced 3D CNNs.
  • Their problem statement is the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled.
  • Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, at the time such models were limited to handling 2D inputs. This paper develops a novel 3D CNN model for action recognition.
  • This model extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels.
  • They apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.
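The core operation can be sketched as a naive (and slow) numpy loop over a single-channel toy volume; the paper's actual architecture stacks many such kernels across multiple input channels:

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution (cross-correlation, as in CNNs) of a (T, H, W)
    video volume with a (kt, kh, kw) kernel. The kernel slides along time as
    well as space, so each response mixes information from several adjacent
    frames -- this is how motion information enters the features."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

video = np.random.default_rng(0).standard_normal((9, 32, 32))  # 9 frames

# A hand-built kernel that responds to change over time (temporal gradient):
temporal_edge = np.zeros((3, 3, 3))
temporal_edge[0], temporal_edge[2] = -1.0, 1.0

response = conv3d(video, temporal_edge)
static = conv3d(np.ones((5, 6, 6)), temporal_edge)  # no motion -> zero response
```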


Visualizing and Understanding Convolutional Networks
  • This legendary paper by Zeiler and Fergus from the Courant Institute, NYU in 2013 asks two questions: why do CNNs perform so well on image classification, and how might they be improved? The paper seeks to address both.
  • They introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
  • They also perform an ablation study to discover the performance contribution from different model layers. This enables them to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
  • They show their ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Learning Factored Representations in a Deep Mixture of Experts
  • Mixtures of Experts combine the outputs of several “expert” networks, each of which specializes in a different part of the input space. This is achieved by training a “gating” network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time.
  • This paper by Eigen et al. from Google and NYU Courant in 2013 extends the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
  • On a randomly translated version of the MNIST dataset, they find that the Deep Mixture of Experts automatically learns to develop location-dependent (“where”) experts at the first layer, and class-specific (“what”) experts at the second layer. In addition, they see that different combinations of experts are used when the model is applied to a dataset of speech monophones, demonstrating effective use of all expert combinations.
  • The figure below from the paper shows (a) Mixture of Experts; (b) Deep Mixture of Experts with two layers.
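A minimal two-layer sketch with random, untrained linear experts (shapes and expert counts are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, gate_W, expert_Ws):
    """One (untrained) mixture-of-experts layer: a softmax gating network maps
    each input to a distribution over experts, and the layer output is the
    gate-weighted sum of the linear experts' outputs."""
    g = softmax(x @ gate_W)                                    # (batch, n_experts)
    expert_out = np.stack([x @ W for W in expert_Ws], axis=1)  # (batch, n_experts, d_out)
    return np.einsum('bn,bnd->bd', g, expert_out), g

d_in, d_hid, d_out, n_exp = 8, 6, 4, 3
x = rng.standard_normal((5, d_in))

# Stacking two such layers gives a Deep Mixture of Experts: each input is
# routed through one of n_exp * n_exp effective expert combinations while the
# parameter count stays modest.
h, g1 = moe_layer(x, rng.standard_normal((d_in, n_exp)),
                  [rng.standard_normal((d_in, d_hid)) for _ in range(n_exp)])
y, g2 = moe_layer(h, rng.standard_normal((d_hid, n_exp)),
                  [rng.standard_normal((d_hid, d_out)) for _ in range(n_exp)])
```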


Generative Adversarial Networks
  • This paper by Goodfellow et al. from NeurIPS 2014 proposes a new framework called Generative Adversarial Networks (GANs) that estimates generative models via an adversarial process that corresponds to a zero-sum minimax two-player game. In this process, two models are simultaneously trained: a generative model \(G\) that captures the data distribution, and a discriminative model \(D\) that estimates the probability that a sample came from the training data rather than \(G\). The training procedure for \(G\) is to maximize the probability of \(D\) making a mistake. In the space of arbitrary functions \(G\) and \(D\), a unique solution exists, with \(G\) recovering the training data distribution and \(D\) equal to \(\frac{1}{2}\) everywhere. In the case where \(G\) and \(D\) are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
  • There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
  • Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
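For a fixed generator, the value function \(V(D, G) = \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]\) is maximized by \(D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))\), which is \(\frac{1}{2}\) everywhere exactly when \(p_g = p_{data}\). This can be checked numerically on toy Gaussians:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def optimal_discriminator(x, p_data, p_gen):
    """For a fixed generator, the discriminator maximizing the GAN value
    function is D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data(x) / (p_data(x) + p_gen(x))

xs = np.linspace(-3, 3, 101)
p_data = lambda x: gaussian_pdf(x, 0.0, 1.0)

# Generator has recovered the data distribution: D*(x) = 1/2 everywhere.
d_converged = optimal_discriminator(xs, p_data, lambda x: gaussian_pdf(x, 0.0, 1.0))
# Generator is still wrong: D* deviates from 1/2.
d_apart = optimal_discriminator(xs, p_data, lambda x: gaussian_pdf(x, 2.0, 1.0))
```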


Very Deep Convolutional Networks for Large-Scale Image Recognition
  • This paper by Simonyan and Zisserman from Oxford’s Visual Geometry Group in ICLR 2015 proposed the VGG architecture. They showed that a significant performance improvement can be achieved by pushing the depth to 16-19 weight layers, i.e., VGG-16 and VGG-19.
  • The main principle is that a stack of \(3 \times 3\) convolution filters is better than a single \(7 \times 7\) layer. Firstly, the stack uses three non-linear activations (instead of one), which makes the function more discriminative. Secondly, the \(3 \times 3\) design decreases the number of parameters: for \(C\) input and output channels, you need \(3 \times (3^2)C^2 = 27C^2\) weights, compared to a \(7 \times 7\) conv layer which would require \(1 \times (7^2)C^2 = 49C^2\) parameters (81% more).
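A quick numeric check of that parameter arithmetic:

```python
# Weights for C-in, C-out convolutions (biases ignored):
C = 256
stack_3x3 = 3 * (3 * 3) * C * C    # three stacked 3x3 layers: 27 * C^2
single_7x7 = 1 * (7 * 7) * C * C   # one 7x7 layer:            49 * C^2

# 49/27 - 1 ~= 0.81, i.e. the 7x7 layer needs ~81% more parameters,
# while covering the same 7x7 effective receptive field.
overhead = single_7x7 / stack_3x3 - 1.0
```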
Going Deeper with Convolutions
  • This paper by Szegedy et al. from Google in CVPR 2015 introduced the Inception (also known as GoogLeNet or InceptionNet) architecture which achieved state of the art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.
  • Ideas from the paper:
    • Increasing the depth (number of layers) is not the only way to make a model bigger. What about increasing both the depth and width of the network while keeping computation at a manageable level? This time the inspiration comes from the human visual system, wherein information is processed at multiple scales and then aggregated locally. How do you achieve this without a memory explosion? The answer is with \(1 \times 1\) convolutions! Their main purpose is channel dimensionality reduction: \(1 \times 1\) convolutions are used to compute reductions before the computationally expensive convolutions (\(3 \times 3\) and \(5 \times 5\)). Inception uses convolutions of different kernel sizes (\(5 \times 5\), \(3 \times 3\), \(1 \times 1\)) to capture details at multiple scales.
    • To enable concatenation of features convolved with different kernels, they pad the output to make it the same size as the input. For single-stride convolutions without dilation, the output size is \(out = in + 2p - k + 1\), so padding \(p\) and kernel size \(k\) must satisfy \(p = (k-1)/2\) for \(out = in\) (i.e., for the input and output to have the same spatial dimensions).
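A small sketch checking that this padding rule preserves spatial size for each branch's kernel (toy sizes, not the actual Inception configuration):

```python
def conv_out_size(n_in, k, p, s=1):
    """Output spatial size of a convolution with no dilation:
    floor((n + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1

def same_padding(k):
    """Size-preserving padding for stride-1, odd-kernel convs: p = (k - 1) / 2."""
    return (k - 1) // 2

# Inception concatenates branches with different kernel sizes along the channel
# axis, so every branch must keep the input's spatial dimensions:
sizes = [conv_out_size(28, k, same_padding(k)) for k in (1, 3, 5)]
```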
FaceNet: A Unified Embedding for Face Recognition and Clustering
  • This paper by Schroff et al. from Google in 2015 proposes FaceNet, a system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
  • Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, they use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of their approach is much greater representational efficiency: they achieve state-of-the-art face recognition performance using only 128 bytes per face.
  • Previous face recognition approaches based on deep networks use a classification layer trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network. In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Their triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin.
  • Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, they present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, they also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
  • The triplet loss minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities and encourages a relative distance constraint. Specifically, the Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Thus, the network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances. Once this embedding has been produced, downstream tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
  • On the widely used Labeled Faces in the Wild (LFW) dataset, their system achieves a new record accuracy of 99.63%; on YouTube Faces DB it achieves 95.12%. This cuts the error rate in comparison to the best published result by 30% on both datasets.
  • They explore two different deep convolutional network architectures that have been recently used to great success in the computer vision community. The first architecture is based on the Zeiler & Fergus model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses, which reduces the number of parameters by up to 20 times and has the potential to reduce the number of FLOPS required for comparable performance.
  • They also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.
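A numpy sketch of the triplet loss on L2-normalized toy embeddings (the margin value and batch size here are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over a batch of embeddings: sum over triplets of
    max(0, ||a - p||^2 - ||a - n||^2 + margin), pulling same-identity pairs
    together and pushing different-identity pairs apart by a margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.sum(np.maximum(0.0, d_pos - d_neg + margin))

def normalize(x):
    # FaceNet embeddings live on the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
a = normalize(rng.standard_normal((4, 128)))
p = normalize(a + 0.05 * rng.standard_normal((4, 128)))  # same identity: close
n = normalize(rng.standard_normal((4, 128)))             # different identity: far

loss_easy = triplet_loss(a, p, n)  # positives already closer than negatives
loss_hard = triplet_loss(a, n, p)  # swapped roles: violates the margin
```

Triplet selection matters because easy triplets like the first case contribute (near-)zero loss; online mining seeks the violating ones.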
Distilling the Knowledge in a Neural Network
  • This paper by Hinton et al. from Google in NeurIPS 2014 starts from the observation that a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
  • Caruana et al. have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and the authors develop this approach further using a different compression technique. They achieve some surprising results on MNIST and show that they can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. They also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. This shows that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.
  • The results show that on MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, they have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
  • For really big neural networks, it can be infeasible even to train a full ensemble, but the authors show that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.
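The distillation objective can be sketched as cross-entropy between temperature-softened teacher and student distributions (toy logits below; the full recipe also mixes in a loss against the true labels, omitted here):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T produces a softer distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=3.0):
    """Cross-entropy between the teacher's temperature-softened class
    probabilities and the student's, averaged over the batch. Raising T
    exposes the 'dark knowledge' in the teacher's small probabilities for
    wrong classes, which plain hard labels throw away."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))

teacher = np.array([[5.0, 2.0, -1.0]])
matched = distillation_loss(teacher.copy(), teacher)           # student mimics teacher
mismatched = distillation_loss(np.array([[-1.0, 2.0, 5.0]]), teacher)
```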
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
  • A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable.
  • This paper by Sohl-Dickstein et al. from Surya Ganguli’s lab at Stanford in 2015 develops an approach that simultaneously achieves both flexibility and tractability. They introduce a novel algorithm for modeling probability distributions that enables exact sampling and evaluation of probabilities, and demonstrate its effectiveness on a variety of toy and real datasets, including challenging natural image datasets. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process.
  • They then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows them to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model.
  • For each of the tests they conduct, they use a similar basic algorithm, showing that their method can accurately model a wide variety of distributions. Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation are often extremely expensive. The core of their algorithm consists of estimating the reversal of a Markov diffusion chain which maps data to a noise distribution; as the number of steps is made large, the reversal distribution of each diffusion step becomes simple and easy to estimate.
  • The result is an algorithm that can learn a fit to any data distribution, but which remains tractable to train, exactly sample from, and evaluate, and under which it is straightforward to manipulate conditional and posterior distributions.
  • Code.
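The forward (structure-destroying) half of the process can be sketched in numpy; the noise schedule and data below are illustrative choices for this sketch, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x0, betas):
    """Iteratively destroy structure: at each step t, shrink the signal and
    add Gaussian noise, x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps.
    As t grows the marginal converges to an analytically tractable N(0, I);
    the generative model is then trained to reverse this chain step by step,
    each small reversal step being simple and easy to estimate."""
    x = x0.copy()
    trajectory = [x.copy()]
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        trajectory.append(x.copy())
    return trajectory

# Structured data: a tight cluster far from the origin.
x0 = 5.0 + 0.1 * rng.standard_normal((2000, 2))
traj = forward_diffusion(x0, betas=np.linspace(0.02, 0.3, 40))
```

By the last step the cluster's structure is gone and the samples are close to a standard normal.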


Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
  • In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention.
  • This paper by Radford et al. in ICLR 2016 helps bridge the gap between the success of CNNs for supervised learning and unsupervised learning. They introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning.
  • Training on various image datasets, they show convincing evidence that their deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator.
  • Additionally, they use the learned features for novel tasks - demonstrating their applicability as general image representations.
Rethinking the Inception Architecture for Computer Vision
  • This paper by Szegedy et al. from Google in CVPR 2016 proposed InceptionV2 and InceptionV3, improving the Inception model based on the following principles:
    • Using the same principle as VGG, the authors factorized \(5 \times 5\) and \(7 \times 7\) (in InceptionV3) convolutions into two and three sequential \(3 \times 3\) convolutions respectively. This improves computational speed and uses far fewer parameters.
    • Used spatially separable convolutions. Simply, a \(3 \times 3\) kernel is decomposed into two smaller ones: a \(1 \times 3\) and a \(3 \times 1\) kernel, which are applied sequentially.
    • Widened the inception modules (more filters).
    • Distributed the computational budget in a balanced way between the depth and width of the network.
    • Added batch normalization.
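The spatial factorization relies on the fact that applying a \(3 \times 1\) and then a \(1 \times 3\) kernel equals one \(3 \times 3\) convolution with their outer product, at 3 + 3 = 6 weights instead of 9. A naive numpy check (a general \(3 \times 3\) kernel is only exactly separable when it has rank 1; the network simply learns within this factorized family):

```python
import numpy as np

def conv2d_valid(img, k):
    """Naive 'valid' 2D cross-correlation."""
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((10, 10))

col = rng.standard_normal((3, 1))   # 3x1 kernel
row = rng.standard_normal((1, 3))   # 1x3 kernel
k_full = col @ row                  # rank-1 3x3 kernel (outer product)

two_step = conv2d_valid(conv2d_valid(img, col), row)  # sequential small convs
one_step = conv2d_valid(img, k_full)                  # single 3x3 conv
```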
Deep Residual Learning for Image Recognition
  • The ResNet paper by He et al. from Microsoft Research in CVPR 2016; one of the most-cited papers across several AI fields.
  • The issue of vanishing gradients when training a deep neural network was addressed with two tricks:
    • Batch normalization and,
    • Short skip connections
  • Instead of \(H(x) = F(x)\), the skip connection leads to \(H(x) = F(x) + x\), which implies that the model is learning the difference (i.e., residual), \(F(x) = H(x) - x\).
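A minimal numpy sketch of the residual formulation. The payoff of learning the residual: if the identity mapping is optimal, the weights can simply be driven to zero rather than having to fit an identity through stacked nonlinearities (the two-matrix block below is a toy stand-in for the paper's conv blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """H(x) = F(x) + x: the block only has to learn the residual
    F(x) = H(x) - x on top of the identity shortcut."""
    F = relu(x @ W1) @ W2
    return F + x

d = 16
x = rng.standard_normal((4, d))

# With zero weights, F(x) = 0 and the block is exactly the identity --
# the 'do nothing' solution is trivially reachable.
identity_out = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```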
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  • State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.
  • This paper by Ren et al. from the University of Science and Technology of China and Microsoft Research in 2016 proposes a Region Proposal Network (RPN) for efficient and accurate region proposal generation. By sharing full-image convolutional features with the down-stream detection network, the region proposal step becomes nearly cost-free.
  • An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
  • They further merge RPN and Fast R-CNN into a single network by sharing their convolutional features – using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look.
  • For the very deep VGG-16 model, their detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks.
  • Faster R-CNN enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
  • Code.
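A sketch of RPN-style anchor generation: at every feature-map position, a fixed set of reference boxes is placed, and the network predicts objectness and box offsets relative to each. The scales and aspect ratios follow the paper's defaults for a VGG-16 backbone; the helper name and the (x1, y1, x2, y2) box format are this sketch's assumptions:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place k = len(scales) * len(ratios) anchors centered on each
    feature-map cell, mapped back to image coordinates via the stride.
    Each anchor has area scale^2 and width/height ratio r."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# A toy 4x4 feature map with stride 16 -> 4 * 4 * 9 = 144 anchors.
anchors = make_anchors(4, 4, stride=16)
```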

You Only Look Once: Unified, Real-Time Object Detection
  • Prior work on object detection repurposes classifiers to perform detection.
  • This paper by Redmon et al. from Ali Farhadi’s group at UW in 2016 presents YOLO, a new approach to object detection which frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
  • Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
  • YOLO is extremely fast and can thus be utilized for real-time object detection. The base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
  • Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.


Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
  • This paper by Szegedy et al. from Google in AAAI 2017 introduced the latest versions of the Inception model – InceptionV4 and Inception-ResNet.
Photo-Realistic Single Image Super-Resolution using a GAN
  • This paper by Ledig et al. from Twitter in CVPR 2017 applied GANs for single image super-resolution (SISR).
Understanding intermediate layers using linear classifier probes
  • Neural network models have a notorious reputation for being black boxes.
  • This paper by Alain and Bengio from Mila and the University of Montreal in ICLR 2017 proposes to monitor the features at every layer of a model and measure how suitable they are for classification.
  • They use linear classifiers, which they refer to as “probes”, trained entirely independently of the model itself. This helps them better understand the roles and dynamics of the intermediate layers. They demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems.
  • They apply this technique to the popular models Inception v3 and Resnet-50. Among other things, they observe experimentally that the linear separability of features increases monotonically along the depth of the model.
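A toy version of a probe, fit with a least-squares linear classifier (the paper trains logistic-regression probes) on stand-in features for an "early" and a "late" layer; the XOR-style features are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(features, labels):
    """A linear probe: fit a linear classifier (least squares against one-hot
    targets, a simple stand-in for logistic probes) on frozen features and
    report training accuracy -- a measure of how linearly separable the
    features are at that layer."""
    Y = np.eye(labels.max() + 1)[labels]
    W, *_ = np.linalg.lstsq(features, Y, rcond=None)
    return np.mean((features @ W).argmax(axis=1) == labels)

# Toy stand-ins for two layers of a frozen model: 'early' features are raw
# XOR-like inputs (not linearly separable); 'late' features add a product
# feature that makes the classes linearly separable.
x = rng.uniform(-1, 1, (500, 2))
labels = (x[:, 0] * x[:, 1] > 0).astype(int)
early = x
late = np.column_stack([x, x[:, 0] * x[:, 1]])

acc_early = probe_accuracy(early, labels)
acc_late = probe_accuracy(late, labels)
```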
Image-to-Image Translation with Conditional Adversarial Networks
  • Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems.
  • This paper by Isola et al. from UC Berkeley in CVPR 2017 introduces pix2pix, a conditional adversarial network-based framework for image-to-image translation.
  • These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations.
  • They demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
  • The figure below from the paper shows the results of the method on several inputs. In each case, they use the same architecture and objective, and simply train on different data.

Improved Image Captioning via Policy Gradient optimization of SPIDEr
  • Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality.
  • Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for.
  • This paper by Liu et al. from Oxford and Google in ICCV 2017 shows how to use a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr (a combination they call SPIDEr): the SPICE score ensures their captions are semantically faithful to the image, while CIDEr score ensures their captions are syntactically fluent.
  • The proposed PG method improves on the prior MIXER approach, by using Monte Carlo rollouts instead of mixing MLE training with PG. They show empirically that SPIDEr leads to easier optimization and improved results compared to MIXER.
  • Finally, they show that using their PG method they can optimize any of the metrics, including the proposed SPIDEr metric which results in image captions that are strongly preferred by human raters compared to captions generated by the same model but trained to optimize MLE or the COCO metrics.
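The underlying policy-gradient machinery can be sketched with a score-function (REINFORCE) estimator over a toy action set; the actual method samples captions token by token and plugs the non-differentiable SPIDEr metric in as the reward, using Monte Carlo rollouts to reduce variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_gradient(theta, reward_fn, n_rollouts=5000):
    """REINFORCE estimate of d/dtheta E[R(a)] for a softmax policy over a
    small discrete action set, with the mean reward as a variance-reduction
    baseline: grad ~= mean over samples of (R(a) - b) * grad log pi(a)."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_rollouts, p=pi)
    rewards = np.array([reward_fn(a) for a in actions])
    baseline = rewards.mean()
    grad = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        score = -pi.copy()   # grad of log pi(a) w.r.t. theta...
        score[a] += 1.0      # ...for a softmax policy
        grad += (r - baseline) * score
    return grad / n_rollouts

# Toy 'metric': only action 2 earns reward. The estimated gradient should
# push probability mass toward it.
theta = np.zeros(4)
grad = reinforce_gradient(theta, lambda a: 1.0 if a == 2 else 0.0)
```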


From Recognition to Cognition: Visual Commonsense Reasoning
  • Visual understanding goes well beyond object recognition. With one glance at an image, humans can effortlessly imagine the world beyond the pixels: for instance, they can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world.
  • This paper by Zellers et al. from UW in CVPR 2019 formalizes this task as Visual Commonsense Reasoning (VCR). Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
  • Next, they introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
  • To move towards cognition-level understanding, they present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and they provide analysis that suggests avenues for future work.
  • Website with models/datasets.
Focal Loss for Dense Object Detection
  • The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.
  • This paper by Lin et al. from Facebook AI Research in ICCV 2017 investigates why this is the case and introduces the focal loss. They discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.
  • Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
  • Their novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of their loss, they design and train a simple dense detector they call RetinaNet.
  • Their results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
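A numpy sketch of the binary focal loss (the settings \(\gamma = 2\), \(\alpha = 0.25\) follow the paper's typical configuration):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class. The modulating
    factor (1 - p_t)^gamma decays to zero as confidence in the correct class
    grows, down-weighting the flood of easy negatives in dense detection."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

ce_easy = -np.log(0.99)               # plain cross-entropy on an easy negative
fl_easy = float(focal_loss(0.01, 0))  # same easy negative (p = 0.01, y = 0)
fl_hard = float(focal_loss(0.9, 0))   # confidently wrong negative keeps its loss
```

The easy, well-classified example is down-weighted by orders of magnitude relative to cross-entropy, while the hard misclassified one still contributes a large loss.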
Relational inductive biases, deep learning, and graph networks
  • Recent advances in AI, propelled by deep learning, have been transformative across many important domains. Despite this, a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.
  • This paper by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in the table below from the paper.

  • They argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and advocate for marrying complementary approaches which draw on ideas from human cognition, traditional computer science, standard engineering practice, and modern deep learning. Just as biology uses nature and nurture cooperatively, they reject the false choice between “hand-engineering” and “end-to-end” learning, and instead advocate for an approach which benefits from their complementary strengths.
  • They investigate how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them.
  • They explore flexible learning-based approaches which implement strong relational inductive biases to capitalize on explicitly structured representations and computations, and present a new building block for the AI toolkit – the graph neural networks (GNNs).
  • GNNs generalize and extend various approaches for neural networks that operate on graphs, and provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. GNNs are designed to promote building complex architectures using customizable graph-to-graph building blocks, and their relational inductive biases support relational reasoning, combinatorial generalization, and improved sample efficiency over other standard machine learning building blocks. This would help lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
Squeeze-and-Excitation Networks
  • The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy.
  • This paper by Hu et al. from the Chinese Academy of Sciences, University of Macau, and the Visual Geometry Group at the University of Oxford focuses instead on the channel relationship and proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
  • They show that these blocks can be stacked together to form SENet architectures that generalize extremely effectively across different datasets.
  • They further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of the authors’ ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.
  • The following figure from the paper shows a squeeze-and-excitation block.

  • The following figure from the paper shows (first half) the schema of the original Inception module (left) and the SEInception module (right); (second half) the original Residual module (left) and the SEResNet module (right).

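  • The squeeze step (global average pooling) and excitation step (a two-layer bottleneck with a sigmoid gate) can be sketched in NumPy as follows; the single-image shapes, reduction ratio, and random weights are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a feature map x of shape (C, H, W).
    w1: (C//r, C) reduction weights, w2: (C, C//r) expansion weights."""
    z = x.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0)            # excitation: FC + ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))  # FC + sigmoid -> per-channel gates in (0, 1)
    return x * s[:, None, None]          # recalibrate each channel by its gate

rng = np.random.default_rng(0)
C, r = 8, 2                              # channels and reduction ratio (illustrative)
x = rng.normal(size=(C, 16, 16))
out = se_block(x, rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r)))
```

Because the gates lie in (0, 1), the block can only attenuate channels, steering capacity toward the informative ones.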
When Does Label Smoothing Help?
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Müller et al. from Google Brain in NeurIPS 2019 studies label smoothing in terms of the effects on penultimate layer representations, model calibration as well as knowledge distillation (Hinton et al., 2015).
  • The figure below from the paper shows the visualization of the penultimate layer representations of the following models (trained with label smoothing, denoted by “w/ LS” in the figure; and without label smoothing, denoted by “w/o LS” in the figure) and datasets:
    • First row: AlexNet (Krizhevsky et al., 2012) on the CIFAR-10 (Krizhevsky, 2009) dataset, with the visualization of three semantically different classes.
    • Second row: ResNet-56 (He et al., 2016) on the CIFAR-100 (Krizhevsky, 2009) dataset, with the visualization of three semantically different classes.
    • Third row: Inception-v4 (Szegedy et al., 2017) on the ImageNet (Russakovsky et al., 2014) dataset, with the visualization of three semantically different classes.
    • Fourth row: Inception-v4 on the ImageNet dataset, with the visualization of two semantically similar classes and a semantically different one.
  • It can be observed that with label smoothing, the activations of the same class are more closely tightened together compared to training without label smoothing, which is because training with label smoothing encourages the penultimate layer representations of the same class to be equally distant from other classes.
  • In order to study the effects of label smoothing on model calibration, the authors conducted experiments on image classification and machine translation tasks. It was observed that training with label smoothing could reduce the expected calibration error (Guo et al., 2017) compared to training without label smoothing.
  • In addition, the authors noticed that in knowledge distillation, while a teacher model trained with label smoothing could have better accuracy for the teacher model itself, it could produce student models with worse performance.
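  • For reference, label smoothing simply mixes the one-hot target with a uniform distribution over the \(K\) classes; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def smooth_labels(y, num_classes, alpha=0.1):
    """Replace a one-hot target with (1 - alpha) * one_hot + alpha / K."""
    one_hot = np.eye(num_classes)[y]
    return (1 - alpha) * one_hot + alpha / num_classes

# With K = 5 and alpha = 0.1, the true class gets 0.92 and every other class 0.02.
t = smooth_labels(np.array([2]), num_classes=5, alpha=0.1)[0]
```

Training against these softened targets is what encourages the penultimate-layer representations of each class to sit at an equal distance from the other classes' templates.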
Unsupervised Feature Learning via Non-Parametric Instance Discrimination
  • This paper by Wu et al. from UC Berkeley, Chinese University of Hong Kong, and Amazon Rekognition in CVPR 2018 as a spotlight paper introduces a novel method for unsupervised feature learning in neural networks, leveraging non-parametric instance discrimination.
  • The unique approach involves treating each image as a distinct class and employing noise-contrastive estimation (NCE) to address the computational challenges posed by the vast number of instance classes.
  • A non-parametric softmax classifier is proposed, which uses direct feature representation instead of a class weight vector, allowing for precise instance comparisons. This involves projecting image features into a 128-dimensional space and normalizing them. To efficiently store these representations, the concept of a memory bank is introduced.
  • To reduce the computational burden of the softmax function over numerous classes, the paper implements NCE, which approximates the full softmax distribution and cuts computational complexity from \(O(n)\) to \(O(1)\) per sample, without sacrificing performance. To stabilize the learning process, proximal regularization is applied. This is crucial as each instance class is visited only once per epoch, aiding in smoother learning dynamics and faster convergence.
  • The paper also explores an alternative approach involving storing representations from previous batches in a queue to be used as negative examples in the loss (Wu et al., 2018). This method allows for smaller batch sizes but introduces asymmetry between “queries” (generated from current batch elements) and “keys” (stored in the queue). Only “queries” undergo gradient backpropagation, treating “key” representations as fixed. However, this leads to performance drops when the network rapidly evolves during training. To address this, He et al. (2020) proposed MoCo, a technique using two networks: one for keys and one for queries, with the keys’ network updating more slowly. This offers a more stable learning dynamic, as the query network is updated using backpropagation and stochastic gradient descent.
  • The following figure from the paper shows the pipeline of their unsupervised feature learning approach. A backbone CNN encodes each image as a feature vector, which is projected to a 128-dimensional space and L2-normalized. The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.

  • The method exhibits state-of-the-art performance in unsupervised image classification on standard datasets like CIFAR-10 and ImageNet, notably achieving a top-1 accuracy of 46.5% on ImageNet.
  • The learned features demonstrate strong generalization in semi-supervised learning and object detection, showcasing effective transfer learning.
  • The scalability and efficiency of the approach are highlighted by the compact 128-dimensional representation, requiring only 600MB for a million images, enabling rapid nearest neighbor retrieval at runtime.
  • Code.
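  • The non-parametric softmax at the heart of the method treats each stored, normalized feature as its own class "weight vector"; a NumPy sketch with the paper's temperature \(\tau = 0.07\) (the full method approximates this distribution with NCE rather than computing it exactly, and the random memory bank here is illustrative):

```python
import numpy as np

def instance_probs(v, memory_bank, tau=0.07):
    """Non-parametric softmax over instances: P(i | v) is proportional to
    exp(v_i . v / tau), where v_i is the stored feature of instance i."""
    v = v / np.linalg.norm(v)
    bank = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    logits = bank @ v / tau
    logits -= logits.max()               # numerical stability
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 128))      # one 128-d memory-bank entry per image
p = instance_probs(bank[42], bank)       # query with instance 42's own feature
```

Querying with an instance's own feature should place most probability mass on that instance, which is the objective instance discrimination optimizes.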


Objects as Points
  • This paper by Zhou et al. from UT Austin in 2019 proposes CenterNet, a center point-based object detection approach, which is end-to-end differentiable, simpler, faster, and more accurate than other competitive bounding box based detectors.
  • CenterNet is an anchorless object detection architecture. As such, this structure has an important advantage in that it replaces the classical NMS (Non Maximum Suppression) step during post-processing. This mechanism enables faster inference.
  • Where most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each, which is wasteful, inefficient, and requires additional post-processing, CenterNet models an object as a single point — the center point of its bounding box. CenterNet object detector builds on successful keypoint estimation networks and uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, depth and extent, and pose in a single forward pass. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection.
  • They compare CenterNet with other state-of-the-art detectors on the COCO test-dev set: with multi-scale evaluation, CenterNet with Hourglass-104 achieves an AP of 45.1%, outperforming all existing one-stage detectors. Sophisticated two-stage detectors are more accurate, but also slower.
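  • The NMS-free post-processing reduces to extracting local maxima from the predicted center heatmap (the paper does this with a 3×3 max-pool); a toy, loop-based sketch with an illustrative threshold:

```python
import numpy as np

def heatmap_peaks(heat, k=3, thresh=0.3):
    """Keep a pixel iff it equals the max of its k x k neighborhood
    (a max-pool) and exceeds a score threshold -- no NMS needed."""
    H, W = heat.shape
    pad = k // 2
    padded = np.pad(heat, pad, constant_values=-np.inf)
    keep = []
    for y in range(H):
        for x in range(W):
            window = padded[y:y + k, x:x + k]
            if heat[y, x] >= thresh and heat[y, x] == window.max():
                keep.append((y, x))
    return keep

heat = np.zeros((8, 8))
heat[2, 3] = 0.9
heat[2, 4] = 0.5      # suppressed: a stronger peak sits inside its 3x3 window
heat[6, 6] = 0.7
peaks = heatmap_peaks(heat)
```

Each surviving peak is then paired with the regressed size/offset at that location to form a box, in a single forward pass.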
RandAugment: Practical automated data augmentation with a reduced search space
  • Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models.
  • Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images.
  • An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models.
  • This paper by Cubuk et al. from Google Brain in 2019 demonstrates that previous methods of learned augmentation suffer from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size or the model size leads to sub-optimal performance. In previous work, scaling learned data augmentation to larger datasets and models has been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data; population based augmentation was not reported for large-scale problems.
  • They propose RandAugment, a simple parameterization for targeting augmentation to particular model and dataset sizes, which seeks to remove both of the aforementioned obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes.
  • RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet without a separate search for data augmentation policies.
  • The proposed method scales quite well to datasets such as ImageNet and COCO while incurring minimal computational cost (e.g. 2 hyperparameters), but notable predictive performance gains.
  • On the ImageNet dataset, they achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO.
  • Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
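  • The entire "search space" is just two integers: N, the number of transforms applied per image, and M, a single shared magnitude. A minimal sketch of the policy sampler (the transform names below are placeholders, not the paper's exact list):

```python
import random

# Placeholder transform names for illustration; RandAugment draws uniformly
# from a fixed list of image ops, all applied at the same global magnitude M.
TRANSFORMS = ["identity", "rotate", "shear_x", "solarize", "color", "contrast"]

def rand_augment_policy(n, m, rng=random):
    """Sample N transforms uniformly at random, each at magnitude M."""
    return [(rng.choice(TRANSFORMS), m) for _ in range(n)]

random.seed(0)
policy = rand_augment_policy(n=2, m=9)   # e.g. N = 2 ops at magnitude M = 9
```

Because (N, M) can be tuned directly on the target task with a tiny grid search, no separate proxy-task search phase is needed.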
Semantic Image Synthesis with Spatially-Adaptive Normalization
  • This paper by Park et al. from UC Berkeley, NVIDIA and MIT CSAIL proposes a spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers.
  • They show that this is suboptimal as the normalization layers tend to “wash away” semantic information.
  • To address the issue, they propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned affine transformation. The proposed normalization leads to the first semantic image synthesis model that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes.
  • Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts.
  • Finally, their model allows user control over both semantic and style and demonstrate its application for multi-modal and guided image synthesis.
  • In the paper and the demo video, they showed GauGAN, an interactive app that generates realistic landscape images from the layout users draw. The model was trained on scraped landscape images.
  • Code; project page; online interactive demo of GauGAN; GauGAN360.
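  • The core idea can be sketched in NumPy: normalize the activations with parameter-free statistics, then modulate them with per-pixel scale and bias maps computed from the segmentation layout. In this simplified sketch the modulation is a 1×1 projection of the one-hot layout and the statistics are per-channel over a single image; the paper uses small convnets and batch statistics:

```python
import numpy as np

def spade(x, seg, gamma_w, beta_w, eps=1e-5):
    """Spatially-adaptive normalization on x of shape (C, H, W).
    seg: one-hot layout (L, H, W); gamma_w, beta_w: (C, L) projections that
    stand in for the paper's convs. Unlike BatchNorm's learned scalars,
    gamma/beta vary per pixel, so the layout survives normalization."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)            # parameter-free normalization
    gamma = np.einsum('cl,lhw->chw', gamma_w, seg)   # per-pixel scale from layout
    beta = np.einsum('cl,lhw->chw', beta_w, seg)     # per-pixel bias from layout
    return x_hat * (1 + gamma) + beta

rng = np.random.default_rng(0)
C, L, H, W = 4, 3, 8, 8
seg = np.zeros((L, H, W)); seg[0, :, :4] = 1; seg[1, :, 4:] = 1  # two-region layout
out = spade(rng.normal(size=(C, H, W)), seg,
            rng.normal(size=(C, L)), rng.normal(size=(C, L)))
```

Injecting the layout through the modulation parameters, rather than only at the input, is what prevents the semantic information from being "washed away".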
Generative Modeling by Estimating Gradients of the Data Distribution
  • This paper by Song and Ermon in NeurIPS 2019 introduces a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching.
  • Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, they perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, they propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold.
  • Their framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons.
  • Their models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, they demonstrate that their models learn effective representations via image inpainting experiments.
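  • The sampler itself is short: at each noise level, run a few Langevin steps using the noise-conditional score, with a step size that shrinks with the noise. A toy NumPy sketch on a 1-D Gaussian whose perturbed score is known in closed form (the step-size schedule and constants here are simplified assumptions, not the paper's tuned values):

```python
import numpy as np

def annealed_langevin(score, x, sigmas, eps=0.05, steps=100, seed=0):
    """Annealed Langevin dynamics: noisy gradient ascent on the log-density
    using the noise-conditional score, at gradually decreasing noise levels.
    The result approximates the data perturbed at the smallest level sigmas[-1]."""
    rng = np.random.default_rng(seed)
    for sigma in sigmas:                               # high noise -> low noise
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2     # per-level step size
        for _ in range(steps):
            x = (x + 0.5 * alpha * score(x, sigma)
                 + np.sqrt(alpha) * rng.normal(size=x.shape))
    return x

# Toy data distribution N(0, 1): its sigma-perturbed score is -x / (1 + sigma^2).
score = lambda x, sigma: -x / (1 + sigma ** 2)
samples = annealed_langevin(score, np.full(2000, 10.0), sigmas=[10.0, 3.0, 1.0])
```

Starting far from the data (at 10.0), the chain is pulled back toward the target distribution as the noise levels anneal, illustrating why the coarse-to-fine schedule helps when the data lies on a low-dimensional manifold.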


Denoising Diffusion Probabilistic Models
  • This paper by Ho et al. from Pieter Abbeel’s lab at UC Berkeley presents high quality image samples using diffusion probabilistic models (also called diffusion models), a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.
  • Their best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and their models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
  • On the unconditional CIFAR10 dataset, they obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, they obtain sample quality similar to ProgressiveGAN.
  • Code.
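  • A key property that makes training tractable is that the forward (noising) process can be sampled in closed form at any timestep: \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\) with \(\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)\). A NumPy sketch using the paper's linear \(\beta\) schedule (the 1-D data is illustrative):

```python
import numpy as np

def q_sample(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for the forward diffusion."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise, alpha_bar

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # the paper's linear schedule, T = 1000
x0 = rng.normal(size=10000)             # toy "data"
x_T, ab_T = q_sample(x0, t=999, betas=betas, rng=rng)
```

At the final step \(\bar{\alpha}_T\) is essentially zero, so \(x_T\) is indistinguishable from pure Gaussian noise, which is exactly the prior the reverse (denoising) process starts from.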
Designing Network Design Spaces
  • This paper by Radosavovic et al. from FAIR in CVPR 2020 presents a new network design paradigm. Their goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level.
  • Their methodology explores the structural aspect of network design and arrives at a low-dimensional design space consisting of simple, regular networks that they call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function.
  • They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes.
  • Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
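  • The quantized linear rule can be sketched directly: generate linear per-block widths, snap them to a geometric grid of ratio \(w_m\), and round to hardware-friendly multiples. The parameter values below are illustrative, and this omits the grouping of equal-width blocks into stages:

```python
import numpy as np

def regnet_widths(depth, w0=48, wa=36.0, wm=2.5, q=8):
    """RegNet's quantized linear width rule: u_j = w0 + wa * j is snapped
    to the grid w0 * wm^s (integer s), then rounded to multiples of q."""
    u = w0 + wa * np.arange(depth)              # linear widths
    s = np.round(np.log(u / w0) / np.log(wm))   # quantize the exponent
    w = w0 * np.power(wm, s)                    # piecewise-constant widths
    return (np.round(w / q) * q).astype(int)

widths = regnet_widths(depth=16)
```

The result is a small set of non-decreasing, divisible widths: the entire network family is described by a handful of scalars instead of per-layer choices.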
Training data-efficient image transformers & distillation through attention
  • Compared to CNNs, vision transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
  • This paper by Touvron et al. from Facebook AI proposes DeiT, a competitive convolution-free transformer that does not require a very large amount of data to be trained, thanks to improved training and, in particular, a novel distillation procedure. DeiT is trained on ImageNet on a single computer in less than 3 days. Their reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
  • They introduce a teacher-student strategy specific to transformers. Using distillation can hamper the performance of neural networks. The student model pursues two different objectives that may be diverging: learning from a labeled dataset (strong supervision) and learning from the teacher. To alleviate this, they introduced a distillation token, which is a learned vector that flows through the network along with the transformed image data. The distillation token cues the model for its distillation output, which can differ from its class output. This new distillation method is specific to Transformers and further improves the image classification performance.
  • It relies on a distillation token ensuring that the student learns from the teacher through attention. They show the interest of this token-based distillation, especially when using a ConvNet as a teacher. This leads us to report results competitive with CNNs for both ImageNet (where they obtain up to 85.2% top-1 accuracy) and when transferring to other tasks.
  • Facebook AI post.
  • Code.
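  • The hard-label variant of the distillation objective is simple: the class token is supervised by the ground-truth label, the distillation token by the teacher's argmax, and the two cross-entropies are averaged. A NumPy sketch with illustrative 3-class logits:

```python
import numpy as np

def deit_hard_distill_loss(class_logits, distill_logits, y_true, teacher_logits):
    """Hard-label distillation: 0.5 * CE(class token, label)
    + 0.5 * CE(distillation token, teacher's predicted label)."""
    def ce(logits, target):
        logits = logits - logits.max()   # numerically stable log-softmax
        return -(logits[target] - np.log(np.exp(logits).sum()))
    y_teacher = int(np.argmax(teacher_logits))
    return 0.5 * ce(class_logits, y_true) + 0.5 * ce(distill_logits, y_teacher)

loss = deit_hard_distill_loss(
    class_logits=np.array([2.0, 0.1, 0.1]),     # class token agrees with the label
    distill_logits=np.array([0.1, 2.0, 0.1]),   # distill token agrees with the teacher
    y_true=0,
    teacher_logits=np.array([0.0, 5.0, 0.0]),   # teacher predicts class 1
)
```

Because the two tokens can legitimately disagree (label vs. teacher), the model is free to learn from both signals instead of averaging them into one head.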
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • This paper by Mildenhall et al. from UC Berkeley, Google and UCSD in ECCV 2020 introduces NeRF, a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
  • Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
  • They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize their representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
  • Project page with videos and code.
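  • The differentiable rendering step is classic volume-rendering quadrature: \(\hat{C} = \sum_i T_i\,(1 - e^{-\sigma_i \delta_i})\,c_i\) with transmittance \(T_i = e^{-\sum_{j<i} \sigma_j \delta_j}\). A single-ray NumPy sketch with an illustrative opaque red sample behind empty space:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one ray.
    sigmas: (N,), colors: (N, 3), deltas: (N,) distances between samples."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                    # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance
    weights = trans * alphas
    return weights @ colors, weights

sigmas = np.array([0.0, 0.0, 50.0])                 # empty, empty, dense
colors = np.array([[0, 0, 0], [0, 0, 0], [1.0, 0.0, 0.0]])
rgb, w = render_ray(sigmas, colors, deltas=np.full(3, 0.5))
```

Every operation here is differentiable in the densities and colors, which is why posed images alone suffice to optimize the underlying network.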
Bootstrap your own latent: A new approach to self-supervised Learning
  • This paper by Grill et al. from DeepMind and Imperial College in 2020 introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning.
  • BYOL learns its representation by predicting previous versions of its outputs, without using negative pairs. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, they train the online network to predict the target network representation of the same image under a different augmented view. At the same time, they update the target network with a slow-moving average of the online network.
  • While state-of-the-art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet, using 30% fewer parameters.
  • They show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.
  • Nevertheless, BYOL remains dependent on existing sets of augmentations that are specific to vision applications. To generalize BYOL to other modalities, it is necessary to obtain similarly suitable augmentations for each of them. Designing such augmentations may require significant effort and expertise. Therefore, automating the search for these augmentations would be an important next step to generalize BYOL to other modalities.
  • BYOL’s architecture is as shown below. BYOL minimizes a similarity loss between \(q_{\theta}\left(z_{\theta}\right)\) and \(\operatorname{sg}\left(z_{\xi}^{\prime}\right)\), where \(\theta\) are the trained weights, \(\xi\) are an exponential moving average of \(\theta\) and \(sg\) means stop-gradient. At the end of training, everything but \(f_{\theta}\) is discarded, and \(y_{\theta}\) is used as the image representation.

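  • The slow-moving target update is a plain exponential moving average, \(\xi \leftarrow \tau \xi + (1 - \tau)\theta\); a minimal sketch with toy one-layer "weights" (the decay value is illustrative):

```python
import numpy as np

def ema_update(target, online, tau=0.99):
    """BYOL's target-network update: the target weights xi track a slow
    exponential moving average of the online weights theta (no gradients)."""
    return {k: tau * target[k] + (1 - tau) * online[k] for k in target}

theta = {"w": np.ones(4)}    # online weights (trained by SGD)
xi = {"w": np.zeros(4)}      # target weights (updated only by EMA)
for _ in range(100):
    xi = ema_update(xi, theta)
```

After k steps of a constant online network, the target reaches \(1 - \tau^k\) of the way there; this lag is what gives the online network a stable, slowly evolving prediction target.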
A Simple Framework for Contrastive Learning of Visual Representations
  • This paper by Chen et al. from Google Research and Hinton’s lab in ICML 2020 presents SimCLR, a simple framework for contrastive learning of visual representations.
  • They simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, they systematically study the major components of their framework and show the effects of different design choices.
  • They show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
  • By combining these findings, SimCLR is able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. SimCLR differs from standard supervised learning on ImageNet only in the choice of data augmentation, the use of a nonlinear head at the end of the network, and the loss function. The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.
  • A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over the previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, SimCLR achieves 85.8% top-5 accuracy, outperforming AlexNet with 100x fewer labels.
  • The following diagram shows the SimCLR framework. Two separate data augmentation operators are sampled from the same family of augmentations (\(t \sim \mathcal{T}\) and \(t^{\prime} \sim \mathcal{T}\)) and applied to each data example to obtain two correlated views. A base encoder network \(f(\cdot)\) and a projection head \(g(\cdot)\) are trained to maximize agreement using a contrastive loss. After training is completed, they throw away the projection head \(g(\cdot)\) and use encoder \(f(\cdot)\) and representation \(\boldsymbol{h}\) for downstream tasks.

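  • SimCLR's contrastive objective (NT-Xent) can be sketched in NumPy: for each of the 2N augmented views in a batch, the positive is the other view of the same image and the remaining 2N − 2 views are negatives. The random "embeddings" below stand in for encoder outputs:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """Normalized temperature-scaled cross entropy over a batch of view pairs.
    z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # a view is not its own negative
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of each positive
    logp = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -logp.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.normal(size=(8, 32)))  # matched views
loss_random = nt_xent(z1, rng.normal(size=(8, 32)))               # unrelated views
```

The loss is lower when paired views agree, which is the signal that shapes the representation; larger batches add more negatives per positive, matching the paper's observation that big batches help.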
Conditional Negative Sampling for Contrastive Learning of Visual Representations
  • Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two views of an image. NCE uses randomly sampled negative examples to normalize the objective.
  • This paper by Wu et al. from Stanford in 2020 shows that choosing difficult negatives, or those more similar to the current instance, can yield stronger representations. To do this, they introduce a family of mutual information estimators called Conditional Noise Contrastive Estimator (CNCE) that sample negatives conditionally – in a “ring” around each positive, by approximating the partition function using samples from a class of conditional distributions. They prove that these estimators lower-bound mutual information, with higher bias but lower variance than NCE.
  • Applied as objectives in contrastive representation learning, CNCE’s estimators yield representations that consistently outperform existing approaches across a spectrum of contrastive objectives, data distributions, and transfer tasks.
  • Experimentally, CNCE applied on top of existing models (IR, CMC, and MoCo) improves accuracy by 2-5% points in each case, measured by linear evaluation on four standard image datasets. Moreover, they find continued benefits when transferring features to a variety of new image distributions from the meta-dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and keypoint detection.
Momentum Contrast for Unsupervised Visual Representation Learning
  • This paper by He et al. from Facebook AI in CVPR 2020 presents Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, MoCo builds a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
  • MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
  • Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded query \(q\) to a dictionary of encoded keys using a contrastive loss, as shown in the diagram below. The dictionary keys \(\left\{k_{0}, k_{1}, k_{2}, \ldots\right\}\) are defined on-the-fly by a set of data samples. The dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling it from the mini-batch size. The keys are encoded by a slowly progressing encoder, driven by a momentum update with the query encoder. This method enables a large and consistent dictionary for learning visual representations.

  • The figure below from the paper shows the conceptual comparison of three contrastive loss mechanisms by illustrating one pair of query and key. The three mechanisms differ in how the keys are maintained and how the key encoder is updated. (a): The encoders for computing the query and key representations are updated end-to-end by back-propagation (the two encoders can be different). (b): The key representations are sampled from a memory bank. (c): MoCo encodes the new keys on-the-fly by a momentum-updated encoder, and maintains a queue (not illustrated in this figure) of keys.

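  • The two mechanisms that define MoCo, the FIFO key queue and the momentum-updated key encoder, fit in a few lines; this sketch uses a toy flat weight vector in place of a real network:

```python
import numpy as np
from collections import deque

class MoCoQueue:
    """MoCo's dictionary: a FIFO queue of encoded keys whose size is
    decoupled from the mini-batch size."""
    def __init__(self, size):
        self.keys = deque(maxlen=size)   # oldest mini-batch dequeued automatically

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder weights drift slowly toward the query encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q (no backprop through keys)."""
    return m * theta_k + (1 - m) * theta_q

queue = MoCoQueue(size=64)
rng = np.random.default_rng(0)
for _ in range(10):                      # ten mini-batches of 16 keys each
    queue.enqueue(rng.normal(size=(16, 128)))
```

The slow momentum (m close to 1) keeps the keys in the queue mutually consistent even though they were encoded at different training steps, which is what makes the large dictionary usable.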
Generative Pretraining from Pixels
  • This paper by Chen et al. from OpenAI in ICML 2020 examines whether the models behind unsupervised representation learning for natural language can learn useful representations for images.
  • It builds on the observation that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, they show that their best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
  • They train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure.
  • Despite training on low-resolution ImageNet without labels, they find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, they achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models.
  • An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of their features.
  • OpenAI article.
Random Erasing Data Augmentation
  • This paper by Zhong et al. from Xiamen University, University of Technology Sydney, Australian National University, and CMU in AAAI 2020 introduces Random Erasing (“RandomErase”), a new data augmentation method for training the convolutional neural network (CNN). In training, Random Erasing randomly selects a rectangle region in an image and erases its pixels with random values.
  • In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models.
  • Albeit simple, Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and yields consistent improvement over strong baselines in image classification, object detection and person re-identification.
  • The figure below from the paper shows examples of random erasing in image classification (a), person re-identification (re-ID) (b), object detection (c) and comparing with different augmentation methods (d). In CNN training, they randomly choose a rectangle region in the image and erase its pixels with random values or the ImageNet mean pixel value. Images with various levels of occlusion are thus generated.
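The erasing procedure is simple enough to sketch directly. Below is a minimal numpy version; the hyperparameter names and defaults (erasing probability, area range, aspect-ratio range) are illustrative, loosely following the paper's, and this is not the authors' reference implementation.

```python
import numpy as np

def random_erase(img, p=0.5, area_range=(0.02, 0.4), aspect_range=(0.3, 3.3), rng=None):
    """Randomly erase a rectangle in `img` (H, W, C) with random pixel values.

    A sketch of Random Erasing: with probability p, pick a rectangle whose
    area and aspect ratio are drawn from the given ranges, and overwrite it.
    """
    rng = rng or np.random.default_rng(0)
    img = img.copy()
    h, w = img.shape[:2]
    if rng.random() > p:
        return img                              # no erasing this time
    for _ in range(100):                        # retry until the rectangle fits
        area = rng.uniform(*area_range) * h * w
        aspect = rng.uniform(*aspect_range)
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if eh < h and ew < w:
            top = rng.integers(0, h - eh)
            left = rng.integers(0, w - ew)
            # fill the selected region with random values
            img[top:top + eh, left:left + ew] = rng.uniform(0, 255, size=(eh, ew, img.shape[2]))
            return img
    return img

augmented = random_erase(np.zeros((64, 64, 3)), p=1.0)
```

The mean-pixel fill variant mentioned above would simply replace the `rng.uniform(0, 255, ...)` fill with a constant.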


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
  • This paper by Dosovitskiy et al. from Google Brain in ICLR 2021 shows that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • Inspired by the Transformer scaling successes in NLP, they experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, they split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. They train the model on image classification in supervised fashion (as shown in the figure below from the paper).
  • They introduce three ViT configurations (Base, Large, and Huge), with ViT-H/14 and ViT-L/16 as their flagship models. In the notation ViT-C/N, C indicates the model size and N the input patch size; for instance, ViT-L/16 means the “Large” variant with \(16 \times 16\) input patch size.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the proposed Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
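The patch-embedding pipeline described above is easy to sketch. In the numpy version below, the projection matrix, [CLS] token, and position embeddings are random stand-ins for the learned parameters.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Minimal sketch of ViT's input pipeline: crop to a multiple of the patch
    size, cut into non-overlapping patches, and flatten each patch.
    """
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
patches = patchify(img, 16)                  # (196, 768) for a ViT-*/16 input
W = rng.standard_normal((768, 768)) * 0.02   # learned linear projection (stand-in)
tokens = patches @ W                         # (196, 768) patch embeddings
cls = np.zeros((1, 768))                     # learnable [CLS] token (stand-in)
pos = rng.standard_normal((197, 768)) * 0.02 # learned position embeddings (stand-in)
seq = np.concatenate([cls, tokens]) + pos    # Transformer input sequence
```

The resulting 197-token sequence (196 patches plus [CLS]) is what the standard Transformer encoder consumes.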

RepVGG: Making VGG-style ConvNets Great Again
  • This paper by Ding et al. from Tsinghua University, MEGVII Technology, HKUST, and Aberystwyth University in CVPR 2021 presents Re-parameterization VGG (RepVGG), a simple but powerful convolutional neural network architecture. At inference time, RepVGG is a plain stack of \(3 \times 3\) convolutions and ReLU, which is especially suitable for GPUs and specialized inference chips, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architectures is realized by a structural re-parameterization technique, hence the name RepVGG.
  • The figure below from the paper shows a sketch of RepVGG architecture. RepVGG has 5 stages and conducts down-sampling via stride-2 convolution at the beginning of a stage. Here, only the first 4 layers of a specific stage are shown. Inspired by ResNet, RepVGG also uses identity and \(1 \times 1\) branches, but only for training.

  • On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model.
  • On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
  • The figure below from the paper shows the Top-1 accuracy on ImageNet vs. actual speed. Left: lightweight and middleweight RepVGG and baselines trained in 120 epochs. Right: heavyweight models trained in 200 epochs. The speed is tested on the same 1080Ti with a batch size of 128, full precision (fp32), single crop, and measured in examples/second. The input resolution is 300 for EfficientNet-B3 and 224 for the others.
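The branch-fusion trick can be illustrated directly. The numpy sketch below folds the \(1 \times 1\) and identity branches into the \(3 \times 3\) kernel and checks equivalence at one output position; BatchNorm folding, which RepVGG also performs on each branch before fusing, is omitted for brevity.

```python
import numpy as np

def reparameterize(k3, b3, k1, b1, channels):
    """Fuse RepVGG's 3x3, 1x1, and identity branches into one 3x3 kernel.

    k3: (out, in, 3, 3) kernel, k1: (out, in, 1, 1) kernel, b3/b1: biases.
    The 1x1 kernel is zero-padded into the center of a 3x3 kernel, and the
    identity branch is a 3x3 kernel with ones at the center of channel (c, c).
    """
    fused_k = k3.copy()
    fused_k[:, :, 1, 1] += k1[:, :, 0, 0]      # 1x1 branch, padded to 3x3
    for c in range(channels):                  # identity branch as a 3x3 kernel
        fused_k[c, c, 1, 1] += 1.0
    return fused_k, b3 + b1

C = 4
rng = np.random.default_rng(0)
k3, k1 = rng.standard_normal((C, C, 3, 3)), rng.standard_normal((C, C, 1, 1))
b3, b1 = rng.standard_normal(C), rng.standard_normal(C)
fused_k, fused_b = reparameterize(k3, b3, k1, b1, C)

# Equivalence check for one random 3x3 input patch at one output position:
x = rng.standard_normal((C, 3, 3))
out_fused = np.einsum('oihw,ihw->o', fused_k, x) + fused_b
out_branches = (np.einsum('oihw,ihw->o', k3, x) + b3   # 3x3 branch
                + k1[:, :, 0, 0] @ x[:, 1, 1] + b1     # 1x1 branch
                + x[:, 1, 1])                          # identity branch
```

After fusion, the multi-branch training-time block collapses into a single \(3 \times 3\) convolution with bias, which is exactly what runs at inference time.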

ArcFace: Additive Angular Margin Loss for Deep Face Recognition
  • Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability.
  • This paper by Deng et al. in TPAMI introduces an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power.
  • Since ArcFace is susceptible to massive label noise, they further propose sub-center ArcFace, in which each class contains \(K\) sub-centers and training samples only need to be close to any of the \(K\) positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces.
  • Based on this self-propelled isolation, they boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, they also explore the inverse problem, mapping feature vectors to face images.
  • Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
  • The figure below from the paper shows the comparisons of Triplet, Tuplet, ArcFace and sub-center ArcFace. Triplet and Tuplet conduct local sample-to-sample comparisons with Euclidean margins within the mini-batch. By contrast, ArcFace and sub-center ArcFace conduct global sample-to-class and sample-to-subclass comparisons with angular margins.

  • The figure below from the paper illustrates training the deep face recognition model with the proposed ArcFace loss \((K=1)\) and sub-center ArcFace loss (e.g., \(K=3\)). Based on a \(\ell_2\) normalization step on both embedding feature \(\mathbf{x}_i \in \mathbb{R}^{512}\) and all sub-centers \(W \in \mathbb{R}^{512 \times N \times K}\), they get the subclass-wise similarity score \(\mathcal{S} \in \mathbb{R}^{N \times K}\) by a matrix multiplication \(W^T \mathbf{x}_i\). After a max pooling step, they can easily get the class-wise similarity score \(\mathcal{S}^{\prime} \in \mathbb{R}^{N \times 1}\). Afterwards, they calculate \(\theta_{y_i} = \arccos\left(\mathcal{S}^{\prime}_{y_i}\right)\), the angle between the feature \(\mathbf{x}_i\) and the ground truth center \(W_{y_i}\). Then, they add an angular margin penalty \(m\) on the target (ground truth) angle \(\theta_{y_i}\). After that, they calculate \(\cos \left(\theta_{y_i}+m\right)\) and multiply all logits by the feature scale \(s\). Finally, the logits go through the softmax function and contribute to the cross entropy loss.
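The logit computation described above fits in a few lines. The following is a minimal numpy sketch for a single embedding with \(K=1\) (no sub-centers), using \(s=64\) and \(m=0.5\) as illustrative defaults.

```python
import numpy as np

def arcface_logits(x, W, y, s=64.0, m=0.5):
    """ArcFace logits for one embedding (K=1, no sub-centers).

    l2-normalize the feature and the class centers, take arccos to get the
    target angle, add the angular margin m, and rescale everything by s.
    The result is fed to softmax cross-entropy.
    """
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(W.T @ x, -1.0, 1.0)          # class-wise similarity scores
    logits = cos.copy()
    logits[y] = np.cos(np.arccos(cos[y]) + m)  # margin penalty on the target class
    return s * logits

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                   # embedding feature
W = rng.standard_normal((512, 10))             # centers for N=10 classes
cos_target = (x / np.linalg.norm(x)) @ (W[:, 3] / np.linalg.norm(W[:, 3]))
logits = arcface_logits(x, W, y=3)
```

The margin only shrinks the target-class logit, which is what forces embeddings of the same identity closer together in angle space.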

Do Vision Transformers See Like Convolutional Neural Networks?
  • Given the central role of convolutional neural networks in computer vision breakthroughs (leading to them being the de-facto model for visual data), it is remarkable that Transformer architectures (almost identical to those used in language) are capable of similar performance. For instance, recent work has shown that the Vision Transformer (ViT) model can achieve comparable or even superior performance on image classification tasks. This raises fundamental questions on whether these architectures work in the same way as CNNs: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations?
  • This paper by Raghu et al. from Google Brain in 2021 analyzes the internal representation structure of ViTs and CNNs on image classification benchmarks, and finds striking differences in the features and internal structures between the two architectures, such as ViT having more uniform representations across all layers. They explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information (“earlier global features”), and ViT residual connections, which offer representation propagation of features from lower to higher layers, while also revealing that some CNN properties, e.g. local information aggregation at lower layers, are important to ViTs, being learned from scratch at scale.
  • They also examine the potential for ViTs to be used beyond classification through a study of spatial localization, discovering ViTs successfully preserve input spatial information with CLS tokens, which is promising for future uses in object detection.
  • Finally, they investigate the effect of scale for transfer learning, finding larger ViT models develop significantly stronger intermediate representations through larger pretraining datasets. These results are also very pertinent to understanding recent architectures for vision such as the MLP-Mixer.
BEiT: BERT Pre-Training of Image Transformers
  • This paper by Bao et al. from Microsoft Research in 2021 introduces a self-supervised pre-trained representation model called BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT, developed in the natural language processing area, they propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views during pre-training: image patches (e.g., \(16 \times 16\) pixels), whose embeddings are calculated as linear projections of the flattened patches, and visual tokens (i.e., discrete tokens). Before pre-training, they learn a discrete variational autoencoder (dVAE) which acts as an “image tokenizer” learnt via autoencoding-style reconstruction; the input image is tokenized into discrete visual tokens given by the latent codes of the dVAE (the one used in DALL-E; Ramesh et al., 2021) according to the learned vocabulary.
  • They show that the proposed method is critical to make BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers. They also present the intriguing property of automatically acquired knowledge about semantic regions, without using any human-annotated data.
  • Similar to the masked language modeling pre-training task of BERT, BEiT randomly masks some image patches and feeds them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
  • After pre-training BEiT, they directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
  • Experimental results on image classification and semantic segmentation show that BEiT achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
  • Code and pretrained models are here.
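The pre-training objective reduces to a masked cross-entropy over the dVAE vocabulary. A minimal numpy sketch, with random arrays standing in for the Transformer's predictions and the tokenizer's output:

```python
import numpy as np

def beit_mim_loss(logits, visual_tokens, mask):
    """Masked-image-modeling loss, sketched with numpy.

    logits: (num_patches, vocab) predictions from the backbone Transformer;
    visual_tokens: (num_patches,) dVAE token ids of the original image;
    mask: boolean (num_patches,), True where the patch was corrupted.
    As in BERT, cross-entropy is computed on masked positions only.
    """
    z = logits - logits.max(axis=1, keepdims=True)            # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(visual_tokens)), visual_tokens]
    return -picked[mask].mean()

rng = np.random.default_rng(0)
num_patches, vocab = 196, 8192                 # 14x14 patches, dVAE vocabulary size
logits = rng.standard_normal((num_patches, vocab))
tokens = rng.integers(0, vocab, size=num_patches)
mask = rng.random(num_patches) < 0.4           # fraction of patches masked (illustrative)
loss = beit_mim_loss(logits, tokens, mask)
```

The actual BEiT recipe uses blockwise masking rather than the i.i.d. masking shown here; only the loss shape is the point of this sketch.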

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • This paper by Liu et al. from Microsoft Research in 2021 presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision by producing a hierarchical feature representation and offers a linear computational complexity with respect to input image size. The key element of the Swin Transformer is the shifted window based self-attention.
  • The Swin transformer aims to address the challenges in adapting Transformer from language to vision which arise due to differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, they propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
  • This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including ImageNet image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as COCO object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and ADE20K semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
  • Code and pretrained models are here.
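The shifted-window mechanism can be sketched in numpy. The minimal version below shows window partitioning and the cyclic shift between consecutive blocks; it omits the attention mask Swin uses to keep wrapped-around regions from attending to each other.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)

# Regular windows, then shifted windows: cyclically shifting the map by
# ws // 2 before partitioning makes the next block's windows straddle the
# previous block's window borders, creating cross-window connections.
ws, shift = 7, 7 // 2
feat = np.random.default_rng(0).standard_normal((56, 56, 96))  # Swin-T stage-1 size
windows = window_partition(feat, ws)                           # (64, 49, 96)
shifted_windows = window_partition(np.roll(feat, (-shift, -shift), axis=(0, 1)), ws)
```

Self-attention is then computed independently inside each 49-token window, which is what gives the linear complexity in image size mentioned above.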
CvT: Introducing Convolutions to Vision Transformers
  • This paper by Wu et al. from McGill and Microsoft in 2021 proposes the Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs for image recognition tasks.
  • This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization).
  • They validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs.
  • In addition, performance gains are maintained when pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, the CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set.
  • Furthermore, their results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in their model, giving it a potential advantage for adaptation to a wide range of vision tasks requiring variable input resolution. Thanks to the built-in local context structure introduced by convolutions, CvT no longer requires a positional embedding.
  • CvTs thus introduce convolutions into the Vision Transformer architecture to merge the benefits of Transformers with the benefits of CNNs and demonstrate that the introduced convolutional token embedding and convolutional projection, along with the multi-stage design of the network enabled by convolutions, enable CvT to achieve superior performance while maintaining computational efficiency.
  • Code and pretrained models are here.
An Empirical Study of Training Self-Supervised Vision Transformers
  • While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging.
  • This paper by Chen et al. from Facebook AI in ICCV 2021 studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT).
  • They go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. Their comparisons concern several aspects, including ViT vs. convolutional networks, supervised vs. self-supervised, and contrastive learning vs. masked auto-encoding.
  • They observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. They reveal that these results are indeed partial failures, and that they can be improved when training is made more stable.
  • They introduce “MoCo v3”, a framework which offers an incremental improvement over MoCo v1/2 and strikes a better balance between simplicity, accuracy, and scalability. The pseudocode of MoCo v3 is as below:

  • They benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. They discuss the currently positive evidence as well as challenges and open questions.
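The symmetrized contrastive loss at the heart of the MoCo v3 pseudocode can be paraphrased in numpy as follows. The arrays below are random stand-ins for the query-encoder and momentum-encoder outputs, and the \(2\tau\) scaling follows the paper's listing.

```python
import numpy as np

def ctr(q, k, tau=0.2):
    """InfoNCE contrastive loss: each query matches the key of the same image.

    q, k: (N, D) batches of encoder outputs; after l2-normalization, the
    positives lie on the diagonal of the (N, N) similarity matrix.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau
    z = logits - logits.max(axis=1, keepdims=True)            # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs)) * 2 * tau             # 2*tau scale, as in the paper

rng = np.random.default_rng(0)
q1, q2 = rng.standard_normal((8, 32)), rng.standard_normal((8, 32))  # queries, two views
k1, k2 = rng.standard_normal((8, 32)), rng.standard_normal((8, 32))  # momentum-encoder keys
loss = ctr(q1, k2) + ctr(q2, k1)               # symmetrized across the two augmentations
```

In the real training loop, `q1, q2` come from the backbone plus projection/prediction MLPs on two augmentations of each image, and `k1, k2` from the momentum encoder on the same augmentations.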
Diffusion Models Beat GANs on Image Synthesis
  • This paper by Dhariwal and Nichol from OpenAI in 2021 shows that diffusion models, a class of likelihood-based models with a stationary training objective, can achieve image sample quality superior to the current state-of-the-art generative models.
  • They achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, they further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier.
  • These guided diffusion models can reduce the sampling time gap between GANs and diffusion models, although diffusion models still require multiple forward passes during sampling. Finally, by combining guidance with upsampling, they can further improve sample quality on high-resolution conditional image synthesis.
  • They achieve an FID of 2.97 on ImageNet \(128 \times 128\), 4.59 on ImageNet \(256 \times 256\), and 7.72 on ImageNet \(512 \times 512\), and match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
  • Finally, they find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet \(256 \times 256\) and 3.85 on ImageNet \(512 \times 512\).
  • Code.
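Classifier guidance itself is a one-line modification of the reverse-process mean. Below is a sketch assuming a diagonal covariance, with a random stand-in for the classifier's input gradient:

```python
import numpy as np

def guided_mean(mu, sigma2, grad_log_p_y, scale=1.0):
    """Shift the Gaussian denoising mean along the classifier gradient.

    Sketch of classifier guidance: the reverse-process mean mu is perturbed
    by scale * Sigma * grad_x log p(y | x). Here Sigma is diagonal (sigma2),
    and larger `scale` trades diversity for fidelity.
    """
    return mu + scale * sigma2 * grad_log_p_y

rng = np.random.default_rng(0)
mu = rng.standard_normal(16)           # reverse-process mean for one latent
sigma2 = np.full(16, 0.05)             # diagonal covariance at this timestep
grad = rng.standard_normal(16)         # classifier gradient (stand-in)
new_mu = guided_mean(mu, sigma2, grad, scale=2.0)
sample = new_mu + np.sqrt(sigma2) * rng.standard_normal(16)   # one sampling step
```

In the full sampler this shift is applied at every denoising step, with the gradient coming from a noise-aware classifier evaluated at the current noisy sample.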
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  • Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity.
  • This paper by Nichol et al. from OpenAI in 2021 explores diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
  • They find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking.
  • Additionally, they find that their models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
  • Code.
Multiscale Vision Transformers
  • This paper by Fan et al. from Facebook AI and UC Berkeley presents Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
  • Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
  • They evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters.
  • They further remove the temporal dimension and apply their model for image classification where it outperforms prior work on vision transformers.
  • The figure below from the paper shows that Multiscale Vision Transformers learn a hierarchy from dense (in space) and simple (in channels) to coarse and complex features. Several resolution-channel scale stages progressively increase the channel capacity of the intermediate latent sequence while reducing its length and thereby spatial resolution.

Score-Based Generative Modeling through Stochastic Differential Equations
  • Creating noise from data is easy; creating data from noise is generative modeling.
  • This paper by Song et al. from Stanford and Google Brain in ICLR 2021 presents a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise.
  • Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, they can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. They show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities.
  • In particular, they introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE.
  • They also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, they provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, they achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of \(1024 \times 1024\) images for the first time from a score-based generative model.
  • The figure below from the paper shows that solving a reverse-time SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with a continuous-time SDE. This SDE can be reversed if they know the score of the distribution at each intermediate time step, \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\).
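A minimal Euler-Maruyama sampler for the reverse-time SDE can be sketched as below, for a VP-type forward SDE with a known analytic score; in practice a trained score network replaces the lambda.

```python
import numpy as np

def reverse_sde_sample(score, n_steps=1000, dim=2, beta=1.0, seed=0):
    """Euler-Maruyama discretization of the reverse-time SDE.

    Sketch for the VP-type forward SDE dx = -0.5*beta*x dt + sqrt(beta) dw.
    The reverse dynamics dx = [-0.5*beta*x - beta*score(x, t)] dt
    + sqrt(beta) dw-bar are integrated from t=1 down to t=0, starting
    from the prior N(0, I).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        drift = -0.5 * beta * x - beta * score(x, t)
        # backward time step (delta t = -dt) plus diffusion noise
        x = x + drift * (-dt) + np.sqrt(beta * dt) * rng.standard_normal(dim)
    return x

# Toy case: for data ~ N(0, I) the VP marginals stay N(0, I) at every t,
# so the true score is simply -x.
sample = reverse_sde_sample(lambda x, t: -x)
```

A predictor-corrector sampler, as proposed in the paper, would interleave such Euler steps with Langevin-dynamics corrector steps that use the same score.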

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
  • Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged.
  • This paper by Zheng et al. in CVPR 2021 proposes SEgmentation TRansformer (SETR), which utilizes a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches, and aims to provide an alternative perspective on the segmentation problem by treating semantic segmentation as a sequence-to-sequence prediction task. Thanks to the Transformer self-attention architecture, which models global context in every layer, the encoder can be combined with a simple decoder to provide a powerful segmentation model.
  • Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, they achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
  • The figure below from the paper shows a schematic illustration of the proposed SETR model; (a) They first split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to the standard Transformer encoder. To perform pixel-wise segmentation, they introduce different decoder designs: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

Scaling Vision with Sparse Mixture of Experts
  • Almost all prevalent computer vision models are “dense,” that is, every input is processed by every parameter.
  • This paper by Riquelme et al. from Google Brain introduces the Vision Mixture of Experts (V-MoE), a novel approach for scaling vision models. The V-MoE is a sparsely activated version of the Vision Transformer (ViT) that demonstrates scalability and competitiveness with larger dense networks in image recognition tasks.
  • The paper proposes a sparse variant of the Vision Transformer (ViT) that uses a mixture-of-experts architecture. This approach routes each image patch to a subset of experts, making it possible to scale up to 15B parameters while matching the performance of state-of-the-art dense models.
  • An innovative extension to the routing algorithm is presented, allowing prioritization of subsets of each input across the entire batch. This adaptive per-image compute leads to a trade-off between performance and computational efficiency during inference.
  • The figure below from the paper shows an overview of the architecture. V-MoE is composed of \(L\) ViT blocks. In some, we replace the MLP with a sparsely activated mixture of MLPs. Each MLP (the expert) is stored on a separate device, and processes a fixed number of tokens. The communication of these tokens between devices is shown in this example, which depicts the case when \(k=1\) expert is selected per token. Here each expert uses a capacity ratio \(C=\frac{4}{3}\): the sparse MoE layer receives 12 tokens per device, but each expert has capacity for 16 tokens \(\left(\frac{16 \cdot 1}{12}=\frac{4}{3}\right)\). Non-expert components of V-MoE such as routers, attention layers and normal MLP blocks are replicated identically across devices.

  • The V-MoE shows impressive scalability, successfully trained up to 15B parameters, and demonstrates strong performance, including 90.35% accuracy on ImageNet.
  • The paper explores the transfer learning abilities of V-MoE, showing its adaptability and effectiveness across different tasks and datasets, even with limited data.
  • A detailed analysis of the V-MoE’s routing decisions and the behavior of its experts is provided, offering insights into the model’s internal workings and guiding future improvements.
  • V-MoE models require less computational resources than dense counterparts, both in training and inference, thanks to their sparsely activated nature and the efficient use of the Batch Prioritized Routing algorithm.
  • The paper concludes with the potential of sparse conditional computation in vision tasks, emphasizing the environmental benefits due to reduced CO2 emissions and the promising directions for future research in large-scale multimodal or video modeling.
  • The paper represents a significant advancement in the field of computer vision, particularly in the development of scalable and efficient vision models.
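The sparse routing described above can be sketched with numpy. The version below is a stand-in for V-MoE's router: each token's gate is a softmax over experts, its top-\(k\) experts process it subject to a fixed per-expert buffer, and overflowing tokens are dropped. Batch Prioritized Routing, which first reorders tokens by gate weight, is omitted.

```python
import numpy as np

def route_tokens(x, Wg, k=1, capacity_ratio=1.0):
    """Top-k token-to-expert routing with a fixed per-expert capacity.

    x: (n_tokens, d) token representations; Wg: (d, n_experts) router weights.
    Returns the gate matrix, the list of token ids assigned to each expert,
    and the per-expert capacity.
    """
    n_tokens, n_experts = x.shape[0], Wg.shape[1]
    logits = x @ Wg
    gates = np.exp(logits - logits.max(1, keepdims=True))
    gates /= gates.sum(1, keepdims=True)                       # softmax gates
    capacity = int(np.ceil(k * n_tokens / n_experts * capacity_ratio))
    assignment = [[] for _ in range(n_experts)]
    for t in range(n_tokens):                   # tokens claim experts in order
        for e in np.argsort(-gates[t])[:k]:
            if len(assignment[e]) < capacity:   # drop the token if the buffer is full
                assignment[e].append(t)
    return gates, assignment, capacity

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 16))               # 12 tokens, 16-dim
Wg = rng.standard_normal((16, 4))               # router weights for 4 experts
gates, assignment, capacity = route_tokens(x, Wg, k=1)
```

Each expert then processes only its assigned tokens, and outputs are recombined weighted by the corresponding gate values; lowering `capacity_ratio` at inference time gives the adaptive compute trade-off discussed above.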
MLP-Mixer: An all-MLP Architecture for Vision
  • This paper by Tolstikhin et al. from Google Brain introduces MLP-Mixer, a novel architecture for vision tasks that eschews convolutions and self-attention in favor of multi-layer perceptrons (MLPs). The architecture comprises two types of layers: channel-mixing MLPs, which operate on individual image patches (tokens) and allow for communication between different channels, and token-mixing MLPs, which facilitate communication between different spatial locations by operating on each channel independently.
  • The MLP-Mixer architecture is designed to maintain the input’s dimensionality (patches \(\times\) channels) throughout, with interleaved channel- and token-mixing layers to enable both intra- and inter-patch interactions. The model employs skip-connections, dropout, and LayerNorm, with a final classifier head consisting of global average pooling followed by a fully-connected layer.
  • The figure below from the paper illustrates that MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, and layer norm on the channels.

  • When trained on large-scale datasets or using modern regularization techniques, MLP-Mixer achieves competitive performance on image classification benchmarks compared to state-of-the-art models like CNNs and Transformers. Notably, it attains 87.94% top-1 validation accuracy on ImageNet when pre-trained on a dataset of ~100M images.
  • Experiments demonstrate the model’s robustness to input permutations and its ability to learn without the inductive biases present in convolutional and attention-based models. The paper discusses potential improvements through untying parameters within token-mixing MLPs, grouping channels for token mixing, and employing pyramid-like structures for scaling.
  • The MLP-Mixer code is provided in JAX/Flax, showcasing its simplicity and the straightforward implementation of its constituent MLP blocks and mixer layers.
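Since the reference code is in JAX/Flax, the following is only a numpy paraphrase of one Mixer layer with random stand-in weights, showing the transpose between token mixing and channel mixing:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def mixer_layer(x, W_tok, W_ch):
    """One Mixer layer on an (n_patches, n_channels) input table.

    The token-mixing MLP acts across patches (on the transposed table), then
    the channel-mixing MLP acts across channels; both use a GELU, LayerNorm,
    and skip-connections, as in the figure above.
    """
    (w1, w2), (w3, w4) = W_tok, W_ch
    y = layer_norm(x).T                        # (channels, patches)
    x = x + (gelu(y @ w1) @ w2).T              # token-mixing MLP + skip
    y = layer_norm(x)
    return x + gelu(y @ w3) @ w4               # channel-mixing MLP + skip

rng = np.random.default_rng(0)
patches, channels, hidden = 196, 64, 128
x = rng.standard_normal((patches, channels))
W_tok = (rng.standard_normal((patches, hidden)) * 0.02,
         rng.standard_normal((hidden, patches)) * 0.02)
W_ch = (rng.standard_normal((channels, hidden)) * 0.02,
        rng.standard_normal((hidden, channels)) * 0.02)
out = mixer_layer(x, W_tok, W_ch)
```

Note that the token-mixing weights are shared across channels and the channel-mixing weights across patches, which is what keeps the parameter count independent of one of the two axes.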


A ConvNet for the 2020s
  • This paper by FAIR and UC Berkeley seeks to refute the recent apparent superiority of Transformers by re-examining the design of ConvNets and testing their limitations. The proposed approach is based on gradually modifying a standard ResNet50, following design choices closely inspired by Vision Transformer, to propose a new family of pure ConvNets called ConvNeXt, which can perform as well as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks.
  • The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
  • However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
  • In this paper, the authors reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
  • The authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. They implement a series of design decisions starting with a ResNet50 trained with up-to-date techniques (extending the number of epochs, using AdamW optimizer, Stochastic Depth, Label Smoothing, and so on):
    • Macro Design: The authors considered two aspects of Swin Transformers’ macro design. The first is the number of blocks in each stage (stage compute ratio), which was adjusted from (4, 4, 6, 3) to (3, 3, 9, 3), following the Swin Transformer ratio of (1:1:3:1). The second is the stem cell configuration, which in the original ResNet50 consisted of a 7\(\times\)7 convolution with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer which utilizes 4\(\times\)4 non-overlapping convolutions with stride 4. These modifications improved the accuracy to 79.5%.
    • ResNeXt: In this part, the authors adopt two design choices of the popular ResNeXt: depthwise convolutions, which are interestingly similar to self-attention as they work on a per-channel basis, and a higher number of channels (from 64 to 96). These modifications improved the accuracy to 80.5%.
    • Inverted Bottleneck: An essential configuration of Transformers is the expansion-compression rate in the MLP block (the hidden dimension is 4 times higher than the input and output dimension). This feature was reproduced by adding the inverted bottleneck design used in ConvNets (where the input is expanded using \(1 \times 1\) convolutions and then shrunk through depthwise convolution and \(1 \times 1\) convolutions). This modification slightly improved the accuracy to 80.6%.
    • Large kernel sizes: The gold standard in ConvNets since the advent of VGG has been the 3\(\times\)3 kernel. Small kernels lead to the famous local receptive field, which, compared to global self-attention, has a more limited area of focus. Although Swin Transformers reintroduced the concept of local attention, their window size has always been at least \(7 \times 7\). To explore larger kernels, the first step is to move the depthwise convolution up, before the \(1 \times 1\) convolutions, to reduce the number of channels processed by such an expensive operation. This first modification resulted in a temporary degradation to 79.9%, but, after experimenting with different sizes, a \(7 \times 7\) kernel (larger values brought no further improvement) allowed the authors to reach an accuracy of 80.6% again.
    • Micro Design: Finally, some micro design choices were added: GELU instead of ReLU, a single activation for each block (the original Transformer block has just one activation, inside its MLP), fewer normalization layers, Batch Normalization substituted by Layer Normalization, and a separate downsampling layer.
    • These modifications improved the accuracy to 82.0% and defined the final model, named ConvNeXt.
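  • The resulting block design can be sketched in a few lines of NumPy. This is an illustrative toy (a single block with simplified shapes, no training, and plain matmuls standing in for the \(1 \times 1\) convolutions), not the authors’ implementation:

```python
import numpy as np

def depthwise_conv(x, w):
    # x: (H, W, C); w: (k, k, C) -- one kxk filter per channel (stride 1, 'same' padding)
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.einsum('klc,klc->c', xp[i:i + k, j:j + k], w)
    return out

def layer_norm(x, eps=1e-6):
    # normalize over the channel dimension, as in the micro design above
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def convnext_block(x, w_dw, w_up, w_down):
    # depthwise 7x7 -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 shrink, plus residual;
    # a 1x1 convolution is a per-position linear layer, hence a plain matmul
    y = depthwise_conv(x, w_dw)
    y = layer_norm(y)
    y = gelu(y @ w_up)
    y = y @ w_down
    return x + y
```

Note how the depthwise convolution comes first and the channel mixing happens in the inverted-bottleneck MLP, mirroring a Transformer block’s attention-then-MLP split.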
  • A comparison of this architecture with the Swin Transformer and ResNet is shown in the figure below.

  • Based entirely on convolutions, this model competed on par with Transformer-based architectures, achieving a top-1 accuracy of 87.8% on ImageNet classification. Equally excellent results were obtained in other tasks, such as object detection and segmentation on COCO and semantic segmentation on ADE20K.
  • The idea of modernizing ConvNets, adding all the concepts introduced over the past decade to a single model, is a comeback for convolutions, which have lately been sidelined in favor of Transformers. The authors suggest that ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. They believe the architecture choice should meet the needs of the task at hand while striving for simplicity and efficiency.
  • Code.
Natural Language Descriptions of Deep Visual Features
  • Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible?
  • This paper by Hernandez et al. from MIT, Northeastern and Allegheny College in 2022 proposes MILAN, for mutual-information-guided linguistic annotation of neurons, which aims to generate open-ended, compositional, natural language descriptions of individual neurons in deep networks.
  • Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models.
  • They highlight three applications of natural language neuron descriptions.
    • First, they use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
    • Second, they use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
    • Finally, they use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
  • MarkTechPost link.
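  • The selection criterion can be illustrated with a toy sketch: given candidate descriptions and log-probabilities from the two learned models, the description maximizing pointwise mutual information is chosen. All names and numbers below are illustrative, not from the paper:

```python
def best_description(candidates, log_p_given_region, log_p_prior):
    # MILAN picks the description d maximizing pointwise mutual information
    # PMI(d; region) = log p(d | region) - log p(d): generic strings with a
    # high prior probability are penalized relative to specific ones.
    pmi = {d: log_p_given_region[d] - log_p_prior[d] for d in candidates}
    return max(pmi, key=pmi.get)
```

The prior term is what steers the search away from vacuous descriptions that would fit almost any set of active image regions.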
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
  • Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks.
  • This paper by Goyal et al. in 2022 from FAIR asks whether, using this ability, models can learn salient and more representative information from a diverse, unbounded set of images from across the globe. To do so, they train models on billions of random images without any data pre-processing or prior assumptions about what they want the model to learn. This is a very large-scale experiment in which a RegNet architecture scaled to a dense 10 billion parameters (to avoid underfitting on a large data size) is pre-trained using the SwAV self-supervised method on a large collection of 1 billion randomly selected public images from Instagram with a diversity of gender, ethnicity, cultures, and locations (all outside the EU because of GDPR).
  • They achieve state-of-the-art results on a majority of 50 transfer tasks, including fairness, robustness to distribution shift, geographical diversity, fine-grained classification, image copy detection and many image classification datasets. The resulting model not only captures semantic information well, it also captures information about artistic style and learns salient information such as geo-locations and multilingual word embeddings based on visual content only.
  • The key takeaway is that large-scale self-supervised pre-training yields more robust, fair, less harmful, and less biased results than supervised models or models trained on object-centric datasets such as ImageNet.
Block-NeRF: Scalable Large Scene Neural View Synthesis
  • This paper by Tancik et al. from UC Berkeley, Waymo and Google Research in 2022 presents Block-NeRF, a variant of Neural Radiance Fields (NeRFs) that can reconstruct large-scale environments.
  • They demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs that can be optimized independently. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.
  • At such a scale, the data collected will necessarily have transient objects and variations in appearance, which they account for by modifying the underlying NeRF architecture to make NeRF robust to data captured over months under different environmental conditions. They add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined.
  • They demonstrate the method’s efficacy by building an entire neighborhood in San Francisco from 2.8M images using a grid of Block-NeRFs, forming the largest neural scene representation to date.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
  • This paper by Bardes et al. from FAIR, Inria, École normale supérieure, CNRS, PSL Research University, and NYU, in ICLR 2022, introduces Variance-Invariance-Covariance Regularization (VICReg), a novel self-supervised learning method that tackles the challenge of the collapse problem in image representation learning, where encoders may output constant or trivial vectors, by introducing a simple regularization term on the variance of the embeddings along each dimension individually. Unlike other methods that rely on implicit biases in the architecture, often lacking clear justification or interpretation, VICReg provides an explicit approach to prevent collapse.
  • VICReg distinguishes itself by explicitly avoiding the collapse problem through two innovative regularization terms applied to the embeddings: a variance term to ensure each embedding dimension maintains variance above a specific threshold, and a covariance term to decorrelate pairs of embedding variables. This method simplifies the architecture by eliminating the need for complexities such as weight sharing, batch normalization, feature-wise normalization, output quantization, stop gradient, memory banks, etc., commonly found in other self-supervised learning approaches.
  • Employing a joint embedding architecture with encoders and expanders, VICReg utilizes ResNet-50 backbones for encoders and fully-connected layers for expanders. It aims to preserve the information content in the embeddings by enforcing invariance to input transformations, maintaining variance across dimensions, and minimizing covariance to ensure embedding variables carry independent information.
  • The figure below from the paper shows VICReg: joint embedding architecture with variance, invariance and covariance regularization. Given a batch of images I, two batches of different views \(X\) and \(X'\) are produced and are then encoded into representations \(Y\) and \(Y'\). The representations are fed to an expander producing the embeddings \(Z\) and \(Z'\). The distance between two embeddings from the same image is minimized, the variance of each embedding variable over a batch is maintained above a threshold, and the covariance between pairs of embedding variables over a batch is attracted to zero, decorrelating the variables from each other. Although the two branches do not require identical architectures nor share weights, in most of the experiments they are Siamese with shared weights: the encoders are ResNet-50 backbones with output dimension 2048. The expanders have 3 fully-connected layers of size 8192.

  • The method has been rigorously evaluated against several benchmarks, showcasing competitive performance on downstream tasks without necessitating large batch sizes, memory banks, or contrastive samples. Moreover, VICReg’s introduction of a variance term into other self-supervised methods has been shown to enhance training stability and improve performance.
  • An extensive analysis underscores VICReg’s robustness and adaptability, including its effectiveness in multi-modal scenarios and its advantage in maintaining independent embeddings across different architectural setups. VICReg’s methodological approach, based on a triple objective of learning invariance, avoiding representation collapse through variance preservation, and maximizing information content via covariance regularization, sets a new standard in self-supervised learning. It demonstrates that the method is not limited by the constraints of identical or similar embedding branches, opening up new possibilities for a wide range of applications beyond conventional single-modality image representation learning tasks.
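  • The three regularization terms are straightforward to write down. The following NumPy sketch is illustrative, not the authors’ code; the coefficients mirror the paper’s defaults (25 for invariance and variance, 1 for covariance), and the threshold \(\gamma = 1\):

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same N images
    n, d = z_a.shape
    # invariance: mean squared distance between paired embeddings
    inv = np.mean((z_a - z_b) ** 2)
    # variance: hinge keeping each dimension's std above gamma (anti-collapse)
    def var_term(z):
        return np.mean(np.maximum(0.0, gamma - np.sqrt(z.var(axis=0) + eps)))
    # covariance: push off-diagonal covariances toward zero (decorrelation)
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        return ((cov - np.diag(np.diag(cov))) ** 2).sum() / d
    return (sim_w * inv
            + var_w * (var_term(z_a) + var_term(z_b))
            + cov_w * (cov_term(z_a) + cov_term(z_b)))
```

A collapsed (constant) embedding drives the variance hinge to its maximum, which is precisely the failure mode the method is designed to rule out explicitly.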
Masked Autoencoders Are Scalable Vision Learners
  • Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning. In this study, they observe on ImageNet and in transfer learning that an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits. Self-supervised learning in vision may thus now be embarking on a similar trajectory as in NLP.
  • This paper by He et al. from Facebook AI Research introduces Masked Autoencoders (MAE) as a scalable, self-supervised learning approach for computer vision (referred to as ViTMAE), drawing parallels to the trajectory of self-supervised learning in NLP.
  • MAE employs a simple yet effective strategy: masking random patches of the input image and reconstructing the missing pixels, a method analogous to techniques in NLP but adapted for the visual domain.
  • The asymmetric encoder-decoder architecture is a core design feature of MAE. The encoder operates only on the visible subset of image patches, significantly reducing the computational load. In contrast, the decoder, being lightweight, reconstructs the original image from the latent representation and mask tokens.
  • Another notable aspect of MAE is the high proportion of input image masking (e.g., 75%). This strategy creates a nontrivial and meaningful self-supervisory task, enabling the model to infer complex, holistic reconstructions and learn various visual concepts.
  • The encoder, based on Vision Transformer (ViT), processes only unmasked patches, optimizing computational efficiency. This approach allows for scaling up model size while maintaining practical training speeds (up to 3 times faster).
  • The following figure from the paper shows the MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images (full sets of patches) for recognition tasks.

  • The loss function utilized is the mean squared error (MSE) between the reconstructed and original pixels of the masked regions. This focus on masked area reconstruction drives the model to infer missing information effectively.
  • During training, both the encoder and decoder are jointly optimized. However, in downstream tasks, the decoder is typically discarded, and the encoder is fine-tuned for specific applications.
  • MAE’s performance is highlighted by its excellent results on benchmarks such as ImageNet-1K, where a ViT-Huge model achieves 87.8% accuracy, surpassing methods using only ImageNet-1K data. Its efficacy extends to transfer learning, outperforming supervised pre-training on tasks like object detection and instance segmentation on COCO, and semantic segmentation on ADE20K.
  • The paper also explores various practical aspects of MAE, including the masking ratio, decoder design variations, data augmentation, and mask sampling strategy.
  • In summary, ViTMAE is a simple yet effective self-supervised pre-training technique in which the authors combine a Vision Transformer with a masked autoencoder. The images are first masked (with a relatively high masking ratio: 75%, compared to BERT’s 15%) and then the model tries to learn the features by trying to reconstruct the original image. The full image is never fed to the encoder; only the visible patches are (and that is the only thing the encoder sees). Next, mask tokens are added where the masked patches were (similar to BERT), and the mask tokens and encoded patches are fed to the decoder. The decoder then tries to reconstruct the original image. As a result, the authors found that a high masking ratio works well both for fine-tuning on downstream tasks and for linear probing.
  • The simplicity and effectiveness of MAE, especially in learning robust visual representations without semantic entities, marks a significant advancement in self-supervised learning in computer vision, akin to the evolution observed in NLP.
  • Weights docs; Visualization demo: Masked Autoencoders (MAE)
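  • The random patch masking at the heart of MAE can be sketched as follows. This is a simplified NumPy illustration of the sampling step (per-image random shuffling, keeping the first 25%), not the released implementation:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    # patches: (N, L, D) -- a batch of L patch embeddings per image
    if rng is None:
        rng = np.random.default_rng(0)
    n, l, _ = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    ids_shuffle = np.argsort(rng.random((n, l)), axis=1)  # random permutation per image
    ids_restore = np.argsort(ids_shuffle, axis=1)
    ids_keep = ids_shuffle[:, :len_keep]
    # the encoder only ever sees these visible patches
    visible = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)
    # binary mask in original patch order: 1 = masked (to reconstruct), 0 = visible
    mask = np.ones((n, l))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return visible, mask, ids_restore
```

Because the encoder processes only the `visible` tensor, its cost shrinks roughly fourfold at a 75% masking ratio, which is where the training speedup comes from; `ids_restore` lets the decoder put mask tokens back in their original positions.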
The Effects of Regularization and Data Augmentation are Class Dependent
  • Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model’s complexity. Current Deep Networks heavily rely on regularizers such as data augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters.
  • This paper by Balestriero et al. from Facebook AI in 2022 demonstrates that regularization techniques such as DA or weight decay increase the average test performance at the cost of significant performance drops on some specific classes. In other words, regularization produces a model with a reduced complexity that is unfair across classes. By focusing on maximizing aggregate performance statistics, they have produced learning mechanisms that can be potentially harmful, especially in transfer learning tasks. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performances on some classes, e.g., on ImageNet with a ResNet50, the “barn spider” classification test accuracy falls from 68% to 46% merely by introducing random crop DA during training. Even more surprisingly, such performance drops also appear when introducing uninformative regularization techniques such as weight decay.
  • Those results demonstrate that the search for ever-increasing generalization performance, averaged over all classes and samples, has left us with models and regularizers that silently sacrifice performance on some classes. In fact, they also observe that varying the amount of regularization employed during pre-training on a specific dataset impacts the per-class performances of that pre-trained model on different downstream tasks, e.g., an ImageNet pre-trained ResNet50 deployed on iNaturalist sees its performance fall from 70% to 30% on a particular class when introducing random crop DA during the ImageNet pre-training phase. Those results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
  • Here’s an intuitive explanation:
    • Some types of data augmentation and weight decay help some categories but hurt others.
    • Categories largely identifiable by color or texture (e.g., a yellow bird, a textured mushroom) are unaffected by aggressive cropping, while categories identifiable by shape (e.g., a corkscrew) see a performance degradation with aggressive cropping that only retains part of the object.
    • Conversely, color jitter does not affect shape- or texture-based categories (e.g., zebra), but hurts color-based categories (e.g., basketball).
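  • The paper’s core diagnostic is simply to look past the aggregate number. A minimal sketch of the per-class accuracy breakdown that surfaces such class-dependent effects:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    # aggregate accuracy can hide large per-class swings, so report both
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        idx = y_true == c
        if idx.any():
            acc[c] = (y_pred[idx] == c).mean()
    return acc
```

Comparing this vector between a model trained with and without a given regularizer (rather than comparing only the means) reveals which classes pay for the average gain.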
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
  • Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. Moreover, many graphics problems rely on task-specific data structures to exploit the sparsity or smoothness of the problem at hand.
  • This paper by Muller et al. from Nvidia in 2022 proposes InstantNeRF, which reduces this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations. InstantNeRF offers near-instant training of neural graphics primitives on a single GPU for multiple tasks.
  • To this end, a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Multi-resolution hash encoding provides a practical learning-based alternative that automatically focuses on relevant detail, independent of task at hand. Its low overhead allows it to be used even in time-constrained settings like online training and inference.
  • In the gigapixel image task, they represent a high-resolution image by a neural network. SDF learns a signed distance function in 3D space whose zero level-set represents a 2D surface. NeRF uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. Lastly, neural volume learns a denoised radiance and density field directly from a volumetric path tracer. In all tasks, their encoding and its efficient implementation provide clear benefits: instant training, high quality, and simplicity. Their encoding is task-agnostic: they use the same implementation and hyperparameters across all tasks and only vary the hash table size, which trades off quality and performance.
  • The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. In the context of neural network input encodings, it is a drop-in replacement, for example speeding up NeRF by several orders of magnitude and matching the performance of concurrent non-neural 3D reconstruction techniques.
  • They leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations.
  • While slow computational processes in any setting, from lightmap baking to the training of neural networks, can lead to frustrating workflows due to long iteration times, they achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920\(\times\)1080. They have demonstrated that single-GPU training times measured in seconds are within reach for many graphics applications, allowing neural approaches to be applied where previously they may have been discounted.
  • Code.
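  • The core of the multiresolution hash encoding can be sketched as follows for the 2D case with bilinear interpolation. The per-dimension hashing primes follow the paper, but the table sizes, base resolution, and growth factor below are illustrative choices, and the real system stores the tables as trainable parameters:

```python
import numpy as np

# Per-dimension hashing primes from the paper (2D case; the first is 1)
PRIMES = np.array([1, 2654435761], dtype=np.uint64)

def hash_encode(xy, tables, base_res=16, growth=1.5):
    # xy: a point in [0, 1)^2; tables: one (T, F) feature table per resolution level
    feats = []
    for level, table in enumerate(tables):
        T = table.shape[0]
        res = int(base_res * growth ** level)
        scaled = np.asarray(xy, dtype=np.float64) * res
        cell = np.floor(scaled).astype(np.uint64)
        frac = scaled - cell
        f = np.zeros(table.shape[1])
        # bilinearly interpolate the features stored at the 4 cell corners
        for dx in (0, 1):
            for dy in (0, 1):
                corner = cell + np.array([dx, dy], dtype=np.uint64)
                hashed = corner * PRIMES           # uint64 wrap-around multiply
                h = int(hashed[0] ^ hashed[1]) % T
                w = (frac[0] if dx else 1 - frac[0]) * (frac[1] if dy else 1 - frac[1])
                f = f + w * table[h]
        feats.append(f)
    return np.concatenate(feats)  # concatenated across levels; fed to a tiny MLP
```

Because the interpolation weights at each level sum to 1, the encoding is continuous in the input, and gradients flow back into the (trainable) table entries; collisions at fine levels are disambiguated by the coarser levels, as described above.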
Pix2seq: A Language Modeling Framework for Object Detection
  • This paper by Chen et al. from Google Brain in ICLR 2022 introduces Pix2Seq, a simple yet generic framework for object detection. By casting object detection as a language modeling task conditioned on the observed pixel inputs, Pix2Seq largely simplifies the detection pipeline, removing most of the specialization in modern detection algorithms.
  • Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and they train a neural network to perceive the image and generate the desired sequence.
  • Pix2Seq is based mainly on the intuition that if a neural network knows about where and what the objects are, they just need to teach it how to read them out.
  • Beyond the use of task-specific data augmentations, their approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
  • Pix2Seq can be extended beyond object detection to solving a large variety of vision tasks where the output can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering).
  • A major limitation of Pix2Seq is that autoregressive modeling is expensive for long sequences (mainly during model inference). Practical measures to mitigate the issue include: 1) stopping inference when the ending token is produced (e.g., in the COCO dataset there are, on average, 7 objects per image, leading to a relatively small number of ∼35 tokens), 2) applying it to offline inference, or to online scenarios where the objects of interest are relatively sparse (e.g., locating a specific object given a language description).
  • However, future work is needed to make it faster for real-time object detection applications. Another limitation is that the current approach for training Pix2Seq is entirely based on human annotation, and by reducing such dependence and letting the model train using unlabeled data in an unsupervised fashion, they can enable far more applications in the vision domain.
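  • The token construction at the heart of Pix2Seq can be sketched as follows. The quantization into discrete coordinate bins follows the paper’s approach; the exact bin count and the class-offset vocabulary layout below are illustrative assumptions:

```python
def box_to_tokens(box, label, bins=1000):
    # box: (ymin, xmin, ymax, xmax), coordinates normalized to [0, 1];
    # each coordinate becomes one of `bins` discrete tokens, and the class
    # label is offset past the coordinate vocabulary
    coords = [min(int(c * bins), bins - 1) for c in box]
    return coords + [bins + label]

def tokens_to_box(tokens, bins=1000):
    *coords, cls_tok = tokens
    box = [(t + 0.5) / bins for t in coords]   # de-quantize to bin centers
    return box, cls_tok - bins
```

Each object thus contributes 5 tokens (4 coordinates plus a class), which is why an average COCO image yields only ∼35 tokens and why inference can stop early once the ending token appears.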
An Improved One millisecond Mobile Backbone
  • Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device.
  • This paper by Vasu et al. from Apple in 2022 performs extensive analysis of different metrics by deploying several mobile friendly networks on a mobile device. They identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks.
  • To this end, they design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. They show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
  • A MobileOne block has two different structures at train time and test time, inspired by RepVGG: Making VGG-style ConvNets Great Again. The figure below from the paper shows the train-time MobileOne block with reparameterizable branches (left) and the MobileOne block at inference, where the branches are reparameterized (right). Either ReLU or SE-ReLU is used as the activation. The trivial over-parameterization factor \(k\) is a hyperparameter which is tuned for every variant.

  • Their best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. MobileOne obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, they show that their model generalizes to multiple tasks – image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
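  • The branch reparameterization shared with RepVGG can be verified numerically: parallel 3\(\times\)3, 1\(\times\)1, and identity branches collapse into a single 3\(\times\)3 kernel at inference. The sketch below omits BatchNorm folding and the \(k\)-fold over-parameterization for brevity, and the naive `conv2d` is only for checking equivalence:

```python
import numpy as np

def conv2d(x, w):
    # x: (C, H, W); w: (Cout, Cin, k, k); stride 1, 'same' zero padding
    co, ci, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((co,) + x.shape[1:])
    for i in range(x.shape[1]):
        for j in range(x.shape[2]):
            out[:, i, j] = np.einsum('ockl,ckl->o', w, xp[:, i:i + k, j:j + k])
    return out

def reparameterize(w3, w1, channels):
    # fold parallel 3x3, 1x1 and identity branches into one 3x3 kernel
    w = w3.copy()
    w[:, :, 1, 1] += w1[:, :, 0, 0]   # a 1x1 kernel sits at the 3x3 center
    for c in range(channels):
        w[c, c, 1, 1] += 1.0          # identity branch == centered Dirac kernel
    return w
```

The fused model runs a single convolution per block, which is why the multi-branch structure adds capacity at train time without any latency cost at inference.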
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  • This paper by Saharia et al. from Google Brain in 2022 presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen showcases the effectiveness of frozen large pretrained language models as text encoders for the text-to-image generation using diffusion models.
  • Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. With these novel components, Imagen produces \(1024 \times 1024\) samples with unprecedented photorealism and alignment with text.
  • Their key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
  • Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.
  • To assess text-to-image models in greater depth, they introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, they compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
  • Google page with an overview of the results.
Swin Transformer V2: Scaling Up Capacity and Resolution
  • Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings.
  • This paper by Liu et al. from Microsoft Research in 2022 explores large-scale models in computer vision. They tackle three major issues in training and application of large vision models: training instability, resolution gaps between pre-training and fine-tuning, and hunger for labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images.
  • Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to \(1,536 \times 1,536\) resolution.
  • By scaling up capacity and resolution, Swin V2 sets new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also, note their training is much more efficient than that in Google’s billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.
  • The diagram below from the paper presents the techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to \(1,536 \times 1,536\) resolution, including the res-post-norm and scaled cosine attention to make the model easier to scale up in capacity, as well as a log-spaced continuous relative position bias approach which lets the model be transferred more effectively across window resolutions.

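  • The scaled cosine attention used to stabilize training can be sketched as follows. In the paper, the temperature is a learnable per-head parameter clamped so the denominator stays at least 0.01; here it is just a scalar argument, and relative position bias is omitted:

```python
import numpy as np

def scaled_cosine_attention(q, k, v, tau=0.1):
    # Swin V2 replaces dot-product logits with cosine similarity divided by a
    # temperature tau, keeping logits bounded and training stable at scale
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = (qn @ kn.T) / max(tau, 0.01)
    logits -= logits.max(axis=-1, keepdims=True)   # softmax stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

Because cosine similarity is bounded in \([-1, 1]\), a few dominant feature magnitudes in deep layers can no longer blow up the attention logits, which is the instability the paper observed when scaling capacity.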
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  • This paper by Yu et al. from Google Research in 2022 presents the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. In particular, Parti is able to represent a broad range of visual world knowledge, such as landmarks, specific years, makes and models of vehicles, pottery types, visual styles – and integrate these into novel settings and configurations.
  • Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes.
  • Their approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
  • Second, they achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO.
  • Their detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects.
  • They also provide an extensive discussion of the limitations, including a breakdown of many kinds of model errors and challenges, that they hope will be useful both for contextualizing what the model can do and for highlighting opportunities for future research.
  • Parti opens up opportunities to integrate scaled autoregressive models with diffusion models, starting with having an autoregressive model generate an initial low-resolution image and then iteratively refining and super-resolving images with diffusion modules. Furthermore, the authors suggest conducting more experiments and comparisons with both autoregressive and diffusion models in order to understand their relative capabilities, to address key questions of fairness and bias in both classes of models and strategies for mitigating them, and to identify optimal opportunities for combining their strengths.
  • Key takeaways:
    • One of the most exciting research fields nowadays is text-to-image modeling. OpenAI's DALL-E 2 and Google's Imagen are phenomenal models in this area. Both use a Transformer to encode the text and diffusion models to generate the image. Google's Parti, in contrast, consists solely of (really big) Transformer modules:
      • Text encoder: as with previous works, encoding the text with a Transformer is a no-brainer.
      • Image tokenizer and de-tokenizer: instead of generating the entire image, Parti will generate one patch at a time. A ViT-based module is used to encode and decode those patches.
      • Conditional decoder: conditioned on the encoded text and the tokenized image patches generated so far, a Transformer is used to generate the next patch (with the help of the de-tokenizer from the previous step).
  • Google page.
  • Code.
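The autoregressive decoding loop above can be sketched as follows; the decoder stand-in, vocabulary size, and token count are toy assumptions (Parti's real detokenizer is a ViT-VQGAN decoder, its codebook has thousands of entries, and the decoder is a 20B-parameter Transformer):

```python
import random

VOCAB_SIZE = 8          # real ViT-VQGAN codebooks are far larger
NUM_IMAGE_TOKENS = 16   # real images use e.g. 32x32 = 1024 tokens

def next_token_logits(text_emb, prefix):
    # Stand-in for the conditional decoder: deterministic pseudo-logits.
    r = random.Random(repr((text_emb, prefix)))
    return [r.random() for _ in range(VOCAB_SIZE)]

def generate_image_tokens(text_emb):
    tokens = []
    for _ in range(NUM_IMAGE_TOKENS):          # one discrete token per step
        logits = next_token_logits(text_emb, tokens)
        tokens.append(max(range(VOCAB_SIZE), key=logits.__getitem__))  # greedy pick
    return tokens  # a de-tokenizer (ViT-VQGAN decoder) would map these to pixels

tokens = generate_image_tokens("a pomeranian on a skateboard")
print(len(tokens))  # 16
```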

Sequencer: Deep LSTM for Image Classification
  • In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision.
  • This paper by Tatsunami and Taki from Rikkyo, Japan in NeurIPS 2022 proposes Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers.
  • They also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K.
  • Of note is the fact that the overall data appetite and time to converge were reported to be much better than those of ViT and its cousins, since CNNs and LSTMs have great sample efficiency. The paper also shows that Sequencer has good transferability and robust resolution adaptability, degrading gracefully even when the input resolution is doubled at inference time.
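The vertical/horizontal decomposition can be made concrete with a toy sketch, where a running mean stands in for the BiLSTM (an assumption for brevity; the actual Sequencer2D block uses real LSTM layers and fuses the two branches with a pointwise layer):

```python
import numpy as np

def toy_rnn(seq):
    """Stand-in for an LSTM: a causal running mean along the sequence axis."""
    return np.cumsum(seq, axis=0) / np.arange(1, seq.shape[0] + 1)[:, None]

def sequencer2d_mix(x):
    """Sequencer2D token mixing: process each column of patches vertically and
    each row horizontally, then concatenate the two results channel-wise."""
    H, W, C = x.shape
    vert = np.stack([toy_rnn(x[:, w]) for w in range(W)], axis=1)   # (H, W, C)
    horiz = np.stack([toy_rnn(x[h]) for h in range(H)], axis=0)     # (H, W, C)
    return np.concatenate([vert, horiz], axis=-1)                   # (H, W, 2C)

x = np.random.default_rng(1).normal(size=(4, 4, 8))   # 4x4 grid of 8-dim patches
y = sequencer2d_mix(x)
print(y.shape)  # (4, 4, 16)
```

The point of the decomposition is that each 1D pass has sequence length \(H\) or \(W\) rather than \(HW\), keeping the recurrent cost manageable.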

High-Resolution Image Synthesis with Latent Diffusion Models
  • The following paper summary has been contributed by Zhibo Zhang.
  • Diffusion models are known to be computationally expensive given that they require many steps of diffusion and denoising diffusion operations in possibly high-dimensional input feature spaces.
  • This paper by Rombach et al. from Ludwig Maximilian University of Munich & IWR, Heidelberg University and Runway ML in CVPR 2022 introduces diffusion models that operate on the latent space, aiming at generating high-resolution images with lower computation demands compared to those that operate directly on the pixel space.
  • In particular, the authors adopted an autoencoder that compresses the input images into a lower dimensional latent space. The autoencoder relies on either KL regularization or VQ regularization to constrain the variance of the latent space.
  • As shown in the illustration figure below by Rombach et al., in the latent space, the latent representation of the input image goes through a total of \(T\) diffusion operations to get the noisy representation. A U-Net is then applied on top of the noisy representation for \(T\) iterations to produce the denoised version of the representation. In addition, the authors introduced a cross attention mechanism to condition the denoising process on other types of inputs such as text and semantic maps.
  • In the final stage, the denoised representation will be mapped back to the pixel space using the decoder to get the synthesized image.
  • Empirically, the best performing latent diffusion model (with a carefully chosen downsampling factor) achieved competitive FID scores in image generation when comparing with a few other state-of-the-art generative models such as variations of generative adversarial nets on a few datasets including the CelebA-HQ dataset.
  • Code
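The stages above can be sketched end-to-end with linear stand-ins for the trained autoencoder (dimensions, the noise schedule, and the elided "denoiser" are all illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D_PIXEL, D_LATENT, T = 64, 8, 10        # downsampling factor f = 64 / 8
enc = rng.normal(size=(D_LATENT, D_PIXEL)) / np.sqrt(D_PIXEL)   # toy encoder
dec = rng.normal(size=(D_PIXEL, D_LATENT)) / np.sqrt(D_LATENT)  # toy decoder

def diffuse(z0, t, alpha_bar):
    """Forward process: q(z_t | z_0) = N(sqrt(abar_t) z_0, (1 - abar_t) I)."""
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps

x = rng.normal(size=D_PIXEL)             # "image" in pixel space
z0 = enc @ x                             # compress: diffusion now runs in 8-D, not 64-D
alpha_bar = np.cumprod(np.linspace(0.99, 0.95, T))
z_T = diffuse(z0, T - 1, alpha_bar)      # noisy latent after T forward steps
# A conditional U-Net would iteratively denoise z_T back toward z0; then decode:
x_hat = dec @ z0                         # decoder maps latents back to pixel space
print(z0.shape, x_hat.shape)  # (8,) (64,)
```

The compute saving is exactly the ratio of latent to pixel dimensionality: every diffusion and denoising step runs in the small latent space, and the autoencoder is applied only once at each end.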

Make-A-Video: Text-to-Video Generation without Text-Video Data
  • This paper by Singer et al. from Meta AI in 2022 proposes Make-A-Video – an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
  • Their intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models. They design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
  • First, they decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, they design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V.
  • In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.

Denoising Diffusion Implicit Models
  • Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample.
  • This paper by Song et al. in ICLR 2021 from Ermon’s lab at Stanford presents denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs to accelerate the sampling process.
  • In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. DDIMs construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
  • They empirically demonstrate that DDIMs can produce high quality samples 10x to 50x faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
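The sampling update can be sketched as below; this follows the standard DDIM step, with the \(\eta\) knob interpolating between deterministic DDIM (\(\eta = 0\)) and DDPM-like stochastic sampling:

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=None):
    """One DDIM update. With eta=0 the reverse process is deterministic,
    which is what enables sampling with 10-50x fewer steps."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    sigma = eta * np.sqrt((1 - abar_prev) / (1 - abar_t)) * np.sqrt(1 - abar_t / abar_prev)
    dir_xt = np.sqrt(1 - abar_prev - sigma**2) * eps_pred   # direction pointing to x_t
    noise = (sigma * rng.normal(size=x_t.shape)) if eta > 0 else 0.0
    return np.sqrt(abar_prev) * x0_pred + dir_xt + noise

x_t = np.ones(4)
eps = np.zeros(4)                 # a trained network would predict this noise
x_prev = ddim_step(x_t, eps, abar_t=0.5, abar_prev=0.8)
print(x_prev.shape)  # (4,)
```

Because the deterministic map from initial noise to image is fixed, two nearby latents produce nearby images, which is what makes the semantically meaningful latent-space interpolation possible.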

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  • This paper by Dong et al. from the University of Science and Technology of China, Microsoft Research Asia, and Microsoft Cloud in 2022 presents the CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
  • A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, they develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width.
  • They provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost.
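A minimal sketch of the cross-shaped window computation, assuming a single head per branch and a fixed stripe width (the real model uses many heads per group and varies the stripe width with depth):

```python
import numpy as np

def softmax_attn(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def cswin_attention(x, sw=2):
    """Cross-shaped window self-attention sketch: half the channels attend
    within horizontal stripes of width `sw`, the other half within vertical
    stripes, forming a cross-shaped receptive field. x: (H, W, C)."""
    H, W, C = x.shape
    h_branch, v_branch = x[..., : C // 2].copy(), x[..., C // 2:].copy()
    for r in range(0, H, sw):   # horizontal stripes: rows [r, r+sw)
        t = h_branch[r : r + sw].reshape(-1, C // 2)
        h_branch[r : r + sw] = softmax_attn(t, t, t).reshape(sw, W, C // 2)
    for c in range(0, W, sw):   # vertical stripes: cols [c, c+sw)
        t = v_branch[:, c : c + sw].reshape(-1, C // 2)
        v_branch[:, c : c + sw] = softmax_attn(t, t, t).reshape(H, sw, C // 2)
    return np.concatenate([h_branch, v_branch], axis=-1)

x = np.random.default_rng(2).normal(size=(4, 4, 8))
y = cswin_attention(x)
print(y.shape)  # (4, 4, 8)
```

Each stripe costs attention over only \(sw \times W\) (or \(H \times sw\)) tokens instead of \(H \times W\), which is where the savings over global attention come from.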
  • They also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks.
  • Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting.
  • By further pretraining on the larger dataset ImageNet-21K, they achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU.
  • The following figure from the paper illustrates different self-attention mechanisms; CSWin is fundamentally different in two aspects. First, they split the multi-heads (\(\{h_1, \ldots, h_K\}\)) into two groups and perform self-attention in horizontal and vertical stripes simultaneously. Second, they adjust the stripe width according to the network depth, which achieves a better trade-off between computation cost and capability.

  • The following figure from the paper shows: (Left) the overall architecture of their proposed CSWin Transformer; (Right) the illustration of CSWin Transformer block.

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
  • This paper by Li et al. from Facebook AI Research and UC Berkeley studies Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.
  • They present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
  • They instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work.
  • They further compare MViTv2s’ pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.
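Concretely, the relative position term is added to the attention logits, and (per the paper) the embedding \(R_{p(i), p(j)}\) is decomposed along the spatio-temporal axes:

\[\operatorname{Attn}(Q, K, V)=\operatorname{Softmax}\left(\frac{Q K^{\top}+E^{(\mathrm{rel})}}{\sqrt{d}}\right) V, \qquad E^{(\mathrm{rel})}_{ij}=Q_i \cdot R_{p(i), p(j)},\]

\[R_{p(i), p(j)}=R^{\mathrm{h}}_{h(i), h(j)}+R^{\mathrm{w}}_{w(i), w(j)}+R^{\mathrm{t}}_{t(i), t(j)},\]

where \(h(i)\), \(w(i)\), and \(t(i)\) index a token's position along each axis; the decomposition shrinks the number of learned embeddings from \(O(THW)\) distinct 3D offsets to \(O(T + H + W)\) per-axis offsets.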
  • The following figure from the paper shows the improved Pooling Attention mechanism, incorporating the decomposed relative position embedding, \(R_{p(i), p(j)}\), and residual pooling connection modules in the attention block.

  • The following figure from the paper illustrates MViTv2 as a multiscale transformer with state-of-the-art performance across three visual recognition tasks.

iBOT: Image BERT Pre-training with Online Tokenizer
  • This paper by Zhou et al. from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University, and UC Santa Cruz, presented at ICLR 2022, introduces a novel self-supervised framework for Vision Transformers (ViTs). This framework, named iBOT, is distinctive in its use of an online tokenizer in Masked Image Modeling (MIM) and the incorporation of self-distillation.
  • Self-distillation is a process where a model learns to predict its own outputs, effectively teaching itself. In the context of iBOT, it involves the model learning to predict masked parts of the input image based on the unmasked parts, allowing it to refine its understanding of visual features and patterns. This method of learning from its own predictions, rather than relying on external labels or pre-trained models, is a key aspect of iBOT’s self-supervised learning approach. Furthermore, in iBOT, the teacher model’s weights are an Exponential Moving Average (EMA) of the student’s weights. This method ensures that the teacher’s weights are a temporally smoothed version of the student’s, enabling more stable and consistent learning targets for the student model during the self-distillation process.
    • Self-distillation, proposed in DINO, distills knowledge not from posterior distributions \(P_{\theta}(x)\) but from past iterations of the model itself \(P_{\theta'}(x)\), and is cast as a discriminative self-supervised objective. Given the training set \(\mathcal{I}\), an image \(x \sim \mathcal{I}\) is sampled uniformly and two random augmentations are applied, yielding two distorted views \(u\) and \(v\). The two distorted views are then put through a teacher-student framework to get the predictive categorical distributions over the [CLS] token: \(v^{[\texttt{CLS}]} = P_{\theta'}^{[\texttt{CLS}]}(v)\) and \(u^{[\texttt{CLS}]} = P_{\theta}^{[\texttt{CLS}]}(u)\). The knowledge is distilled from teacher to student by minimizing their cross-entropy, formulated as
    \[\mathcal{L}_{\texttt{CLS}} = -P_{\theta'}^{[\texttt{CLS}]}(v)^T \log P_{\theta}^{[\texttt{CLS}]}(u).\]
    • The teacher and the student share the same architecture, consisting of a backbone \(f\) (e.g., ViT) and a projection head \(h_{\texttt{CLS}}\). The parameters of the teacher network \(\theta'\) are an exponential moving average (EMA) of the parameters of the student network \(\theta\).
  • The loss function employed in iBOT is critical to its design. It combines the self-distillation loss, where the model’s own predictions are used as soft labels for learning, with the standard Masked Image Modeling (MIM) loss. This combined loss function encourages the model to learn more robust and generalizable features by aligning the predictions of the teacher and student models within the self-distillation framework.
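A minimal sketch of the two self-distillation ingredients (the teacher-supervised cross-entropy and the EMA teacher update), with DINO-style temperatures and momentum assumed for illustration; real iBOT applies this over both the [CLS] token and the masked patch tokens:

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    e = np.exp(z - z.max())
    return e / e.sum()

def self_distillation_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy between the (sharpened) teacher distribution and the
    student distribution, as in the L_[CLS] term."""
    p_teacher = softmax(teacher_logits, t_t)   # lower temperature -> sharper target
    p_student = softmax(student_logits, t_s)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

def ema_update(theta_teacher, theta_student, m=0.996):
    """Teacher weights are an exponential moving average of student weights."""
    return m * theta_teacher + (1 - m) * theta_student

loss = self_distillation_loss(np.array([1.0, 2.0, 0.5]), np.array([1.1, 1.9, 0.4]))
theta_t = ema_update(np.zeros(3), np.ones(3))
print(loss > 0, theta_t)  # True [0.004 0.004 0.004]
```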
  • The implementation details of iBOT include the use of Vision Transformers and Swin Transformers with various parameter configurations. The models are pre-trained and fine-tuned on ImageNet datasets using a mix of self-distillation and MIM. The unique aspect of iBOT is its use of an online tokenizer, which evolves during training, learning to capture high-level visual semantics. This tokenizer is an integral part of the self-distillation process, helping the model to recover masked patch tokens accurately.
  • iBOT’s performance is evaluated across several tasks, including image classification, object detection, instance segmentation, and semantic segmentation. It achieves state-of-the-art results on these tasks, demonstrating its effectiveness. Notably, iBOT shows superior performance in terms of linear probing accuracy and fine-tuning accuracy on ImageNet-1K, as well as robustness against common image corruptions and adaptability to different datasets.
  • The following figure from the paper shows the overview of the iBOT framework, performing masked image modeling with an online tokenizer. Given two views \(\boldsymbol{u}\) and \(\boldsymbol{v}\) of an image \(\boldsymbol{x}\), each view is passed through a teacher network \(h_t \circ f_t\) and a student network \(h_s \circ f_s\). iBOT minimizes two losses. The first loss \(\mathcal{L}_{\text{[CLS]}}\) is self-distillation between cross-view [CLS] tokens. The second loss \(\mathcal{L}_{\text{MIM}}\) is self-distillation between in-view patch tokens, with some tokens masked and replaced by \(\boldsymbol{e}_{[\mathrm{MASK}]}\) for the student network. The objective is to reconstruct the masked tokens with the teacher network's outputs as supervision.

  • An in-depth analysis within the paper reveals that iBOT learns high-level semantic patterns in image patches, contributing to its strong performance on visual tasks. This is a notable advancement over traditional tokenization methods in Vision Transformers, which often focus on low-level details. The authors also conduct various ablation studies, underscoring the importance of the online tokenizer and its role in achieving high performance.
  • In conclusion, iBOT represents a significant step forward in self-supervised learning for Vision Transformers, particularly in how it handles the tokenization and learning of visual semantics. The results suggest that this method could be highly effective for a wide range of visual recognition tasks.

Imagen Video: High Definition Video Generation with Diffusion Models
  • This paper by Ho et al. from Google Research, Brain Team, introduces Imagen Video, a text-conditional video generation system leveraging a cascade of video diffusion models. Imagen Video generates high-definition videos from text prompts using a base video generation model and a sequence of interleaved spatial and temporal super-resolution models.
  • The core contributions and methodology of this work include the following technical details:
    • Architecture and Components: Imagen Video utilizes a frozen T5 text encoder to process the text prompts, followed by a base video diffusion model and multiple spatial and temporal super-resolution (SSR and TSR) models. Specifically, the system comprises seven sub-models: one base video generation model, three SSR models, and three TSR models. This cascade structure allows the system to generate 1280x768 resolution videos at 24 frames per second, with a total of 128 frames (approximately 5.3 seconds).
    • Diffusion Models: The diffusion models in Imagen Video are based on continuous-time formulations, with a forward process defined as a Gaussian process. The training objective is to denoise the latent variables through a noise-prediction loss. The v-parameterization is employed, predicting \(v = \alpha_t \epsilon - \sigma_t x\) rather than the noise \(\epsilon\) directly, which ensures numerical stability and avoids color-shifting artifacts.
    • Text Conditioning and Cascading: Text conditioning is achieved by injecting contextual embeddings from the T5-XXL text encoder into all models, ensuring alignment between the generated video and the text prompt. The cascading approach involves generating a low-resolution video first, which is then progressively enhanced through spatial and temporal super-resolution models. This method allows for high-resolution outputs without overly complex individual models.
  • The following figure from the paper shows the cascaded sampling pipeline starting from a text prompt input to generating a 5.3-second, 1280 \(\times\) 768 video at 24fps. “SSR” and “TSR” denote spatial and temporal super-resolution respectively, and videos are labeled as frames \(\times\) width \(\times\) height. In practice, the text embeddings are injected into all models, not just the base model.

  • Implementation Details:
    • v-parameterization: Used for numerical stability and to avoid artifacts in high-resolution video generation.
    • Conditioning Augmentation: Gaussian noise augmentation is applied to the conditioning inputs during training to reduce domain gaps and facilitate parallel training of different models in the cascade.
    • Joint Training on Images and Videos: The models are trained on a mix of video-text pairs and image-text pairs, treating individual images as single-frame videos. This approach allows the model to leverage larger and more diverse image-text datasets.
    • Classifier-Free Guidance: This method enhances sample fidelity and ensures that the generated video closely follows the text prompt by adjusting the denoising prediction.
    • Progressive Distillation: This technique is used to speed up the sampling process. It involves distilling a trained DDIM sampler into a model requiring fewer steps, thus significantly reducing computation time while maintaining sample quality.
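The v-parameterization mentioned above can be checked numerically; this sketch assumes the variance-preserving convention \(\alpha_t^2 + \sigma_t^2 = 1\) from the progressive-distillation literature:

```python
import numpy as np

def v_target(x0, eps, alpha_t, sigma_t):
    """v-parameterization target: the network predicts v = alpha_t * eps - sigma_t * x0
    instead of predicting the noise eps directly."""
    return alpha_t * eps - sigma_t * x0

def reconstruct_x0(z_t, v_pred, alpha_t, sigma_t):
    """Given z_t = alpha_t * x0 + sigma_t * eps, recover x0 from predicted v."""
    return alpha_t * z_t - sigma_t * v_pred

x0, eps = np.array([1.0, -2.0]), np.array([0.3, 0.7])
alpha_t, sigma_t = 0.8, 0.6            # alpha^2 + sigma^2 = 1 (variance preserving)
z_t = alpha_t * x0 + sigma_t * eps
v = v_target(x0, eps, alpha_t, sigma_t)
x0_rec = reconstruct_x0(z_t, v, alpha_t, sigma_t)
print(np.allclose(x0_rec, x0))  # True
```

Expanding the reconstruction shows why: \(\alpha_t z_t - \sigma_t v = (\alpha_t^2 + \sigma_t^2)\,x_0 = x_0\), so a perfect \(v\) prediction pins down \(x_0\) exactly at every noise level.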
  • Experiments and Findings:
    • The model shows high fidelity in video generation and can produce diverse content, including 3D object understanding and various artistic styles.
    • Scaling the parameter count of the video U-Net leads to improved performance, indicating that video modeling benefits significantly from larger models.
    • The v-parameterization outperforms the \(\epsilon\)-parameterization, especially at higher resolutions, due to faster convergence and reduced color inconsistencies.
    • Distillation reduces sampling time by 18x, making the model more efficient without sacrificing perceptual quality.
  • Conclusion: Imagen Video extends text-to-image diffusion techniques to video generation, achieving high-quality, temporally consistent videos. The integration of various advanced methodologies from image generation, such as v-parameterization, conditioning augmentation, and classifier-free guidance, demonstrates their effectiveness in the video domain. The work also highlights the potential for further improvements in video generation capabilities through continued research and development.
  • Project page


Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
  • Vision Transformers (ViTs) have dominated several tasks in computer vision. While architecturally simple, their accuracy and ability to scale make them still a popular choice today. Moreover, their simplicity unlocks the use of powerful pretraining strategies such as MAE, which make ViTs computationally and data efficient to train. However, this simplicity comes at a cost: by using the same spatial resolution and number of channels throughout the network, ViTs make inefficient use of their parameters. This is in contrast to prior “hierarchical” or “multi-scale” models (such as AlexNet or ResNet), which use fewer channels but higher spatial resolution in early stages with simpler features, and more channels but lower spatial resolution later in the model with more complex features.
  • While several domain-specific hierarchical vision transformers (such as Swin or MViT) have been introduced, they have added several vision-specific components in pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts.
  • This paper by Ryali et al. from Facebook AI Research, Georgia Tech, and Johns Hopkins University argues that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), they can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy.
  • In the process, they create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. They evaluate Hiera on a variety of tasks for image and video recognition.
  • The following figure from the paper gives an overview of the proposed Hiera architecture.

Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust
  • Watermarking the outputs of generative models is a crucial technique for tracing copyright and preventing potential harm from AI-generated content.
  • This paper by Wen et al. from the University of Maryland introduces a novel technique called Tree-Ring Watermarking that robustly fingerprints diffusion model outputs.
  • Unlike existing methods that perform post-hoc modifications to images after sampling, Tree-Ring Watermarking subtly influences the entire sampling process, resulting in a model fingerprint that is invisible to humans. The watermark embeds a pattern into the initial noise vector used for sampling.
  • These patterns are structured in Fourier space so that they are invariant to perturbations such as convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal.
  • They demonstrate that this technique can be easily applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible loss in FID. Their watermark is semantically hidden in the image space and is far more robust than watermarking alternatives that are currently deployed.
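A toy NumPy sketch of the Fourier-space embed/detect loop (the diffusion-inversion step is elided, and the ring radius, grid size, and key amplitude are arbitrary illustrative choices):

```python
import numpy as np

N, R = 32, 6
yy, xx = np.mgrid[:N, :N]
ring = np.abs(np.hypot(yy - N // 2, xx - N // 2) - R) < 1.0   # ring-shaped mask

def embed(rng):
    """Plant a ring pattern in the Fourier spectrum of the initial noise."""
    spec = np.fft.fftshift(np.fft.fft2(rng.normal(size=(N, N))))
    spec[ring] = 200.0                        # write the key into the ring
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

def detect(noise):
    """High energy on the ring in Fourier space indicates the watermark."""
    spec = np.fft.fftshift(np.fft.fft2(noise))
    return np.abs(spec[ring]).mean()

rng = np.random.default_rng(0)
marked, clean = embed(rng), rng.normal(size=(N, N))
print(detect(marked) > 2 * detect(clean))  # True
```

Because rings are radially symmetric, rotations and flips of the image leave the pattern on the same Fourier ring, which is the intuition behind the robustness claims.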
  • The following figure from the paper illustrates the pipeline for tree-ring Watermarking. A diffusion model generation is watermarked and later detected through ring-patterns in the Fourier space of the initial noise vector.

From Sparse to Soft Mixtures of Experts
  • Sparse Mixture of Experts (MoE) architectures scale model capacity without large increases in training or inference costs. MoE allows us to dramatically scale model sizes without significantly increasing inference latency. In short, each “expert” can separately attend to a different subset of tasks via different data subsets before they are combined via an input routing mechanism. Thus, the model can learn a wide variety of tasks, but still specialize when appropriate. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning.
  • This paper by Puigcerver et al. from Google DeepMind proposes Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs.
  • Sparse MoE underpins some of today's largest models (e.g., Google's GLaM and Switch Transformer, and it is rumored to be used in OpenAI's GPT-4), but it suffers from training instabilities because the discrete expert routing is not fully differentiable. Soft MoE replaces the non-differentiable expert routing with a differentiable layer. The end-to-end model is fully differentiable again, can be trained with ordinary SGD-like optimizers, and the training instabilities go away.
  • Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost.
  • The following figure from the paper illustrates the main differences between Sparse and Soft MoE layers. While the router in Sparse MoE layers (left) learns to assign individual input tokens to each of the available slots, in Soft MoE layers (right) each slot is the result of a (different) weighted average of all the input tokens. Learning to make discrete assignments introduces several optimization and implementation issues that Soft MoE sidesteps.

  • They propose a fully-differentiable sparse vision transformer (ViT) that addresses aforementioned challenges such as training instability, token dropping, and inefficient finetuning. In the context of visual recognition, Soft MoE greatly outperforms the standard ViT and popular MoE variants (Tokens Choice and Experts Choice). Soft MoE scales ViT models to >50B parameters with little effect on inference latency. For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better.
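The dispatch/combine routing can be sketched in a few lines of NumPy (expert MLPs are replaced by toy callables, and all shapes are illustrative):

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, Phi, experts, slots_per_expert):
    """Soft MoE routing sketch. X: (n_tokens, d); Phi: (d, n_slots) learnable
    per-slot parameters; experts: list of callables standing in for MLPs."""
    logits = X @ Phi                             # (n_tokens, n_slots)
    D = softmax(logits, axis=0)                  # dispatch: normalize per slot (columns)
    slot_in = D.T @ X                            # each slot = weighted avg of all tokens
    slot_out = np.concatenate([
        experts[i](slot_in[i * slots_per_expert:(i + 1) * slots_per_expert])
        for i in range(len(experts))])           # expert i processes only its own slots
    C = softmax(logits, axis=1)                  # combine: normalize per token (rows)
    return C @ slot_out                          # (n_tokens, d)

rng = np.random.default_rng(0)
n, d, n_experts, spe = 10, 4, 2, 2
X, Phi = rng.normal(size=(n, d)), rng.normal(size=(d, n_experts * spe))
experts = [lambda s: np.tanh(s), lambda s: s * 2.0]   # toy stand-ins for expert MLPs
Y = soft_moe_layer(X, Phi, experts, spe)
print(Y.shape)  # (10, 4)
```

Every operation here is differentiable, so gradients flow through the routing itself, unlike the argmax/top-k assignment in sparse MoE.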
  • The following figure from the paper illustrates the Soft MoE routing algorithm. Soft MoE first computes scores or logits for every pair of input token and slot, based on some learnable per-slot parameters. These logits are then normalized per slot (columns) and every slot computes a linear combination of all the input tokens based on these weights (in green). Each expert (an MLP in this work) then processes its slots (e.g. 2 slots per expert, in this diagram). Finally, the same original logits are normalized per token (i.e., by row) and used to combine all the slot outputs, for every input token (in blue). Dashed boxes represent learnable parameters.

  • The following infographic (source) presents an overview of their results:

Estimating Example Difficulty using Variance of Gradients
  • This paper by Agarwal et al. introduces the concept of Variance of Gradients (VoG) as a metric to determine the difficulty of examples for machine learning models. The main highlights and contributions of the paper are as follows:
  • The authors propose VoG as an efficient metric to rank data by difficulty, enabling the identification of the most challenging examples for more focused human-in-the-loop auditing. The metric is particularly adept at identifying data points that are difficult for the model to learn, often correlating with corrupted or memorized examples.
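Computing VoG from saved per-example input gradients is straightforward; the sketch below assumes the gradients have already been collected at \(K\) checkpoints (per-pixel variance across checkpoints, then averaged over pixels):

```python
import numpy as np

def vog_scores(grads):
    """grads: (K, N, P) array of input-gradient values for N examples with P
    pixels, taken at K training checkpoints. Returns one difficulty score per
    example: the per-pixel std across checkpoints, averaged over pixels."""
    mu = grads.mean(axis=0)                    # (N, P) mean gradient per pixel
    var = ((grads - mu) ** 2).mean(axis=0)     # (N, P) variance across checkpoints
    return np.sqrt(var).mean(axis=1)           # (N,) one VoG score per example

rng = np.random.default_rng(0)
stable = np.ones((5, 1, 16)) * 0.5            # gradients that never change
noisy = rng.normal(size=(5, 1, 16))           # gradients that fluctuate
scores = vog_scores(np.concatenate([stable, noisy], axis=1))
print(scores[0] < scores[1])  # the fluctuating example is ranked harder: True
```

Note that no labels are needed at scoring time, which is what makes VoG usable as an unsupervised auditing tool.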
  • The following \(5 \times 5\) grids from the paper show the top-25 Cifar-10 and Cifar-100 training-set images with the lowest and highest VoG scores in the Early (a) and Late (b) training stages, respectively, for two randomly chosen classes. Lower-VoG images evidence uncluttered backgrounds (for both apple and plane) in the Late training stage. VoG also appears to capture a color bias (red) present during the Early training stage for the apple class. The highest-VoG images in the Late training stage present unusual vantage points, with the frame zoomed in on the object of interest.

  • The study demonstrates the effectiveness of VoG across multiple architectures and datasets, including Cifar-10, Cifar-100, and ImageNet. VoG is shown to identify clusters of images with distinct semantic properties, where high VoG scores often align with images having cluttered backgrounds and atypical vantage points. The method also proves effective in surfacing memorized examples and provides insights into the learning cycle of the model.
  • An extensive evaluation of VoG’s utility as an auditing tool is conducted. This includes qualitative analysis of images with high and low VoG scores, demonstrating a correlation between VoG scores and the distinct visual properties of images. It’s observed that images with low VoG scores typically have uncluttered backgrounds and more prototypical views, whereas high VoG scores are associated with more challenging images. The study also shows that test set errors increase with higher VoG scores, especially in more complex datasets.
  • The stability of the VoG ranking is confirmed, which is crucial for building trust with users. The method produces consistent rankings across different training runs, demonstrating negligible deviation in VoG scores across samples. This stability is observed for both Cifar-10 and Cifar-100 datasets.
  • VoG’s role as an unsupervised auditing tool is highlighted, showing its capability to produce reliable rankings even without labels at test time. This feature is particularly valuable for datasets where obtaining labels for protected attributes is infeasible or intrusive.
  • The paper delves into VoG’s understanding of early and late training dynamics, revealing that VoG scores can capture different aspects of learning at various stages of training. For instance, during early training, higher VoG scores correlate with lower average error rates, while in the later stages, this trend reverses.
  • VoG is evaluated for its ability to identify memorized examples and its effectiveness as an Out-of-Distribution (OoD) detection tool. It is found to be discriminative in distinguishing between memorized and non-memorized examples. Additionally, when compared to other OoD detection methods, VoG outperforms most, highlighting its efficacy and scalability for large datasets and complex network architectures like ResNet-50.
  • In conclusion, the paper emphasizes the value of VoG in ranking data by difficulty and its utility in identifying challenging examples for auditing. Its domain-agnostic nature and ability to work with training and test examples make it a versatile tool in the realm of machine learning interpretability and model auditing.
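VoG's core computation is simple enough to sketch directly. The snippet below assumes the per-pixel input gradients have already been captured at several training checkpoints; the paper additionally class-normalizes the scores, which is omitted here:

```python
import numpy as np

def vog_scores(grads):
    # grads: (K, N, P) input-space gradients for N examples' P pixels,
    # captured at K training checkpoints (assumed precomputed).
    mean_grad = grads.mean(axis=0)                  # (N, P) mean over checkpoints
    var = ((grads - mean_grad) ** 2).mean(axis=0)   # (N, P) per-pixel variance
    return var.mean(axis=1)                         # (N,) one score per example

K, P = 5, 4
rng = np.random.RandomState(0)
stable = np.ones((K, 1, P))     # gradients identical at every checkpoint
noisy = rng.randn(K, 1, P)      # gradients that fluctuate across checkpoints
scores = vog_scores(np.concatenate([stable, noisy], axis=1))
```

A stable gradient pattern across checkpoints yields a low (easy) score, while a fluctuating one yields a high (hard) score, which is what drives the difficulty ranking described above.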
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
  • In this paper, Xiong et al. from Meta AI Research propose EfficientSAM, a novel approach for efficient image segmentation using lightweight models.
  • The paper addresses the computational challenges of the Segment Anything Model (SAM), whose very large Transformer model was trained on the SA-1B dataset. The authors introduce EfficientSAM, a lightweight SAM model that delivers decent performance with significantly reduced complexity.
  • The core idea of EfficientSAM is leveraging masked image pretraining (SAMI), which learns to reconstruct features from the SAM image encoder, leading to effective visual representation learning.
  • EfficientSAM involves two key stages:
    1. SAMI-pretrained lightweight image encoders and a mask decoder are used to build EfficientSAMs.
    2. The models are fine-tuned on the SA-1B dataset for the segment anything task.
  • The following figure from the paper shows the proposed EfficientSAM contains two stages: SAMI pretraining (top) on ImageNet and SAM finetuning (bottom) on SA-1B.

  • The authors evaluate EfficientSAM on multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. SAMI consistently outperforms other masked image pretraining methods.
  • The paper reports significant improvements in zero-shot instance segmentation tasks, highlighting the advantage of SAMI-pretrained lightweight image encoders. For example, on the COCO/LVIS dataset, EfficientSAMs achieved approximately 4 AP improvement over other fast SAM models.
  • A key contribution of the paper is demonstrating that SAMI-pretrained backbones can generalize well to many tasks, including image classification, object detection, and semantic segmentation.
  • EfficientSAMs offer state-of-the-art quality-efficiency trade-offs, substantially reducing inference time and parameter size with minimal performance drops compared to the original SAM model.
  • The authors conclude that their masked image pretraining approach, SAMI, explores the potential of Vision Transformers under the guidance of SAM foundation models, improving masked image pretraining and enabling efficient segment anything tasks.
  • Project page; Code; Demo; Video.
Initializing Models with Larger Ones
  • This paper by Xu et al. from the University of Pennsylvania, UC Berkeley, MBZUAI, and Meta AI Research, introduces a groundbreaking weight initialization method for neural networks, known as ‘weight selection’. This method involves initializing smaller models (student models) by selectively borrowing weights from a pretrained, larger model (teacher model), which belongs to the same model family, such as Llama.
  • The weight selection process is detailed and straightforward. It includes three steps: (1) layer selection, where the first n layers of the teacher model are chosen corresponding to the number of layers in the student model; (2) component mapping, which aligns components like MLPs between the teacher and student models; (3) element selection, where the student’s components are initialized using their larger counterparts from the teacher. This selection can be uniform (selecting evenly spaced elements) or even random, as long as the same indices are chosen across all layers.
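The element-selection step can be sketched as below; `uniform_select` is a hypothetical helper (the authors' released code may differ in detail), and the paper stresses that the same indices must be chosen consistently across all layers:

```python
import numpy as np

def uniform_select(teacher_w, out_dim, in_dim):
    # Evenly spaced row/column indices into the larger teacher matrix;
    # the same indices should be reused for every layer of the student.
    rows = np.linspace(0, teacher_w.shape[0] - 1, out_dim).round().astype(int)
    cols = np.linspace(0, teacher_w.shape[1] - 1, in_dim).round().astype(int)
    return teacher_w[np.ix_(rows, cols)]

teacher = np.arange(64, dtype=float).reshape(8, 8)  # stand-in teacher weights
student = uniform_select(teacher, 4, 4)             # 4x4 student initialization
```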
  • The following figure from the paper shows the process of weight selection. To initialize a smaller variant of a pretrained model, they uniformly select parameters from the corresponding component of the pretrained model.

  • The effectiveness of this method was rigorously tested across various image classification datasets, including ImageNet-1K, CIFAR-10, and CIFAR-100. The results demonstrated marked improvements in accuracy and substantial reductions in training time compared to networks initialized with standard methods like Kaiming and Xavier initialization. The student models initialized with teacher model weights not only performed better but also reached the performance level of randomly initialized models in just a third of the training time.
  • The paper provides a comprehensive analysis, comparing weight selection with classic initialization methods and different layer selection strategies (first-N layer selection, uniform layer selection). It also examines the method’s compatibility with knowledge distillation.
  • This weight selection approach is especially advantageous in resource-constrained environments, where deploying large models is not feasible. It opens up new possibilities for efficiently leveraging large pretrained models to enhance smaller models’ training, suggesting potential for widespread application and future research in the field.
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
  • This paper by Jayasumana et al. from Google Research challenges the widely used Frechet Inception Distance (FID) for evaluating image generation models, highlighting its significant drawbacks: reliance on Inception’s features that poorly represent the rich content of modern text-to-image models, incorrect normality assumptions, and poor sample complexity. They propose an alternative metric, CMMD (based on CLIP embeddings and the maximum mean discrepancy distance), which does not assume any probability distribution of embeddings and is sample-efficient.
  • The authors critically analyze FID’s limitations through statistical tests and empirical evaluations, revealing that FID often contradicts human raters, fails to reflect iterative model improvements, and struggles with capturing complex image distortions. They introduce CMMD as a robust and reliable metric, demonstrating its superior alignment with human evaluation and effectiveness in capturing quality gradations in generated images.
  • CMMD leverages the rich image representations of CLIP embeddings, trained on a much larger and diverse dataset compared to Inception-v3, making it more suitable for evaluating the diverse content generated by modern image generation models. Unlike FID, CMMD is an unbiased estimator and does not suffer from dependency on sample size, showcasing better performance in various scenarios, including progressive image generation models and image distortions.
  • The authors conduct extensive experiments to compare FID and CMMD across different settings, including progressive image generation models, image distortions, sample efficiency, and computational cost. Their findings advocate for a shift from FID to CMMD for evaluating image generation models, highlighting CMMD’s advantages in consistency, robustness, and alignment with human perception of image quality.
  • The following table from the paper shows a comparison of options for comparing two image distributions. FID, the current de facto standard for text-to-image evaluation is in the upper-left corner. The proposed metric, CMMD, is in the lower-right corner and has many desirable properties over FID.

  • The study’s findings are supported by a detailed examination of the FID metric’s assumptions and performance, as well as the introduction and validation of the CMMD metric through theoretical analysis, empirical demonstrations, and human evaluations. This comprehensive analysis underscores the need for the image generation research community to reevaluate the use of FID and consider adopting CMMD for more reliable and meaningful evaluations.
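The maximum mean discrepancy underlying CMMD can be sketched as below, with random vectors standing in for CLIP embeddings and an arbitrary kernel bandwidth (the paper fixes its own kernel settings):

```python
import numpy as np

def mmd2(x, y, sigma=10.0):
    """Unbiased-style squared MMD between two embedding sets with a
    Gaussian RBF kernel; no distributional assumption on the embeddings."""
    def k(a, b):  # Gaussian RBF kernel matrix
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    m, n = len(x), len(y)
    kxx = (k(x, x).sum() - m) / (m * (m - 1))  # drop the diagonal (k(a, a) = 1)
    kyy = (k(y, y).sum() - n) / (n * (n - 1))
    kxy = k(x, y).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.RandomState(0)
real = rng.randn(64, 8)           # stand-ins for CLIP embeddings of real images
same = rng.randn(64, 8)           # generated images from the same distribution
shifted = rng.randn(64, 8) + 3.0  # generated images from a shifted distribution
```

Because the estimator works directly on kernel averages, it makes no normality assumption about the embeddings, which is exactly the property the authors exploit over FID.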
  • Code
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
  • This paper by Dehghani et al. from Google DeepMind introduces NaViT (Native Resolution ViT), a vision transformer designed to process images of arbitrary resolutions and aspect ratios without resizing them to a fixed resolution, which is common but suboptimal.
  • NaViT leverages sequence packing during training, a technique inspired by natural language processing where multiple examples are packed into a single sequence, allowing efficient training on variable length inputs. This is termed Patch n’ Pack.
  • Architectural Changes: NaViT builds on the Vision Transformer (ViT) but introduces masked self-attention and masked pooling to prevent different examples from attending to each other. It also uses factorized and fractional positional embeddings to handle arbitrary resolutions and aspect ratios. These embeddings are decomposed into separate embeddings for x and y coordinates and summed together, allowing for easy extrapolation to unseen resolutions.
  • Training Enhancements: NaViT employs continuous token dropping, varying the token dropping rate per image, and resolution sampling, allowing mixed-resolution training by sampling from a distribution of image sizes while preserving aspect ratios. This enhances throughput and exposes the model to high-resolution images during training, yielding substantial performance improvements over equivalent ViTs.
  • Efficiency: NaViT demonstrates significant computational efficiency, processing five times more images during training than ViT within the same compute budget. The \(O(n^2)\) cost of attention, a concern when packing multiple images into longer sequences, diminishes with model scale, making the attention cost a smaller proportion of the overall computation.
  • The following figure from the paper shows how example packing enables variable resolution images with preserved aspect ratio, reducing training time, improving performance and increasing flexibility. It highlights the aspects of the data preprocessing and modelling that need to be modified to support Patch n’ Pack. The position-wise operations in the network, such as MLPs, residual connections, and layer normalisations, do not need to be altered.

  • Implementation: The authors implemented NaViT using a greedy packing approach to fix the final sequence lengths containing multiple examples. They addressed padding issues and example-level loss computation by modifying pooling heads to account for packing and using chunked contrastive loss to manage memory and time constraints.
  • Performance: NaViT consistently outperforms ViT across various tasks, including image and video classification, object detection, and semantic segmentation. It shows improved results on robustness and fairness benchmarks, achieving better performance with lower computational costs and providing flexibility in handling different resolutions during inference.
  • Evaluation: NaViT’s training and adaptation efficiency were evaluated through empirical studies on datasets like ImageNet, LVIS, WebLI, and ADE20k. The model demonstrated superior performance in terms of accuracy and computational efficiency, highlighting the benefits of preserving aspect ratios and using mixed-resolution training.
  • NaViT represents a significant departure from the traditional convolutional neural network (CNN)-designed pipelines, offering a promising direction for Vision Transformers by enabling flexible and efficient processing of images at their native resolutions.
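The packing idea can be sketched as a simple first-fit routine over per-image token counts; this is an illustrative simplification of NaViT's greedy packing that ignores the padding and masked attention the model also applies:

```python
def greedy_pack(seq_lens, max_len):
    """Pack variable-length patch sequences (one per image) into
    fixed-length token buffers, first-fit greedy."""
    packs = []                           # each pack: [remaining capacity, members]
    for idx, n in enumerate(seq_lens):
        assert n <= max_len, "every example must fit in one buffer"
        for pack in packs:
            if pack[0] >= n:             # first-fit: first pack with room
                pack[0] -= n
                pack[1].append(idx)
                break
        else:                            # no pack had room: open a new one
            packs.append([max_len - n, [idx]])
    return [members for _, members in packs]

# Images with 6, 3, 4, 2, and 5 patch tokens packed into length-8 buffers.
packs = greedy_pack([6, 3, 4, 2, 5], max_len=8)
```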



Long Short-Term Memory
  • Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.
  • This paper by Hochreiter and Schmidhuber in Neural Computation 1997 briefly reviews Hochreiter’s (1991) analysis of this problem, then addresses it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.
  • LSTM is local in space and time; its computational complexity per time step and weight is \(O(1)\). Their experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations.
  • In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
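A minimal single-step LSTM cell can be sketched as below, written in the modern formulation (the forget gate was a later addition to the 1997 architecture) with illustrative shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    # Multiplicative gates regulate access to the cell state c,
    # the "constant error carousel" that preserves error flow.
    z = np.concatenate([x, h]) @ W + b             # all gate pre-activations
    i, f, o, g = np.split(z, 4)                    # input, forget, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated update of the cell state
    h = sigmoid(o) * np.tanh(c)                    # exposed hidden state
    return h, c

D, H = 2, 3
rng = np.random.RandomState(0)
W = rng.randn(D + H, 4 * H) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):                                 # run a short input sequence
    h, c = lstm_step(rng.randn(D), h, c, W, b)
```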


A Neural Probabilistic Language Model
  • This paper by Bengio et al. from the Université de Montréal in 2003 revolutionized statistical language modeling by replacing “tables of conditional probabilities” (n-gram language models) with more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables.
  • The traditional technique of learning the joint probability function of sequences of words in a language was intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
  • They propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential/combinatorial number of semantically neighboring sentences, which forms the main reason for the spectacular improvements the proposed approach offers. The model learns simultaneously (i) a distributed representation for each word along with (ii) the probability function for word sequences, expressed in terms of these representations.
  • Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.
  • They report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
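The model's forward pass can be sketched as below; the paper's optional direct word-to-output connections are omitted and all shapes are illustrative:

```python
import numpy as np

def nplm_probs(context_ids, C, H, U, d, b):
    # Concatenate the context words' embeddings, apply a tanh hidden
    # layer, then a softmax over the vocabulary.
    x = C[context_ids].reshape(-1)          # ((n-1) * m,) concatenated embeddings
    hidden = np.tanh(x @ H + d)
    logits = hidden @ U + b
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

V, m, n_ctx, nh = 10, 4, 3, 8               # vocab, embedding dim, context, hidden
rng = np.random.RandomState(0)
C = rng.randn(V, m) * 0.1                   # shared word-embedding matrix
H = rng.randn(n_ctx * m, nh) * 0.1
U = rng.randn(nh, V) * 0.1
d, b = np.zeros(nh), np.zeros(V)
p = nplm_probs([1, 5, 7], C, H, U, d, b)    # P(next word | 3 context words)
```

The embedding matrix `C` and the prediction weights are learned jointly, which is the simultaneous learning of (i) and (ii) described above.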


ROUGE: A Package for Automatic Evaluation of Summaries
  • This paper by Lin in ACL 2004 introduces Recall-Oriented Understudy for Gisting Evaluation (ROUGE).
  • It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S included in the ROUGE summarization evaluation package and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
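For a single reference, ROUGE-N reduces to a clipped n-gram recall, sketched below; the released package additionally handles multiple references, stemming, and the other ROUGE variants:

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref = ngrams(reference.split())
    cand = ngrams(candidate.split())
    overlap = sum(min(count, cand[g]) for g, count in ref.items())  # clipped counts
    total = sum(ref.values())
    return overlap / total if total else 0.0  # recall over the reference's n-grams

score = rouge_n("the cat sat on the mat", "the cat was on the mat", n=2)
```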


METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
  • This paper by Banerjee and Lavie in ACL 2005 introduces METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.
  • Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies.
  • Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
  • They evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality.
  • They compute the Pearson \(R\) correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets.
  • They perform segment-by-segment correlation, and show that METEOR gets an \(R\) correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement over simply using unigram-precision, unigram-recall, and their harmonic F1 combination. They also perform experiments to show the relative contributions of the various mapping modules.
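A stripped-down METEOR restricted to exact unigram matches (no stemming or synonymy modules) can be sketched as below; the constants follow the paper's recall-weighted harmonic mean and fragmentation penalty:

```python
def meteor_lite(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    used = [False] * len(ref)
    matches = []                         # (candidate index, reference index)
    for i, w in enumerate(cand):
        for j, rw in enumerate(ref):
            if not used[j] and rw == w:  # greedy exact unigram matching
                used[j] = True
                matches.append((i, j))
                break
    m = len(matches)
    if m == 0:
        return 0.0
    p, r = m / len(cand), m / len(ref)
    fmean = 10 * p * r / (r + 9 * p)     # recall-weighted harmonic mean
    chunks = 1                           # runs contiguous in both strings
    for (i0, j0), (i1, j1) in zip(matches, matches[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3    # fragmentation penalty
    return fmean * (1 - penalty)

score = meteor_lite("the cat sat on the mat", "the cat sat on the mat")
```

An in-order match scores near 1, while a scrambled candidate with the same unigrams is penalized through the chunk count, capturing how well-ordered the matches are.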


Recurrent neural network based language model
  • This paper by Mikolov et al. from Khudanpur’s lab at JHU in Interspeech 2010, was the first to propose using a recurrent neural network-based language model (RNN LM) with applications to speech recognition.
  • The results indicate that it is possible to obtain around a 50% reduction in perplexity (PPL) by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around an 18% reduction in word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, around 5% on the much harder NIST RT05 task, and 12% even when the backoff model is trained on 5 times more data than the RNN model. For NIST RT05, they conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models trained on hundreds of times more data.
  • They provide ample empirical evidence that connectionist language models are superior to standard n-gram techniques, aside from their high computational (training) complexity. Recurrent neural networks significantly outperformed state-of-the-art backoff models in all of the experiments, most notably even when the backoff models were trained on much more data than the RNN LMs.
  • The paper seeks to break the myth that language modeling is just about counting n-grams, and that the only reasonable way to improve results is by acquiring more training data.


Generating Text with Recurrent Neural Networks
  • Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems.
  • This paper by Sutskever et al. from UofT in ICML 2011 demonstrates the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so they introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next.
  • Having applied a modestly-sized standard RNN architecture to the character-level language modeling problem (where the target output at each time step is defined as the input character at the next time step), they found the performance somewhat unsatisfactory, and that while increasing the dimensionality of the hidden state did help, the per-parameter gain in test performance was not sufficient to allow the method to be both practical and competitive with state-of-the-art approaches. They address this problem by proposing a new temporal architecture called the Multiplicative RNN (MRNN) which they argue is better suited to the language modeling task.
  • Modeling language at the character level seems unnecessarily difficult, because morphemes are the appropriate units for making semantic and syntactic predictions, yet converting large databases into sequences of morphemes is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. Their experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., “cryptoliation”, “homosomalist”, or “un-ameliary”). This is a desirable property which enabled the MRNN to gracefully deal with real words that it nonetheless didn’t see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words, and this is so advantageous that some word-level language models actually make up binary “spellings” of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).
  • MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If much bigger MRNNs could be trained with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.
  • After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, they were able to surpass the performance of the best previous single method for character level language modeling – a hierarchical nonparametric sequence model. At this point, this represents the largest recurrent neural network application to date.
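The multiplicative connection can be sketched as below: the current character's one-hot vector selects per-step factors, so the effective hidden-to-hidden transition matrix changes with each input. Shapes are illustrative, not from the paper:

```python
import numpy as np

def mrnn_step(x, h, Whx, Wfx, Whf, Wfh):
    # The factored form Whf @ diag(Wfx @ x) @ Wfh gives an
    # input-dependent transition matrix without a full 3-way tensor.
    f = (Wfx @ x) * (Wfh @ h)        # input-dependent factors
    return np.tanh(Whx @ x + Whf @ f)

V, H, F = 4, 5, 6                     # vocab (one-hot chars), hidden, factors
rng = np.random.RandomState(0)
Whx, Wfx = rng.randn(H, V) * 0.1, rng.randn(F, V) * 0.1
Whf, Wfh = rng.randn(H, F) * 0.1, rng.randn(F, H) * 0.1
h = np.zeros(H)
for ch in [0, 2, 1, 3]:               # a short character sequence
    h = mrnn_step(np.eye(V)[ch], h, Whx, Wfx, Whf, Wfh)
```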


Efficient Estimation of Word Representations in Vector Space
  • “You shall know a word by the company it keeps” — J. R. Firth.
  • This paper by Mikolov et al. from Google in 2013 proposes word2vec, which comprises two novel model architectures for computing continuous vector representations of words from very large data sets. They studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks involving word similarity, and the results are compared to the previously best performing techniques based on different types of neural networks. They observe large improvements in accuracy at much lower computational cost, i.e., it takes less than a day to learn high quality word vectors from a 1.6 billion word data set.
  • Based on a two-layer MLP neural network (i.e., one hidden layer and output layer), they propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The Continuous Bag-of-Words (CBOW) model architecture predicts the current word based on the context, while the skip-gram model predicts surrounding/context words given the current word.
  • They observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.
  • Furthermore, they show that these vectors provide state-of-the-art performance on their test set for measuring syntactic and semantic word similarities.
  • Word2vec popularized the “King – Man + Woman = Queen” analogy.
  • Overall, two important learnings from Word2Vec were:
    • Embeddings of semantically similar words are close in cosine similarity.
    • Word embeddings support intuitive arithmetic properties. (An important consequence of this statement is that phrase embeddings can be obtained as the sum of word embeddings.)
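The analogy arithmetic can be sketched with toy embeddings in which a "royalty" and a "gender" direction are made explicit (real word2vec vectors are learned from large corpora):

```python
import numpy as np

def analogy(emb, vocab, a, b, c):
    # Solve a : b :: c : ? by nearest cosine neighbor to v_b - v_a + v_c.
    v = emb[vocab[b]] - emb[vocab[a]] + emb[vocab[c]]
    v /= np.linalg.norm(v)
    best, best_sim = None, -np.inf
    for w, i in vocab.items():
        if w in (a, b, c):
            continue                       # exclude the query words
        sim = float(emb[i] @ v / np.linalg.norm(emb[i]))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, "apple": 4}
emb = np.array([[1.0, 1.0],    # king  = royal + male
                [1.0, -1.0],   # queen = royal + female
                [0.0, 1.0],    # man   = male
                [0.0, -1.0],   # woman = female
                [0.7, 0.0]])   # apple = distractor
word = analogy(emb, vocab, "man", "king", "woman")  # man : king :: woman : ?
```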
Distributed Representations of Words and Phrases and their Compositionality
  • This paper by Mikolov et al. from Google in NeurIPS 2013 builds on their earlier paper Efficient Estimation of Word Representations in Vector Space which proposed the Skip-gram model as an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. They present several extensions that improve both the quality of the vectors and the training speed.
  • They describe a simple alternative to the hierarchical softmax called negative sampling, packaged as Skip-gram with Negative Sampling (SGNS). Negative sampling is an extremely simple training method that learns accurate representations, especially for frequent words. Furthermore, they propose subsampling of frequent words, which is shown to yield both faster training and significantly better representations of uncommon words.
  • An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, they present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
  • The techniques introduced in this paper can be used also for training the continuous bag-of-words model introduced in Efficient Estimation of Word Representations in Vector Space.
  • Owing to the training optimizations and the computationally efficient model architecture proposed in this paper, they successfully trained models on several orders of magnitude more data than previously published models. This results in a great improvement in the quality of the learned word and phrase representations, especially for the rare entities.
  • The choice of the training algorithm and the hyper-parameter selection is a task specific decision, as different problems have different optimal hyperparameter configurations. In their experiments, the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.
  • A very interesting result of this work is that the word vectors can be somewhat meaningfully combined using just simple vector addition.
  • Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. The combination of these two approaches gives a powerful yet simple way to represent longer pieces of text, while having minimal computational complexity. Their work can thus be seen as complementary to the existing approaches that attempt to represent phrases using recursive matrix-vector operations.
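The negative-sampling objective and the subsampling rule can be sketched as below, with random vectors standing in for learned embeddings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss(v_center, u_context, u_negatives):
    # One (center, context) pair: pull the true pair together and push
    # k sampled negatives apart, replacing the full softmax with
    # k + 1 logistic terms.
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.log(sigmoid(-u_negatives @ v_center)).sum()
    return -(pos + neg)

def keep_prob(freq, t=1e-5):
    # Subsampling of frequent words: the paper discards word w with
    # probability 1 - sqrt(t / f(w)), so P(keep) = sqrt(t / f(w)).
    return min(1.0, np.sqrt(t / freq))

rng = np.random.RandomState(0)
v = rng.randn(8) * 0.1           # center-word vector
u_pos = v + rng.randn(8) * 0.01  # a context vector roughly aligned with v
u_neg = rng.randn(5, 8) * 0.1    # k = 5 sampled negative context vectors
loss = sgns_loss(v, u_pos, u_neg)
```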


On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
  • This paper by Cho et al. from Bengio’s lab at the Université de Montréal in 2014 first introduced Gated Recurrent Units (GRUs).
  • Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks in which models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation.
  • The paper focuses on analyzing the properties of neural machine translation using two types of neural networks that are able to process variable-length sequences (and differ in the choice of the encoder): (i) a recurrent neural network with gated hidden units, and (ii) the newly proposed gated recursive convolutional neural network. They show that neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase.
  • Furthermore, they find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
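A minimal single-step GRU cell can be sketched as below, following the paper's convention of interpolating between the previous state and a candidate state (biases omitted, shapes illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Wr, Wh):
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                # update gate
    r = sigmoid(Wr @ xh)                                # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))  # candidate state
    return z * h + (1 - z) * h_tilde                    # gated interpolation

H, D = 3, 2
rng = np.random.RandomState(0)
Wz, Wr, Wh = (rng.randn(H, D + H) * 0.1 for _ in range(3))
h = np.zeros(H)
for _ in range(4):                                      # run a short sequence
    h = gru_step(rng.randn(D), h, Wz, Wr, Wh)
```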
GloVe: Global Vectors for Word Representation
  • Word2vec relies only on local information of language. That is, the semantics learnt for a given word, is only affected by the surrounding words.
  • This paper by Pennington et al. from Stanford in EMNLP 2014 proposed Global Vectors (GloVe), an unsupervised learning algorithm which captures both global statistics and local statistics of a corpus, in order to train word vectors. Training is performed on aggregated global word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
  • Contemporary methods focused considerable attention on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. Currently, prediction-based models garner substantial support; for example, Baroni et al. (2014) argue that these models perform better across a range of tasks. They argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous.
  • After Tomas Mikolov et al. released word2vec, there was a boom of papers about word vector representations. GloVe was one such proposal, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurrence matrices. Note that GloVe does not use neural networks while word2vec does.
  • They construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
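GloVe's weighted least-squares objective can be sketched over a toy co-occurrence matrix; training would minimize this by gradient descent on the vectors and biases:

```python
import numpy as np

def glove_loss(W, Wc, b, bc, X, x_max=100.0, alpha=0.75):
    # For every nonzero count X_ij, fit w_i . w~_j + b_i + b~_j to
    # log X_ij, weighted by f(X_ij) = min(1, (X_ij / x_max) ** alpha).
    total = 0.0
    for i, j in zip(*np.nonzero(X)):              # only observed co-occurrences
        f = min(1.0, (X[i, j] / x_max) ** alpha)  # weighting function
        err = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
        total += f * err ** 2
    return total

V, d = 4, 3
rng = np.random.RandomState(0)
W, Wc = rng.randn(V, d) * 0.1, rng.randn(V, d) * 0.1  # word and context vectors
b, bc = np.zeros(V), np.zeros(V)                      # word and context biases
X = np.array([[0, 5, 2, 0],
              [5, 0, 1, 3],
              [2, 1, 0, 0],
              [0, 3, 0, 0]], dtype=float)             # toy co-occurrence counts
loss = glove_loss(W, Wc, b, bc, X)
```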
Sequence to Sequence Learning with Neural Networks
  • This paper by Sutskever et al. from Google in 2014 introduced seq2seq encoder-decoder learning to map sequences to sequences, a task that simple Deep Neural Networks (DNNs) cannot be used to accomplish.
  • They present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Their method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. They show that a large deep LSTM with a limited vocabulary can outperform a standard statistical machine translation (SMT)-based system whose vocabulary is unlimited on a large-scale MT task. The success of their simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • Their main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When they used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice.
  • They also find that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
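  • The source-reversal trick is a one-line preprocessing step. A minimal sketch (the token-list pair format is illustrative):

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence, leaving targets untouched,
    mirroring the trick that introduces short-term dependencies between early
    source words and early target words."""
    return [(src[::-1], tgt) for src, tgt in pairs]
```

  For example, the pair (["je", "suis", "la"], ["i", "am", "here"]) becomes (["la", "suis", "je"], ["i", "am", "here"]), so the first target word is now close to its corresponding source word.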
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  • This paper by Cho et al. from Bengio’s lab in EMNLP 2014 introduced the seq2seq encoder-decoder model for neural machine translation. They propose a novel neural network model called RNN Encoder–Decoder, consisting of two recurrent neural networks (RNNs) that together learn the mapping from a sequence of arbitrary length to another sequence, possibly from a different set, of arbitrary length. The encoder RNN encodes a sequence of symbols into a fixed-length vector representation, and the decoder decodes that representation into another sequence of symbols.
  • The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence.
  • The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
  • Along with the new architecture, they propose a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence.
  • They evaluated the proposed model with the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, they were able to show that the new model is able to capture linguistic regularities in the phrase pairs well and also that the RNN Encoder–Decoder is able to propose well-formed target phrases.
  • The scores by the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. Also, they found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that performance can be further improved by using, for instance, the RNN Encoder–Decoder and the neural net language model together.
  • Qualitative analysis shows that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases at multiple levels, i.e., at the word level as well as the phrase level. This suggests that more natural-language applications may benefit from the proposed RNN Encoder–Decoder.


Neural Machine Translation by Jointly Learning to Align and Translate
  • This paper by Bahdanau et al. from Bengio’s lab in ICLR 2015 borrowed the attention mechanism from the field of information retrieval and introduced it within the context of NLP (commonly called Bahdanau attention or additive attention in the field).
  • This paper introduces an attention mechanism for recurrent neural networks (RNN) to improve long-range sequence modeling capabilities. This allows RNNs to translate longer sentences more accurately, which served as the motivation behind developing the original transformer architecture later.
  • The following diagram from the paper illustrates the proposed model trying to generate the \(t^{th}\) target word \(y_t\) given a source sentence \((x_1, x_2, \ldots, x_T)\).

  • Referring to the figure above, the architecture consists of a bidirectional RNN as an encoder and a decoder that emulates searching through a source sentence during decoding a translation.
  • Decoder:
    • In prior encoder-decoder approaches, the decoder defines a probability over the translation \(y\) by decomposing the joint probability into the ordered conditionals: \(p(\mathbf{y})=\prod_{t=1}^T p\left(y_t \mid\left\{y_1, \cdots, y_{t-1}\right\}, c\right)\)

      • where \(\mathbf{y}=\left(y_1, \cdots, y_{T_y}\right)\). With an RNN, each conditional probability is modeled as, \(p\left(y_t \mid\left\{y_1, \cdots, y_{t-1}\right\}, c\right)=g\left(y_{t-1}, s_t, c\right)\)
    • On the other hand, in the proposed model architecture, they define each conditional probability over the output translation \(y\) by decomposing the joint probability into the ordered conditionals as, \(p\left(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}\right)=g\left(y_{i-1}, s_i, c_i\right)\)

      • where \(s_i\) is an RNN hidden state for time \(i\), computed by \(s_i=f\left(s_{i-1}, y_{i-1}, c_i\right) \text {. }\)
    • It should be noted that unlike the prior encoder-decoder approaches, here the probability is conditioned on a distinct context vector \(c_i\) for each target word \(y_i\). The context vector \(c_i\) depends on a sequence of annotations \(\left(h_1, \cdots, h_{T_x}\right)\) to which an encoder maps the input sentence. Each annotation \(h_i\) contains information about the whole input sequence with a strong focus on the parts surrounding the \(i^{th}\) word of the input sequence. More information on obtaining annotations is given in the Encoder section below.
    • The context vector \(c_i\) is, then, computed as a weighted sum of these annotations \(h_i\): \(c_i=\sum_{j=1}^{T_x} \alpha_{i j} h_j\)

    • The weight \(\alpha_{i j}\) of each annotation \(h_j\) is computed by, \(\alpha_{i j}=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{T_x} \exp \left(e_{i k}\right)},\)

      • where, \(e_{i j}=a\left(s_{i-1}, h_j\right)\)

      • is an alignment model which scores how well the inputs around position \(j\) and the output at position \(i\) match. The score is based on the RNN hidden state \(s_{i-1}\) and the \(j^{th}\) annotation \(h_j\) of the input sentence.

  • Encoder:
    • The prior RNN architecture reads an input sequence \(\mathbf{x}\) in order starting from the first symbol \(x_1\) to the last one \(x_{T_x}\). However, in the proposed scheme, the annotation of each word should summarize not only the preceding words, but also the following words. Hence, they propose to use a bidirectional RNN, which has been successfully used recently in speech recognition.
    • A BiRNN consists of forward and backward RNNs. The forward RNN \(\vec{f}\) reads the input sequence as it is ordered (from \(x_1\) to \(x_{T_x}\)) and calculates a sequence of forward hidden states \(\left(\vec{h}_1, \cdots, \vec{h}_{T_x}\right)\). The backward RNN \(\overleftarrow{f}\) reads the sequence in the reverse order (from \(x_{T_x}\) to \(x_1\)), resulting in a sequence of backward hidden states \(\left(\overleftarrow{h}_1, \cdots, \overleftarrow{h}_{T_x}\right)\).
    • They obtain an annotation for each word \(x_j\) by concatenating the forward hidden state \(\vec{h}_j\) and the backward one \(\overleftarrow{h}_j\), i.e., \(h_j=\left[\vec{h}_j^{\top} ; \overleftarrow{h}_j^{\top}\right]^{\top}\). In this way, the annotation \(h_j\) contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation \(h_j\) will be focused on the words around \(x_j\). This sequence of annotations is used by the decoder and the alignment model later to compute the context vector per the equations in the Decoder section above.
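  • The alignment and context-vector equations above can be sketched directly in NumPy. This is a minimal sketch of additive (Bahdanau) attention; the parameter names \(W_a\), \(U_a\), \(v_a\) for the alignment MLP are illustrative.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Bahdanau-style alignment: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j),
    softmax over source positions to get weights alpha_ij, then the context
    vector c_i = sum_j alpha_ij h_j. H holds the annotations h_j row-wise."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = softmax(e)
    c = alpha @ H          # weighted sum of annotations
    return c, alpha
```

  Because \(\alpha_{ij}\) is recomputed for every decoder step \(i\), each target word gets its own context vector, which is exactly what distinguishes this model from the fixed-vector encoder-decoder.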
Effective Approaches to Attention-based Neural Machine Translation
  • Neural Machine Translation by Jointly Learning to Align and Translate proposed an attention mechanism to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
  • This paper by Luong et al. in EMNLP 2015 from Manning’s group at Stanford explores useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.
  • They demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, they achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout.
  • Their ensemble model using different attention architectures has established a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
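  • Luong et al.'s simplest scoring variant for global attention is a plain dot product between the current decoder state and each encoder state. A hedged NumPy sketch (the paper also proposes "general" and "concat" scores, omitted here):

```python
import numpy as np

def luong_global_attention(h_t, H_s):
    """Global attention with the dot score: score(h_t, h_s) = h_t^T h_s over all
    source hidden states (rows of H_s), softmax to get alignment weights, then
    a context vector as their weighted average."""
    scores = H_s @ h_t
    e = np.exp(scores - scores.max())
    a = e / e.sum()
    return a @ H_s, a
```

  The local variant restricts the softmax to a window of source positions around a predicted alignment point instead of attending to all source words.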
Skip-Thought Vectors
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Kiros et al. from University of Toronto, Canadian Institute for Advanced Research and Massachusetts Institute of Technology in NeurIPS 2015 proposes skip-thoughts, a model for learning sentence representations. The vector sentence representations generated are named as skip-thought vectors.
  • The method is shown in the figure below from the paper. Given a tuple composed of three consecutive sentences, skip-thoughts will learn to predict the first sentence and the third sentence in the tuple (“I got back home” and “this was strange” in the example of the figure) based on the second sentence of the tuple (“I could see the cat on the steps” in the example of the figure). Specifically, in skip-thoughts,
    • There are three models (one encoder and two decoders), and each color represents a separate model in the figure.
    • The encoder, which is a Recurrent Neural Network with Gated Recurrent Unit (Cho et al., 2014) activations, is used to encode the second sentence of the tuple.
    • The two decoders, which are Recurrent Neural Networks with conditional Gated Recurrent Units (conditioned on the encoder output, represented by the unattached arrows in the figure), predict the previous and next sentences, respectively.

  • Empirically, the authors validated the capability of skip-thought vectors on tasks such as semantic relatedness, paraphrase detection, image annotation, image search, and classification. The strong performance across these tasks demonstrates the robustness of the learned representations.


Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
  • Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential.
  • This paper by Wu et al. from Google in 2016 presents GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Their model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections.
  • To improve parallelism and therefore decrease training time, their attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, they employ low-precision arithmetic during inference computations.
  • To improve handling of rare words, they divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system.
  • Their beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence.
  • On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
Neural Machine Translation of Rare Words with Subword Units
  • Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary.
  • This paper by Sennrich et al. from the University of Edinburgh in ACL 2016 introduces a simpler and more effective approach based on Byte Pair Encoding (BPE), making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).
  • They discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
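  • The BPE learning procedure is short enough to sketch directly; this mirrors the greedy merge algorithm Sennrich et al. describe (toy vocabulary format: space-separated symbols with an end-of-word marker `</w>`).

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the chosen symbol pair into a single symbol everywhere it occurs."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return vocab, merges
```

  The learned merge list, applied in order, then segments any new word into subword units, so there are no out-of-vocabulary tokens at translation time.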
HyperNetworks
  • This paper by Ha et al. from Google Brain introduces an innovative approach where a smaller network, termed a “hypernetwork,” generates the weights for a larger network, referred to as the “main network.” This concept draws inspiration from evolutionary computing methods and aims to manage the large search spaces involved in weight parameters of neural networks. The hypernetwork approach is designed to be efficient and scalable, trained end-to-end with backpropagation.
  • Hypernetworks are a form of abstraction similar to the genotype-phenotype relationship in nature. They can be viewed as a relaxed form of weight-sharing across layers in neural networks, striking a balance between the flexibility of convolutional networks (which typically do not have weight-sharing) and the rigidity of recurrent networks (which do).
  • The following figure from the paper illustrates how a hypernetwork generates the weights for a feedforward network. Black connections and parameters are associated with the main network, whereas orange connections and parameters are associated with the hypernetwork.

  • For convolutional networks, hypernetworks generate weights for each convolutional layer. This method was shown to be effective for image recognition tasks with fewer learnable parameters, achieving respectable results on datasets like CIFAR-10.
  • For recurrent networks, such as LSTMs, hypernetworks can dynamically generate weights that vary across many timesteps. This approach has been demonstrated to be effective for a variety of sequence modeling tasks, including language modeling and handwriting generation, achieving near state-of-the-art results on datasets like Character Penn Treebank and Hutter Prize Wikipedia.
  • Hypernetworks have been shown to generate non-shared weights for LSTMs, outperforming standard LSTM versions in certain tasks. They are also beneficial in reducing the number of learnable parameters while maintaining competitive performance in tasks like image recognition and language modeling.
  • The paper reports experiments in different domains:
    • For image recognition, hypernetworks demonstrated effectiveness in generating filters for convolutional networks, tested on MNIST and CIFAR-10 datasets.
    • In language modeling, hypernetworks were applied to character-level prediction tasks on the Penn Treebank corpus and the Hutter Prize Wikipedia dataset (enwik8), showing competitive results.
    • For handwriting generation, the hypernetwork model was trained on the IAM handwriting database, outperforming several configurations of LSTM models.
    • In the neural machine translation task, HyperLSTM cells replaced LSTM cells in a wordpiece model architecture, improving performance on the WMT’14 English to French dataset.
  • The method presented in the paper is efficient, scalable, and works well with fewer parameters. Hypernetworks proved competitive with, and sometimes superior to, state-of-the-art models in various applications like image recognition, language modeling, and handwriting generation.


Attention Is All You Need
  • This paper by Vaswani et al. from Google in NeurIPS 2017 introduced Transformers (that are based on scaled dot-product multi-headed attention) which are prevalent in most NLP and CV areas today.
  • Please refer to the Transformer primer for a detailed discourse on Transformers.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  • The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. Also, static neural network architectures apply the same function to every example. In contrast, input dependent models attempt to tailor the function to each example. While it is straightforward for a human to manually specify a single static architecture, it is infeasible to specify every input-dependent function by hand. Instead, the input-dependent function must be automatically inferred by the model, which introduces an extra level of complexity in optimization.
  • Given the need to automatically infer architectures for each example, a natural solution is to define a single large model (supernetwork) with numerous subnetworks (experts), and route examples through a path in the supernetwork. The figure below from Ramachandran and Le (2019) visualizes an example of a routing network. Intuitively, similar examples can be routed through similar paths and dissimilar examples can be routed through different paths. The example-dependent routing also encourages expert specialization, in which experts devote their representational capacity to transforming a chosen subset of examples.

  • Learning to route examples to well-matched experts is critical for good performance. Effective routing can be achieved by training another small neural network (router) that learns to route examples through the supernetwork. The router takes the example as input and outputs the next expert to use. The router can take advantage of the intermediate representations of the example produced in the supernetwork.
  • This paper by Shazeer et al. in ICLR 2017 addresses these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
  • They introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. In this per-example routing setup, different examples are processed by different subcomponents, or experts, inside a larger model, a.k.a. a supernetwork.
  • Specifically, the proposed MoE layer takes as an input a token representation \(x\) and then routes this to the best determined top-\(k\) experts, selected from a set \(\left\{E_i(x)\right\}_{i=1}^N\) of \(N\) experts. The router variable \(W_r\) produces logits \(h(x)=W_r \cdot x\) which are normalized via a softmax distribution over the available \(N\) experts at that layer. The gate-value for expert \(i\) is given by,
\[p_i(x)=\frac{e^{h(x)_i}}{\sum_j^N e^{h(x)_j}}\]
  • The top-\(k\) gate values are selected for routing the token \(x\). If \(\mathcal{T}\) is the set of selected top-\(k\) indices then the output computation of the layer is the linearly weighted combination of each expert’s computation on the token by the gate value,
\[y=\sum_{i \in \mathcal{T}} p_i(x) E_i(x)\]
  • They apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. They present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
  • The following diagram from the paper illustrates a Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.
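  • The gate-value and top-\(k\) combination equations above can be sketched in a few lines of NumPy. Note this follows the simplified description in the bullets (softmax over all \(N\) experts, then keep the top-\(k\) terms); the paper's full gating additionally injects tunable noise before the top-\(k\) selection and adds load-balancing losses.

```python
import numpy as np

def moe_layer(x, W_r, experts, k=2):
    """Sparse MoE sketch: router logits h(x) = W_r @ x, softmax p_i(x) over the
    N experts, then output y = sum over the top-k gate values of p_i(x) * E_i(x).
    `experts` is a list of callables E_i."""
    h = W_r @ x
    p = np.exp(h - h.max())
    p /= p.sum()
    topk = np.argsort(p)[-k:]          # indices of the k largest gate values
    return sum(p[i] * experts[i](x) for i in topk)
```

  Because only \(k \ll N\) experts run per token, parameter count scales with \(N\) while per-example compute scales with \(k\), which is the source of the >1000x capacity gains.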

Using the Output Embedding to Improve Language Models
  • This paper by Press and Wolf from Tel-Aviv University in EACL 2017 proposes the concept of weight tying, by studying the topmost weight matrix of neural network language models.
  • They show that this matrix constitutes a valid word embedding. When training language models, they recommend tying the input embedding and this output embedding.
  • They analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model.
  • They also offer a new method of regularizing the output embedding.
  • Their methods lead to a significant reduction in perplexity, as they are able to show on a variety of neural network language models. Finally, they show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
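  • The core idea is that the \(V \times d\) input embedding matrix doubles as the output projection. A minimal sketch (class and variable names are ours; a real model would have recurrent layers between lookup and projection):

```python
import numpy as np

class TiedSoftmaxLM:
    """Minimal weight-tying sketch: one embedding matrix E serves both as the
    input lookup table and (transposed) output projection, so the model stores
    a single V x d matrix instead of two."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(scale=0.1, size=(vocab_size, dim))

    def logits(self, token_id):
        h = self.E[token_id]   # input embedding stands in for the hidden state
        return self.E @ h      # tied output projection: logits over the vocab
```

  Since gradients from the softmax now flow into the same matrix as gradients from the input side, the tied embedding is updated on every step, which the paper argues makes it behave more like a (better-trained) output embedding.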
Enriching Word Vectors with Subword Information
  • Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words.
  • This paper by Bojanowski et al. from Facebook AI Research, published in TACL 2017, introduces fastText, a novel approach to word vector representation. FastText, an extension of the skipgram model from word2vec, represents each word as a bag of character n-grams, enriching the traditional word vector models by incorporating subword information. This method is particularly crucial for handling the morphology of words, which is a significant challenge in languages with large vocabularies and many rare words.
  • In fastText, a vector representation is associated with each character n-gram, and words are represented as the sum of these n-gram vectors. This approach is efficient and enables the model to quickly train on large corpora. Additionally, it allows for the computation of word representations for words not present in the training data, addressing a key limitation of models that assign a distinct vector to each word.
  • The implementation of fastText involves generating character n-grams for each word, ranging from 3 to 6 characters in length, and using special boundary symbols to differentiate between prefixes and suffixes. The words themselves are included in the n-gram set. A hashing function is employed to map these n-grams to integers, which helps in managing memory requirements. The model parameters are shared across threads and updated asynchronously during the optimization process, which uses stochastic gradient descent with a linear decay of the step size and negative log-likelihood as the objective function.
  • Extensive evaluations were carried out on nine languages using word similarity and analogy tasks. The results demonstrated that fastText’s word representations outperform traditional models, especially in morphologically rich languages. Its proficiency in handling out-of-vocabulary (OOV) words and robustness against dataset size variations were also highlighted.
  • Qualitative analyses reveal that the most significant n-grams in a word often correspond to morphemes, contributing to more accurate word representations. FastText’s capability in representing OOV words and its application in various language modeling tasks underscore its versatility and state-of-the-art performance in the field.
  • Overall, fastText presents a simple yet powerful enhancement to word vectors by integrating subword information, thereby advancing natural language processing, particularly for languages with complex morphology.
  • Code.
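  • The subword extraction described above is easy to sketch: wrap the word in boundary symbols `<` and `>`, take all character n-grams of length 3-6, and represent a word as the sum of its n-gram vectors. A minimal sketch (the hashing trick the real implementation uses for memory is omitted):

```python
import numpy as np

def char_ngrams(word, nmin=3, nmax=6):
    """fastText-style subwords: boundary-wrapped character n-grams plus the
    whole (wrapped) word itself. E.g. 'where' yields '<wh', 'whe', ..., '<where>'."""
    w = '<' + word + '>'
    grams = {w[i:i + n] for n in range(nmin, nmax + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

def word_vector(word, gram_vecs):
    """A word vector is the sum of its n-gram vectors; an out-of-vocabulary word
    still gets a vector from whichever of its n-grams are known."""
    vecs = [gram_vecs[g] for g in char_ngrams(word) if g in gram_vecs]
    return sum(vecs) if vecs else None
```

  The boundary symbols matter: they let the model distinguish the prefix trigram `<wh` of "where" from the internal trigram `whe`, which is how morphological structure gets captured.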


Deep contextualized word representations
  • This paper by Peters et al. from Allen AI and UW in NAACL 2018 introduced LSTM-based Embeddings from Language Models (ELMo), an approach for learning high-quality deep context-dependent/context-sensitive word representations/embeddings from biLMs.
  • These deep contextualized word representations model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
  • ELMo’s word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis. They also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
  • Through ablations and other controlled experiments, they have confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance, enabling ELMo to show large improvements on a broad range of NLP tasks.
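  • The task-specific combination of biLM layers is a softmax-weighted sum, \(\text{ELMo}_k = \gamma \sum_j s_j\, h_{k,j}\), where the scalar weights \(s_j\) and scale \(\gamma\) are learned per downstream task. A minimal NumPy sketch of just that combination step (the biLM itself is assumed given):

```python
import numpy as np

def elmo_embedding(layer_states, s_logits, gamma=1.0):
    """Collapse the biLM's per-layer representations for one token into a single
    vector: softmax-normalize the scalar layer weights, take the weighted sum,
    and scale by gamma."""
    s = np.exp(s_logits - np.max(s_logits))
    s /= s.sum()
    return gamma * sum(w * h for w, h in zip(s, layer_states))
```

  Exposing all layers this way, rather than just the top one, is what the paper's ablations show to be crucial: lower layers capture more syntactic signal, higher layers more semantic.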
Improving Language Understanding by Generative Pre-Training
  • Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
  • This paper by Radford et al. from OpenAI in 2018 introduces a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning and demonstrates large gains on the aforementioned NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • In contrast to previous approaches, they make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
  • By pre-training on a diverse corpus with long stretches of contiguous text, their model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets and thus outperforming discriminatively trained models that use architectures specifically crafted for each task. For instance, they achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
  • Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Their work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  • This paper by Kudo and Richardson in EMNLP 2018 describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows building a purely end-to-end and language-independent system.
  • They perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences.
  • They also compare the performance of subword training and segmentation with various configurations.
  • Code.
Self-Attention with Relative Position Representations
  • Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs.
  • This paper by Shaw et al. in NAACL 2018 presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements.
  • The figure below from the paper shows example edges representing relative positions, or the distance between elements. They learn representations for each relative position within a clipping distance \(k\). The figure assumes \(2 \le k \le n-4\). Note that not all edges are shown.

  • In the original Transformer paper, the authors hypothesized that, in contrast to learned absolute position representations, sinusoidal position encodings would help the model generalize to sequence lengths unseen during training by allowing it to also attend by relative position. This property is shared by their relative position representations which, in contrast to absolute position representations, are invariant to the total sequence length.
  • On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, they observe that combining relative and absolute position representations yields no further improvement in translation quality.
  • They describe an efficient implementation of their method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.
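  • The clipping scheme is simple to sketch: each pair \((i, j)\) indexes one of \(2k+1\) learned relative-position embeddings via \(\operatorname{clip}(j-i, -k, k)\), so all distances beyond \(\pm k\) share a single vector. The index computation:

```python
import numpy as np

def relative_position_index(length, k):
    """For a sequence of the given length, return the (length x length) table of
    embedding indices clip(j - i, -k, k) + k, shifted into the range [0, 2k]."""
    i = np.arange(length)[:, None]
    j = np.arange(length)[None, :]
    return np.clip(j - i, -k, k) + k
```

  These indices then gather rows from a learned \((2k+1) \times d\) embedding table that is added into the attention keys (and optionally values), which is what makes the representation independent of total sequence length.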
Blockwise Parallel Decoding for Deep Autoregressive Models
  • Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process.
  • This paper by Stern et al. from Google in NeurIPS 2018 seeks to overcome this limitation by proposing a novel blockwise parallel decoding scheme in which predictions for multiple time steps are made in parallel and then backed off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel.
  • They verify their approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, their fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.
  • The following figure from the paper shows the three substeps of blockwise parallel decoding. In the predict substep, the greedy model and two proposal models independently and in parallel predict “in”, “the”, and “bus”. In the verify substep, the greedy model scores each of the three independent predictions, conditioning on the previous independent predictions where applicable. When using a Transformer or convolutional sequence-to-sequence model, these three computations can be done in parallel. The highest-probability prediction for the third position is “car”, which differs from the independently predicted “bus”. In the accept substep, \(\hat{y}\) is hence extended to include only “in” and “the” before making the next \(k\) independent predictions.

  • The following figure from the paper illustrates the fact that combining the scoring and proposal models allows us to merge the previous verification substep with the next prediction substep. This makes it feasible to call the model just once per iteration rather than twice, halving the number of model invocations required for decoding.

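The predict/verify/accept loop can be sketched as follows. Here `score_next` and `propose_block` are hypothetical stand-ins for the base scoring model and the parallel proposal heads, and this greedy variant accepts the longest proposal prefix that matches the base model's own greedy choices:

```python
def blockwise_parallel_decode(score_next, propose_block, prefix, k, max_len, eos=None):
    """Greedy blockwise parallel decoding sketch (predict / verify / accept).
    score_next(seq)    -> the base model's greedy next token given seq.
    propose_block(seq) -> a list of k proposed future tokens given seq."""
    y = list(prefix)
    while len(y) < max_len:
        props = propose_block(y)                  # predict: k tokens in parallel
        accepted = 0
        for j in range(k):                        # verify: scoreable in parallel
            if score_next(y + props[:j]) == props[j]:
                accepted += 1
            else:
                break
        y += props[:accepted]                     # accept the validated prefix
        if accepted < k and len(y) < max_len:
            y.append(score_next(y))               # base model supplies the next token
        if eos is not None and y and y[-1] == eos:
            break
    return y
```

Each iteration extends the output by at least one token, and by up to \(k\) when all proposals verify, which is where the iteration reductions reported in the paper come from.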
Universal Language Model Fine-tuning for Text Classification
  • Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.
  • This paper by Howard and Ruder in ACL 2018 proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
  • The following figure from the paper shows that ULMFiT consists of three stages: a) The LM is trained on a general-domain corpus to capture general features of the language in different layers. b) The full LM is fine-tuned on target task data using discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific features. c) The classifier is fine-tuned on the target task using gradual unfreezing, ‘Discr’, and STLR to preserve low-level representations and adapt high-level ones (shaded: unfreezing stages; black: frozen).

  • ULMFiT significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets.
  • Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data.
  • Code.
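The slanted triangular learning rate schedule used in stages (b) and (c) can be sketched directly; the default `cut_frac` and `ratio` follow the values reported in the paper, while `lr_max` is an illustrative placeholder:

```python
import math

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate (STLR) from ULMFiT: a short linear
    warm-up over the first cut_frac of the T total steps, then a long linear
    decay back down to lr_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut                                   # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decreasing phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The short, steep warm-up lets the model quickly move to a suitable region of parameter space, and the long decay then refines the parameters without destroying the pretrained features.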
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
  • The paper by Vijayakumar et al. from Virginia Tech and Indiana University presents an alternative to the traditional Beam Search (BS) method, known as Diverse Beam Search (DBS). The paper focuses on enhancing the diversity of the solutions decoded from neural sequence models, addressing the issue that BS often returns sequences with minor variations and fails to capture the inherent ambiguity of complex AI tasks.
  • The paper introduces Diverse Beam Search (DBS), an algorithm that decodes a list of diverse outputs by optimizing a diversity-augmented objective. DBS divides the beam budget into groups and enforces diversity between these groups.
    • Comparing image captioning outputs decoded by BS and their method, Diverse Beam Search (DBS), the authors note that BS captions are near-duplicates with similar shared paths in the search tree and minor variations at the end. In contrast, DBS captions are significantly diverse and similar to the inter-human variability in describing images.
  • The following figure from the paper demonstrates that DBS finds better top-1 solutions compared to BS by controlling the exploration and exploitation of the search space. This implies that DBS is a superior search algorithm in terms of result diversity.

  • The authors also study the impact of the number of groups, the strength of diversity penalty, and various forms of diversity functions for language models. They explore various forms of the dissimilarity term used in DBS, such as Hamming Diversity, Cumulative Diversity, and n-gram Diversity, and their impact on model performance.
  • The paper provides empirical evidence through experiments on image captioning, machine translation, and visual question generation tasks. It uses both standard quantitative metrics and qualitative human studies to validate the effectiveness of DBS.
  • DBS shows significant improvements in diversity without compromising task-specific performance metrics. This is particularly evident in cases of complex images where diverse descriptions are more likely.
  • The paper discusses the role of diversity in image-grounded language generation tasks, highlighting that DBS consistently outperforms BS and previously proposed techniques for diverse decoding. DBS is shown to be robust over a wide range of parameter values and is general enough to incorporate various forms of the dissimilarity term.
  • Overall, the paper makes a significant contribution to the field of neural sequence modeling by proposing a novel approach to enhance the diversity of decoded solutions, demonstrating its efficacy across different applications and providing insights into the role of diversity in complex AI tasks.
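A minimal sketch of the idea, with one beam per group and Hamming diversity (the full algorithm keeps B/G beams per group and decodes groups with a one-step stagger; `logprob_fn` is a hypothetical stand-in for the model's next-token log-probabilities):

```python
import numpy as np

def diverse_group_decode(logprob_fn, vocab, n_groups, steps, lam=0.5):
    """Simplified Diverse Beam Search: groups decode in order, and each
    group's token scores at a step are penalized by how often earlier groups
    already chose that token at the same step (Hamming diversity)."""
    seqs = [[] for _ in range(n_groups)]
    for t in range(steps):
        chosen_counts = np.zeros(len(vocab))
        for g in range(n_groups):
            # diversity-augmented objective: model score minus repetition penalty
            scores = logprob_fn(seqs[g]) - lam * chosen_counts
            tok = int(np.argmax(scores))
            seqs[g].append(vocab[tok])
            chosen_counts[tok] += 1
    return seqs
```

With \(\lambda = 0\) this collapses to running the same greedy search in every group; increasing \(\lambda\) trades model score for inter-group diversity, mirroring the exploration/exploitation control discussed above.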
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
  • This paper by Bajaj et al. from Microsoft AI & Research introduces the MS MARCO dataset for machine reading comprehension (MRC) and open-domain question answering (QA).
  • The MS MARCO dataset is a large-scale, real-world reading comprehension dataset, consisting of over 1 million anonymized questions derived from Bing’s search query logs, each paired with a human-generated answer. Additionally, it includes 182,669 completely human-rewritten answers and 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing.
  • The dataset is designed to address shortcomings of existing MRC and QA datasets by using real user search queries, making it more representative of natural information-seeking behavior. This contrasts with other datasets where questions are often generated by crowd workers based on provided text spans or documents.
  • The dataset poses three distinct tasks with varying difficulty levels: predicting if a question is answerable given context passages and synthesizing an answer, generating a well-formed answer based on context passages, and ranking a set of retrieved passages given a question.
  • MS MARCO’s complexity and real-world relevance are intended to benchmark machine reading comprehension and question-answering models, especially in handling realistic, noisy, and problematic inputs.
  • The paper also discusses the unique features of the dataset, such as questions being real user queries issued to Bing, the presence of multiple or no answers for some questions, and the inclusion of a large set of passages for each question to mimic real-world information retrieval conditions.
  • It provides benchmarking results on the dataset, evaluating different machine learning models for their effectiveness in handling the dataset’s tasks. These results include assessments of generative and discriminative models using metrics like ROUGE-L and BLEU for answer quality.
  • In summary, the MS MARCO dataset represents a significant step forward in developing and benchmarking MRC and QA systems, offering a large-scale, realistic dataset derived from actual user queries and incorporating a variety of tasks to test different aspects of machine comprehension.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • This paper by Devlin et al. from Google in ACL 2019 proposed BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language representation model which proposed pre-training bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is pre-trained using two unsupervised tasks: (i) masked language modeling (MLM) and, (ii) next sentence prediction (NSP).
    • MLM is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.
    • NSP is needed because many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, they pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
  • Fine-tuning for the task at hand involves using an additional output layer, to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
  • BERT comes in two flavors: (i) BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; (ii) BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
  • BERT accepts a maximum of 512 input tokens. At its output, BERT Base produces word embeddings with 768 dimensions.
  • BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
  • BERT demonstrated that unsupervised pretraining is an integral part of many language understanding systems and enables even low-resource tasks to benefit from them.
  • Google Blog’s article that discusses using BERT for improving search relevance and ranking.
  • Also, here’s a brief timeline of NLP models from Bag of Words to the Transformer family from Fabio Chiusano:

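The MLM corruption step is easy to make concrete. A sketch, with the 15% selection rate and the 80/10/10 replacement split following the paper, and token handling simplified:

```python
import random

def mask_for_mlm(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """BERT-style MLM corruption: ~15% of positions are selected; of those,
    80% become [MASK], 10% a random token, and 10% are left unchanged.
    labels records the original token at each selected position (None elsewhere)."""
    rng = rng or random.Random(0)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token, but still predict it
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged (and randomizing another 10%) mitigates the pretrain/finetune mismatch, since `[MASK]` never appears at fine-tuning time.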
RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, while hyperparameter choices have significant impact on the final results.
  • This paper by Liu et al. from University of Washington and Facebook AI in 2019 carefully evaluates a number of design decisions when pretraining BERT models.
  • They present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. They find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. They find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
  • Their improved pretraining procedure, which they call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results highlight the importance of previously overlooked design choices, and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
  • Note that RoBERTa uses only the masked language model objective (and does not train using the next sentence prediction objective), and achieves better results than the original BERT.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  • This paper by Lewis et al. from Facebook AI in 2019 presented BART, a denoising autoencoder for pretraining sequence-to-sequence models that learns to map corrupted documents to the original. BART is trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
  • They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token.
  • Background: With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.

  • With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation. However words can only condition on leftward context, so it cannot learn bidirectional interactions.

  • BART applies noising schemes to an input document and thus corrupts it by replacing spans of text with mask symbols. In the diagram below, the corrupted document (left) is encoded with a bidirectional model, and then the likelihood of the original document (right) is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and they use representations from the final hidden state of the decoder. The advantage of using this scheme is that inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations.

  • BART is particularly effective when finetuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
  • BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining.
  • BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
  • This paper by Sanh et al. from Hugging Face, presented at the Energy Efficient Machine Learning and Cognitive Computing workshop at NeurIPS 2019, introduced DistilBERT, a general-purpose pre-trained version of BERT. DistilBERT is 40% smaller, 60% faster, cheaper to pre-train, and retains 97% of BERT's language understanding capabilities. DistilBERT can be fine-tuned with good performance on a wide range of tasks, much like its larger counterpart.
  • While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that DistilBERT is a compelling option for edge applications.
  • To leverage the inductive biases learned by larger models during pretraining, they introduce a triple loss combining language modeling, distillation and cosine-distance losses.
  • The following graph shows the parameter counts of several recently released pretrained language models:

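The triple loss can be sketched as below. The temperature and the loss weights here are illustrative placeholders (the paper tunes its own coefficients), and the logits and hidden states are stand-in arrays:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_triple_loss(s_logits, t_logits, labels, s_hid, t_hid,
                             T=2.0, w_ce=1.0, w_kd=1.0, w_cos=1.0):
    """Sketch of DistilBERT's objective: hard-label MLM cross-entropy,
    temperature-softened distillation against the teacher, and a cosine
    loss aligning student and teacher hidden states."""
    n = len(labels)
    # (1) masked-LM cross-entropy on the true tokens
    ce = -np.log(softmax(s_logits)[np.arange(n), labels]).mean()
    # (2) distillation: KL(teacher_T || student_T), scaled by T^2
    p_t, p_s = softmax(t_logits / T), softmax(s_logits / T)
    kd = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T
    # (3) cosine distance between student and teacher hidden states
    cos = 1 - (np.sum(s_hid * t_hid, axis=-1)
               / (np.linalg.norm(s_hid, axis=-1)
                  * np.linalg.norm(t_hid, axis=-1))).mean()
    return w_ce * ce + w_kd * kd + w_cos * cos
```

When the student matches the teacher exactly, the distillation and cosine terms vanish and only the masked-LM term remains.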
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
  • This paper by Dai et al. from CMU and Google Brain in 2019 proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence.
  • Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme that uses relative positional embeddings (compared to the absolute positional encoding in a vanilla Transformer architecture) which enable longer-context attention.
  • Transformer-XL not only captures longer-term dependency than RNNs and vanilla Transformers and resolves the context fragmentation problem, but also achieves substantial speedups during evaluation. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
  • They improve the state-of-the-art results of BPC/Perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
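The segment-level recurrence can be sketched as follows: queries come only from the current segment, but keys and values extend over the cached hidden states of the previous segment. The relative positional terms and the stop-gradient on the cache are omitted here, and the weights are placeholders:

```python
import numpy as np

def segment_attention(x, mem, Wq, Wk, Wv):
    """Segment-level recurrence sketch (Transformer-XL): the current segment
    attends over the concatenation of cached previous-segment states and
    itself, extending the usable context beyond the segment length."""
    ctx = np.concatenate([mem, x], axis=0) if mem is not None else x
    Q, K, V = x @ Wq, ctx @ Wk, ctx @ Wv
    logits = Q @ K.T / np.sqrt(x.shape[1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # cache this output as `mem` for the next segment
```

Chaining segments this way gives each layer an effective context that grows linearly with depth, which is why relative (rather than absolute) positional encodings are needed: a token's position must be defined by its distance, not by an index that resets every segment.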
XLNet: Generalized Autoregressive Pretraining for Language Understanding
  • With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects the dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
  • This paper by Yang et al. from CMU and Google in 2019 proposes XLNet considering BERT’s aforementioned pros and cons, and offers a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (thereby proposing a new objective called Permutation Language Modeling), and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Put simply, XLNet is a generalized autoregressive pretraining method that uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoder methods.
  • Furthermore, the neural architecture of XLNet is developed to work seamlessly with the autoregressive objective, including integrating ideas from Transformer-XL, the state-of-the-art autoregressive model and the careful design of the two-stream attention mechanism. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
  • Code.
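The permutation objective can be made concrete through the attention mask it induces: for a sampled factorization order, each position may attend only to the positions that precede it in that order (this is the query-stream view; the content stream additionally attends to the position itself). A small sketch:

```python
import numpy as np

def permutation_attention_mask(order):
    """Permutation-LM sketch: for factorization order z, the token at
    position z[t] may attend only to positions z[:t].
    Returns a boolean mask M with M[i, j] = True iff i may attend to j."""
    n = len(order)
    mask = np.zeros((n, n), dtype=bool)
    for t, i in enumerate(order):
        for j in order[:t]:
            mask[i, j] = True
    return mask
```

Averaging over many sampled orders, every position eventually conditions on every other position, which is how XLNet obtains bidirectional context while keeping a strictly autoregressive factorization.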
Adaptive Input Representations for Neural Language Modeling
  • This paper by Baevski and Auli from Facebook AI in 2019 introduces adaptive input representations by varying the size of input word embeddings for neural language modeling. Adaptive input embeddings can improve accuracy while drastically reducing the number of model parameters.
  • There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units.
  • They perform a systematic comparison of popular choices for a self-attentional architecture.
  • Their experiments show that models equipped with adaptive embeddings are more than twice as fast to train as the popular character-input CNN, while having fewer parameters.
  • On the WIKITEXT-103 benchmark, they achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, they achieve 23.02 perplexity.
Attention Interpretability Across NLP Tasks
  • This paper by Vashishth et al. from IISc and Google in 2019 seeks to empirically verify the hypothesis that attention weights are interpretable and are correlated with feature importance measures; however, this holds only for cases where attention weights are essential to the model’s prediction.
  • Some works (Jain & Wallace, 2019; Vig & Belinkov, 2019) have demonstrated that attention weights are not interpretable and that altering them does not affect model output, while several others have shown that attention captures several linguistic notions in the model. The authors extend the analysis of prior works to diverse NLP tasks and demonstrate that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases where attention weights are essential to the model’s prediction and cannot simply be reduced to a gating unit. Rather than taking a black-and-white stance, the paper takes a balanced approach: it draws on previous literature that questioned the claim that “attention is indicative of model predictions” and shows when attention is interpretable and when it is not.
  • The attention layer in a neural network model provides insights into the model’s reasoning behind its prediction, which are usually criticized for being opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights. Amid such confusion arises the need to understand attention mechanism more systematically. The paper attempts to fill this gap by giving a comprehensive explanation which justifies both kinds of observations (i.e., when is attention interpretable and when it is not). Through a series of experiments on diverse NLP tasks, they validate their observations and reinforce the claim of interpretability of attention through manual evaluation.
  • They find that in both single and pair sequence tasks, the attention weights in samples with original weights do make sense in general. However, in the former case, the attention mechanism learns to give higher weights to tokens relevant to both kinds of sentiment. They show that attention weights in single sequence tasks do not provide a reason for the prediction, whereas in the case of pairwise tasks, attention does reflect the reasoning behind model output.
  • Unrelated to the paper: To use attention visualization as a proxy for interpreting your predictions, use the BertViz library. The lib supports multiple views and supports a plethora of models (BERT, GPT-2, XLNet, RoBERTa, XLM, ALBERT, DistilBERT, BART etc.). The BertViz repo has some nice examples to get started.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • This paper by Selvaraju et al. from Parikh/Batra’s team at GATech in 2019 proposes a technique for producing ‘visual explanations’ for decisions from a large class of CNN-based models, making them more transparent and explainable.
  • Their approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
  • Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training.
  • They combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
  • In the context of image classification models, their visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias.
  • For image captioning and VQA, their visualizations show that even non-attention based models learn to localize discriminative regions of input image.
  • They devise a way to identify important neurons through GradCAM and combine it with neuron names to provide textual explanations for model decisions.
  • Finally, they design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
  • Code; CloudCV demo.
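The core computation is compact enough to sketch directly: given the final convolutional feature maps and the gradient of the target class score with respect to them, Grad-CAM global-average-pools the gradients into per-channel importance weights, takes a weighted sum of the maps, and applies a ReLU:

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Grad-CAM sketch: weight each feature map by the spatial average of the
    target-class gradient flowing into it, sum over channels, then ReLU.
    feature_maps, grads: arrays of shape (C, H, W)."""
    alphas = grads.mean(axis=(1, 2))                 # neuron-importance weights
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(axis=0), 0)
    if cam.max() > 0:
        cam /= cam.max()                             # normalize to [0, 1] for display
    return cam
```

In practice the resulting coarse map is then upsampled to the input resolution and overlaid on the image; the ReLU keeps only features that positively influence the class of interest.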
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
  • This paper by Artetxe and Schwenk from University of the Basque Country and FAIR introduces an architecture to learn joint multilingual sentence representations, called LASER (Language-Agnostic SEntence Representations), for 93 languages, belonging to more than 30 different families and written in 28 different scripts. The work focuses on universal language agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task. The motivations for such representations are multiple: the hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (typically English) to another, and the possibility to handle code-switching. To that end, they train a single encoder to handle multiple languages, so that semantically similar sentences in different languages are close in the embedding space.
  • Their system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables them to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification.
  • Their experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of their approach.
  • They also introduce a new test set of aligned sentences in 112 languages, and show that their sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
  • Code with the pretrained encoder and multilingual test set.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  • For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset.
  • This paper by Wang et al. from NYU, UW, and DeepMind in ICLR 2019 introduces the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. They further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models.
  • They evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
Parameter-Efficient Transfer Learning for NLP
  • Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task.
  • As an alternative, this paper by Houlsby et al. from Google in ICML 2019 proposes transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing.
  • To demonstrate adapter’s effectiveness, they transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark.
  • Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, they attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
  • The following figure from the paper shows the architecture of the adapter module and its integration with the Transformer. Left: They add the adapter module twice to each Transformer layer: after the projection following multiheaded attention and after the two feed-forward layers. Right: The adapter consists of a bottleneck which contains few parameters relative to the attention and feedforward layers in the original model. The adapter also contains a skip-connection. During adapter tuning, the green layers are trained on the downstream data, this includes the adapter, the layer normalization parameters, and the final classification layer (not shown in the figure).

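The bottleneck adapter itself is only a few lines. A sketch with ReLU standing in for the paper's nonlinearity and illustrative shapes; near-zero initialization of the up-projection makes the module start close to an identity function, thanks to the skip connection:

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter sketch: down-project to a small dimension m,
    apply a nonlinearity, up-project back to d, and add a skip connection."""
    z = np.maximum(h @ W_down + b_down, 0)   # (n, m) bottleneck activations
    return h + z @ W_up + b_up               # skip connection back to (n, d)
```

With hidden size \(d\) and bottleneck \(m \ll d\), each adapter adds only \(2md + m + d\) parameters, which is why the per-task overhead stays in the low single-digit percentages.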
Cross-lingual Language Model Pretraining
  • Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
  • This paper by Lample and Conneau from FAIR extends this approach to multiple languages and shows the effectiveness of cross-lingual pretraining. They propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
  • They utilize a shared sub-word vocabulary by processing all languages with the same shared vocabulary created through Byte Pair Encoding (BPE). This greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits or proper nouns. They learn the BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora.
  • They re-balance low/high resource languages using multinomial sampling. Specifically, sentences are sampled according to a multinomial distribution (with \(\alpha=0.5\)) with probabilities \(\left\{q_i\right\}_{i=1 \ldots N}\), where:

    \[q_i=\frac{p_i^\alpha}{\sum_{j=1}^N p_j^\alpha} \quad \text { with } p_i=\frac{n_i}{\sum_{k=1}^N n_k}\]
  • Sampling with this distribution increases the number of tokens associated with low-resource languages and alleviates the bias towards high-resource languages. In particular, this prevents words of low-resource languages from being split at the character level.
  • They obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation.
  • On XNLI, XLM pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, they obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, they obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU.
  • The following figure from the paper shows the concept behind cross-lingual language model pretraining. The MLM objective is similar to the one of Devlin et al. (2018), but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Position embeddings of the target sentence are reset to facilitate the alignment.
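The re-balanced multinomial sampling formula above is easy to compute directly; a sketch (function name and toy sentence counts are assumptions):

```python
import numpy as np

def resampling_probs(counts, alpha=0.5):
    """Re-balanced sampling probabilities as in XLM: q_i proportional to p_i^alpha,
    where p_i is the empirical share of language i in the corpus."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    q = p ** alpha
    return q / q.sum()

# A high-resource language (1,000,000 sentences) vs. a low-resource one (10,000).
p = np.array([1_000_000.0, 10_000.0]); p /= p.sum()
q = resampling_probs([1_000_000, 10_000], alpha=0.5)
assert np.isclose(q.sum(), 1.0)
assert q[1] > p[1]   # the low-resource language is up-sampled
assert q[0] < p[0]   # the high-resource language is down-sampled
```

With \(\alpha=0.5\), the low-resource share rises from under 1% to roughly 9%, which is exactly the bias-alleviation effect the paper describes.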

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
  • A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms.
  • This paper by Zhao et al. in EMNLP 2019 proposes a new metric, called MoverScore, that shows a high correlation with human judgment of text quality.
  • They validate MoverScore on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems.
  • Their findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, the authors make their metric available as a web service.
  • The following figure from the paper shows an illustration of MoverScore vs. BERTScore.
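To make the "distance over contextualized embeddings" idea concrete, here is a greedy nearest-neighbor lower bound on the Earth Mover Distance between two sets of token embeddings. This is a simplification, not MoverScore's full optimal-transport solve, and the variable names are assumptions:

```python
import numpy as np

def relaxed_mover_distance(X, Y):
    """Greedy lower bound on the Earth Mover Distance between two sets of
    embeddings: match each vector to its nearest counterpart on the other side."""
    d_xy = np.mean([min(np.linalg.norm(x - y) for y in Y) for x in X])
    d_yx = np.mean([min(np.linalg.norm(y - x) for x in X) for y in Y])
    return max(d_xy, d_yx)  # symmetrize by taking the tighter bound

rng = np.random.default_rng(0)
sys_out = rng.standard_normal((5, 16))    # stand-in for system-output token embeddings
reference = rng.standard_normal((6, 16))  # stand-in for reference token embeddings

assert relaxed_mover_distance(sys_out, sys_out) == 0.0   # identical texts: zero cost
assert relaxed_mover_distance(sys_out, reference) > 0.0  # different texts: positive cost
```

The full metric instead solves a transport problem that moves probability mass between the two embedding sets, so it compares semantics rather than surface forms.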

Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
  • This paper by Popov et al. from Yandex introduces the Neural Oblivious Decision Ensembles (NODE) architecture for machine learning on tabular data.
  • NODE generalizes ensembles of oblivious decision trees, allowing for gradient-based optimization and multi-layer hierarchical representation learning. It’s designed to improve performance on tabular data, a domain where deep learning hasn’t outperformed gradient boosting decision trees (GBDT).
  • NODE uses differentiable oblivious decision trees, which are more efficient and less prone to overfitting compared to conventional decision trees. This architecture allows for end-to-end training and integrates smoothly into deep learning pipelines.
  • A key feature of NODE is the use of the entmax transformation, which enables differentiable split decision construction within the tree nodes. Entmax generalizes both sparsemax and softmax; it is able to learn sparse decision rules, but is smoother than sparsemax, being more appropriate for gradient-based optimization. Entmax is capable of producing sparse probability distributions and learning splitting decisions based on a small subset of data features.
  • The following figure from the paper shows a single oblivious decision tree (ODT) inside the NODE layer. The splitting features and the splitting thresholds are shared across all the internal nodes of the same depth. The output is a sum of leaf responses scaled by the choice weights.

  • The following figure from the paper shows an illustration of the NODE architecture, consisting of densely connected NODE layers. Each layer contains several trees whose outputs are concatenated and serve as input for the subsequent layer. The final prediction is obtained by averaging the outputs of all trees from all the layers.

  • The architecture was extensively compared to leading GBDT implementations like CatBoost and XGBoost across various datasets. NODE consistently outperformed these traditional methods, particularly in settings with default hyperparameters.
  • NODE’s design includes a multidimensional tree output for classification problems and a concatenation of outputs from multiple trees. This facilitates learning both shallow and deep decision rules.
  • The paper also presents an ablative analysis, demonstrating the influence of different architectural choices, like choice functions (e.g., softmax, entmax) and architecture depth on performance.
  • The authors highlight the potential of incorporating NODE into complex pipelines for multi-modal problems, suggesting future research directions in integrating tabular data with other data types like images or sequences.
  • Overall, NODE introduces an innovative deep learning architecture for tabular data, showcasing its effectiveness over traditional methods and opening new avenues for research in this domain.
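As a concrete illustration of the sparse choice functions NODE relies on, here is sparsemax, the sparse endpoint (\(\alpha=2\)) of the entmax family; this is a minimal numpy sketch, not the paper's entmax-\(\alpha\) implementation:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: projects logits onto the probability simplex, zeroing out
    small entries; entmax interpolates between this and softmax."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum       # which sorted entries stay nonzero
    k_z = k[support][-1]                      # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z         # threshold subtracted from all logits
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.0, 0.5, -1.0])
assert np.isclose(p.sum(), 1.0)
assert np.allclose(p, [0.75, 0.25, 0.0])      # the weakest logit is zeroed out
```

Unlike softmax, which always assigns every feature nonzero weight, this projection yields exact zeros, which is what lets NODE's differentiable trees learn splitting decisions over a small subset of features.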
Latent Retrieval for Weakly Supervised Open Domain Question Answering
  • This paper by Lee et al. in ACL 2019 from Google Research, addresses the challenge of open domain question answering (QA) without relying on strong supervision of evidence or black-box information retrieval (IR) systems.
  • The authors introduce the Open-Retrieval Question Answering system (ORQA), which learns to retrieve evidence from an open corpus, supervised only by question-answer string pairs. This approach contrasts with traditional methods that either assume gold evidence or depend on black-box IR systems.
  • A central aspect of ORQA is its ability to retrieve any text from an open corpus, unlike traditional methods that rely on a closed set of evidence documents. This capability is enabled by pre-training the retriever using an unsupervised Inverse Cloze Task (ICT). In ICT, a sentence is treated as a pseudo-question, and its context is treated as pseudo-evidence, requiring the model to predict the context given the sentence.
  • The implementation of ORQA leverages the BERT (Bidirectional Encoder Representations from Transformers) architecture for both its retriever and reader components. This choice capitalizes on recent advances in transfer learning and the strong representational power of BERT.
    • Here are the key aspects of the ORQA model architecture:
      1. Retriever Component:
        • The retriever is the first key component of ORQA. It is responsible for selecting relevant document passages from a large corpus that may contain the information required to answer the input question.
        • This component is pre-trained using an unsupervised learning task called the Inverse Cloze Task (ICT). In ICT, the model is given a sentence (treated as a pseudo-question) and is tasked with identifying its surrounding context (treated as pseudo-evidence) from the corpus. This pre-training helps the model learn an effective strategy for document retrieval based on the context of questions.
      2. Reader Component:
        • Following the retrieval stage, the reader component takes over. It processes the passages retrieved by the retriever to generate a precise answer to the input question.
        • The reader, like the retriever, is based on the BERT model. It is fine-tuned to perform the question answering task, taking into account the context provided by the passages retrieved by the retriever.
      3. Integration of BERT:
        • Both the retriever and reader components are built on the BERT framework. BERT’s powerful bidirectional context understanding capabilities make it ideal for understanding the nuances in natural language questions and passages.
        • The use of BERT as a base model facilitates effective transfer learning, where the model, pre-trained on a large corpus, adapts to the specific requirements of question answering and document retrieval tasks through fine-tuning.
      4. End-to-End Training:
        • ORQA is unique in that it is trained end-to-end, meaning that both the retriever and reader are trained simultaneously. This approach allows the retriever to be optimized specifically for the types of questions and answers handled by the reader, leading to a more coherent and effective QA system.
    • In essence, ORQA’s architecture represents a significant advance in open-domain question answering systems, allowing it to handle a wide range of questions by effectively searching and interpreting a vast corpus of unstructured text.
  • The authors address the challenges of inference and learning in an open evidence corpus with a large search space and latent navigation requirements. They accomplish this by pre-training the retriever to provide a strong initialization, enabling dynamic yet fast top-k retrieval during fine-tuning.
  • The following figure from the paper shows an overview of ORQA. A subset of all possible answer derivations given a question \(q\) is shown here. Retrieval scores \(S_{\text {retr }}(q, b)\) are computed via inner products between BERT-based encoders. Top-scoring evidence blocks are jointly encoded with the question, and span representations are scored with a multi-layer perceptron (MLP) to compute \(S_{\text {read }}(q, b, s)\). The final joint model score is \(S_{\text {retr }}(q, b)+S_{\text {read }}(q, b, s)\). Unlike previous work using IR systems for candidate proposal, ORQA learns to retrieve from all of Wikipedia directly.

  • ORQA’s effectiveness is demonstrated through its performance on open versions of five QA datasets. Notably, on datasets where question writers are genuinely seeking information (as opposed to knowing the answer beforehand), ORQA significantly outperforms traditional IR systems like BM25.
  • The paper includes a comprehensive experimental setup, comparing ORQA with other retrieval methods on different datasets. These comparisons illustrate ORQA’s strengths, especially in scenarios where question askers do not already know the answer, highlighting the importance of learned retrieval in such contexts.
  • The authors also discuss the challenges and potential biases in the datasets used for evaluation, providing insights into the limitations and considerations for open-domain QA systems.
  • In summary, this paper presents a novel approach to open-domain question answering by introducing an end-to-end model that jointly learns retriever and reader components. This model significantly improves upon traditional methods in scenarios that reflect genuine information-seeking questions, marking a notable advancement in the field of natural language processing and QA systems.
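The retrieval score \(S_{\text{retr}}(q, b)\) is just an inner product between encoded vectors, which is what makes fast top-k retrieval possible; a sketch with random vectors standing in for the BERT-based encodings:

```python
import numpy as np

def top_k_retrieval(q_vec, block_vecs, k=3):
    """Dense retrieval in the style of ORQA: score every evidence block by an
    inner product with the question encoding, return the top-k blocks."""
    scores = block_vecs @ q_vec              # S_retr(q, b) for every block b
    top = np.argsort(-scores)[:k]            # indices of the k highest-scoring blocks
    return top, scores[top]

rng = np.random.default_rng(0)
blocks = rng.standard_normal((100, 32))          # stand-in for encoded evidence blocks
q = blocks[42] + 0.01 * rng.standard_normal(32)  # question encoding close to block 42

top, scores = top_k_retrieval(q, blocks, k=3)
assert top[0] == 42                              # the matching block is retrieved first
assert np.all(np.diff(scores) <= 0)              # scores come back in descending order
```

In practice the block encodings are precomputed once, so retrieval at fine-tuning time reduces to a maximum-inner-product search over this matrix.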
Multi-Stage Document Ranking with BERT
  • This paper by Nogueira et al. from NYU and the University of Waterloo, published in 2019, introduces a novel approach to document ranking using BERT in a multi-stage architecture.
  • The authors propose two BERT-based models for document ranking: monoBERT and duoBERT. MonoBERT operates as a pointwise classification model assessing individual document relevance, while duoBERT adopts a pairwise approach, comparing pairs of documents for relevance.
  • Their multi-stage ranking system integrates these models into a pipeline, striking a balance between computational efficiency and ranking quality. The approach allows for a trade-off between result quality and inference latency by controlling candidate admission at each stage.
  • Experiments conducted on the MS MARCO and TREC CAR datasets demonstrate the models’ effectiveness. The system matches or exceeds state-of-the-art performance, with detailed ablation studies showing the contribution of each component.
  • The following figure from the paper shows an illustration of the multi-stage ranking architecture. In the first stage \(H_0\), given a query \(q\), the top-\(k_0\) (\(k_0=5\) in the figure) candidate documents \(R_0\) are retrieved using BM25. In the second stage \(H_1\), monoBERT produces a relevance score \(s_i\) for each pair of query \(q\) and candidate \(d_i \in R_0\). The top- \(k_1\left(k_1=3\right.\) in the figure) candidates with respect to these relevance scores are passed to the last stage \(H_2\), in which duoBERT computes a relevance score \(p_{i, j}\) for each triple \(\left(q, d_i, d_j\right)\). The final list of candidates \(R_2\) is formed by re-ranking the candidates according to these scores. These pairwise scores are aggregated as mentioned below.

  • In the multi-stage document ranking system using duoBERT, pairwise scores are aggregated using five different methods so that each document receives a single score: Sum, Binary, Min, Max, and Sample. These methods vary in how they interpret and utilize the pairwise agreement scores for re-ranking the candidates from the previous stage, each focusing on different aspects of the pairwise comparisons. The Sum method aggregates by summing the pairwise agreement scores, indicating the relevance of a candidate document over others. Binary is inspired by the Condorcet method, considering if the pairwise score is greater than a threshold (0.5). Min and Max methods focus on the strongest and weakest competitor, respectively. The Sample method reduces computational costs by sampling from the pairwise comparisons. These methods allow for re-ranking the candidates from the previous stage according to their aggregated scores.
  • The research addresses the trade-offs inherent in multi-stage ranking systems, exploring the balance between the depth of the ranking model and the breadth of candidate documents considered.
  • The authors also emphasize the potential of BERT models in document ranking tasks, highlighting the advantages over traditional ranking methods, especially in terms of leveraging deep learning and natural language understanding capabilities.
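The deterministic aggregation methods (Sum, Binary, Min, Max) can be sketched over a matrix of pairwise scores \(p_{i,j}\); this is an illustrative reconstruction, not the authors' code:

```python
import numpy as np

def aggregate(P, method="SUM"):
    """Aggregate duoBERT pairwise scores P[i, j] = p_{i,j} (probability that
    document i is more relevant than document j) into one score per document."""
    np.fill_diagonal(P, np.nan)                # a document is not compared with itself
    if method == "SUM":
        return np.nansum(P, axis=1)            # total pairwise agreement
    if method == "BINARY":
        return np.nansum(P > 0.5, axis=1)      # Condorcet-style win count
    if method == "MIN":
        return np.nanmin(P, axis=1)            # score against the strongest competitor
    if method == "MAX":
        return np.nanmax(P, axis=1)            # score against the weakest competitor
    raise ValueError(method)

# Document 0 beats both rivals; document 2 loses to both.
P = np.array([[0.0, 0.9, 0.8],
              [0.1, 0.0, 0.7],
              [0.2, 0.3, 0.0]])
for method in ("SUM", "BINARY", "MIN", "MAX"):
    s = aggregate(P.copy(), method)
    assert s[0] == max(s)                      # every method ranks document 0 first
```

On a clear-cut score matrix like this one all four methods agree; they diverge when pairwise preferences are inconsistent, which is why the paper compares them empirically.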
Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
  • This paper by Ma et al. published in KDD 2018, introduces a novel approach to multi-task learning called Multi-gate Mixture-of-Experts (MMoE). The method aims to enhance the performance of multi-task learning models by better handling the relationships between different tasks.
  • The MMoE model adapts the Mixture-of-Experts (MoE) framework to multi-task learning by sharing expert submodels across all tasks and using a gating network optimized for each task. This design allows the model to dynamically allocate shared and task-specific resources, efficiently handling tasks with varying degrees of relatedness.
  • The paper presents experiments using synthetic data and real datasets, including a binary classification benchmark and a large-scale content recommendation system at Google. These experiments demonstrate MMoE’s effectiveness in scenarios where tasks have low relatedness and its superiority over traditional shared-bottom multi-task models in terms of both performance and trainability.
  • MMoE’s architecture consists of multiple experts (feed-forward networks) and a gating network for each task, which determines the contribution of each expert to the task. This setup allows the model to learn nuanced relationships between tasks and allocate computation resources more effectively.
  • The following figure from the paper shows a (a) shared-bottom model, (b) one-gate MoE model, (c) multi-gate MoE model.

  • In the experiments with the Census-income dataset, a UCI benchmark dataset, the task was to predict whether an individual’s income exceeds $50,000 based on census data. The dataset contains demographic and employment-related information. MMoE’s application to this dataset involved addressing the challenge of binary classification using multiple socio-economic factors as input features.
  • On synthetic data, MMoE showed better performance, especially when task correlation is low, and demonstrated improved trainability with less variance in model performance across runs. On real-world datasets, including the UCI Census-income dataset and Google’s content recommendation system, MMoE consistently outperformed baseline models in terms of accuracy and robustness.
  • MMoE offers computational efficiency by using lightweight gating networks and shared expert networks, making it suitable for large-scale applications. The experiments on Google’s recommendation system highlighted MMoE’s ability to improve both engagement and satisfaction metrics in live experiments compared to single-task and shared-bottom models.
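A minimal numpy sketch of the MMoE forward pass described above, with shared experts and one softmax gate per task (layer sizes and nonlinearity are illustrative choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_Ws, gate_Ws):
    """MMoE: every task mixes the same shared experts, but each task's gating
    network chooses its own mixture weights per example."""
    experts = np.stack([np.tanh(x @ W) for W in expert_Ws])   # (n_experts, batch, d)
    outputs = []
    for Wg in gate_Ws:                                        # one gate per task
        g = softmax(x @ Wg)                                   # (batch, n_experts)
        mixed = np.einsum("be,ebd->bd", g, experts)           # task-specific mixture
        outputs.append(mixed)
    return outputs

rng = np.random.default_rng(0)
batch, d_in, d_exp, n_experts, n_tasks = 4, 8, 16, 3, 2
x = rng.standard_normal((batch, d_in))
expert_Ws = [rng.standard_normal((d_in, d_exp)) for _ in range(n_experts)]
gate_Ws = [rng.standard_normal((d_in, n_experts)) for _ in range(n_tasks)]

outs = mmoe_forward(x, expert_Ws, gate_Ws)
assert len(outs) == n_tasks
assert all(o.shape == (batch, d_exp) for o in outs)
```

With a single shared gate this reduces to the one-gate MoE of panel (b) in the figure; giving each task its own gate is what lets loosely related tasks use the experts differently.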
Synthetic QA Corpora Generation with Roundtrip Consistency
  • This paper by Alberti et al. from Google Research, introduces a novel method for generating synthetic question answering (QA) corpora. The method employs roundtrip consistency to filter results, combining models of question generation and answer extraction.
  • For a given passage \(C\), they sample an extractive short answer \(A\) (Step (1) in Table 1). In Step (2), they generate a question \(Q\) conditioned on \(A\) and \(C\), then (Step (3)) predict the extractive answer \(A^{\prime}\) conditioned on \(Q\) and \(C\). If \(A\) and \(A^{\prime}\) match, they emit \((C, Q, A)\) as a new synthetic training example (Step (4)). They train a separate model on labeled QA data for each of the first three steps, and then apply the models in sequence on a large number of unlabeled text passages. They show that pretraining on synthetic data generated through this procedure provides significant improvements on two challenging datasets, SQuAD2 and NQ, achieving a new state of the art on the latter.
  • The approach utilizes BERT models in several key aspects. For answer extraction, the team employed two distinct BERT-based models:
    • Question-Unconditional Extractive Answer Model: This model extracts potential answer spans from the context without any question input. It’s trained on labeled answer spans, enabling it to identify likely answer segments within a given text.
    • Question-Conditional Extractive Answer Model: This variant takes both the passage and a question as input, and it predicts the answer span within the passage. It is fine-tuned on a QA dataset, allowing it to extract answers that are specifically relevant to the given question.
  • In question generation, two approaches were explored:
    • Fine-Tuning BERT Model for Question Generation: This method involves fine-tuning a pre-trained BERT model to generate questions based on the input context and answer spans. This approach utilizes the natural language understanding capabilities of BERT to generate relevant and contextually appropriate questions.
    • Sequence-to-Sequence Generation Model Involving Full Pretraining and Fine-Tuning: Here, a more complex approach was taken. A sequence-to-sequence model was first fully pre-trained and then fine-tuned to generate questions. This method likely involves using a BERT-like model for encoding the input context and answer, followed by a generative model (like a Transformer-based decoder) to generate the question.
  • The following figure from the paper shows an example of how synthetic question-answer pairs are generated. The model’s predicted answer (\(A'\)) matches the original answer the question was generated from, so the example is kept.

  • The paper’s experiments used datasets like SQuAD2 and NQ, demonstrating significant improvements in QA tasks by pretraining on synthetic data generated through this method. The paper reports results indicating performance close to human levels on these datasets.
  • The paper also explores the efficiency of roundtrip consistency filtering, showing its benefits in improving model performance. It notes differences in the style and quality of generated question-answer pairs depending on the approach used.
  • The authors suggest future research directions, including a more formal understanding of why roundtrip consistency improves QA tasks and potential integration with other methods.
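The four-step roundtrip pipeline can be sketched end to end; the three toy functions below are stand-ins for the paper's learned BERT-based models, not real implementations:

```python
# Toy stand-ins for the three learned models in the roundtrip pipeline; the real
# system uses BERT-based answer extraction and question generation models.
def sample_answer(context):                  # step (1): extract a candidate answer A
    return context.split()[-1]

def generate_question(answer, context):      # step (2): generate Q conditioned on (A, C)
    return f"What is mentioned last in: {context}?"

def extract_answer(question, context):       # step (3): predict A' conditioned on (Q, C)
    return context.split()[-1] if "last" in question else context.split()[0]

def roundtrip_filter(contexts):
    """Keep (C, Q, A) only when the predicted answer A' matches A (step 4)."""
    kept = []
    for c in contexts:
        a = sample_answer(c)
        q = generate_question(a, c)
        a_prime = extract_answer(q, c)
        if a_prime == a:                     # roundtrip consistency check
            kept.append((c, q, a))
    return kept

examples = roundtrip_filter(["Paris is the capital of France"])
assert len(examples) == 1
assert examples[0][2] == "France"
```

The filter discards any triple the question-conditional answer model cannot reproduce, which is what weeds out low-quality generated questions before pretraining.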
Towards VQA Models That Can Read
  • This paper by Singh et al. from Facebook AI Research and Georgia Institute of Technology, published in CVPR 2019, presents advancements in Visual Question Answering (VQA) models to include the capability of reading text in images. The authors introduce a new dataset, TextVQA, and propose a novel model architecture named Look, Read, Reason & Answer (LoRRA).
  • The TextVQA dataset contains 45,336 questions based on 28,408 images from the Open Images dataset, focusing on images with text. This dataset is designed to facilitate the development and evaluation of VQA models that require reading text within images to answer questions. It addresses the gap in existing datasets that either have a small proportion of text-related questions (e.g., VQA dataset) or are too small in size (e.g., VizWiz dataset).
  • The figure below from the paper illustrates examples from the TextVQA dataset. TextVQA questions require VQA models to understand text embedded in the images to answer them correctly. Ground truth answers are shown in green and the answers predicted by a state-of-the-art VQA model (Pythia) are shown in red. Clearly, today’s VQA models fail at answering questions that involve reading and reasoning about text in images.

  • LoRRA Architecture:
    • VQA Component: LoRRA extends a typical VQA model by incorporating text detection and reading capabilities. It embeds the question using a pre-trained embedding function (e.g., GloVe) and encodes it with a recurrent network (e.g., LSTM) to produce a question embedding.
    • Reading Component: Utilizes an OCR model to extract text from images. The extracted words are embedded using pre-trained word embeddings (e.g., FastText) and processed with an attention mechanism to generate combined OCR-question features.
    • Answering Module: Integrates a mechanism to decide whether the answer should be directly copied from the OCR output or deduced from a fixed vocabulary. This module predicts probabilities for both pre-determined answers and dynamically identified OCR tokens.
  • Implementation Details:
    • Image Features: Uses both grid-based and region-based features from ResNet-152 and Faster-RCNN models, respectively. These features are attended to using a top-down attention mechanism based on the question embedding.
    • Question and OCR Embeddings: Question words are embedded using GloVe and encoded with LSTM. OCR tokens are embedded using FastText and attended to separately.
    • Combining Features: The final answer prediction is made by combining the VQA and OCR features using a multi-layer perceptron (MLP) that produces logits for each possible answer, including the OCR tokens.
  • The figure below from the paper illustrates an overview of LoRRA. LoRRA looks at the image, reads its text, reasons about the image and text content and then answers, either with an answer a from the fixed answer vocabulary or by selecting one of the OCR strings s. Dashed lines indicate components that are not jointly-trained. The answer cubes on the right with darker color have more attention weight. The OCR token “20” has the highest attention weight in the example.

  • Evaluation:
    • Performance on TextVQA: LoRRA significantly outperforms existing state-of-the-art VQA models on the TextVQA dataset. The authors provide extensive ablation studies showing the contributions of different components of the model.
    • Upper Bounds and Heuristics: The paper evaluates various baselines, demonstrating that incorporating OCR tokens significantly enhances the model’s performance. The best-performing model achieves a validation accuracy of 26.56% on TextVQA.
    • Human vs. Machine Performance: There remains a significant gap between human performance (85.01%) and the model’s performance, highlighting the challenge of the task and the potential for future improvements.
  • Key Findings:
    • The introduction of specialized components like OCR within VQA models substantially improves their ability to handle text-related questions.
    • The TextVQA dataset provides a robust benchmark for developing and evaluating VQA models capable of reading and reasoning about text in images.
    • LoRRA’s architecture, which integrates reading and reasoning components, sets a new direction for enhancing the capabilities of VQA models.
  • The paper emphasizes the importance of specialized skills in VQA models to address real-world applications, particularly for visually impaired users. The proposed methods and dataset mark significant steps towards more capable and context-aware VQA systems.
  • Project page
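The copy-vs-vocabulary decision in LoRRA's answering module amounts to scoring a fixed answer vocabulary and the detected OCR tokens in one combined logit vector; a sketch with illustrative sizes and random weights (not the paper's trained model):

```python
import numpy as np

def answer_logits(vqa_features, W_vocab, ocr_scores):
    """LoRRA-style answer head: concatenate logits over a fixed answer vocabulary
    with copy scores over the OCR tokens detected in the image; argmax then either
    picks a vocabulary answer or copies an OCR string."""
    vocab_logits = vqa_features @ W_vocab      # scores for pre-determined answers
    return np.concatenate([vocab_logits, ocr_scores])

rng = np.random.default_rng(0)
features = rng.standard_normal(32)
W_vocab = rng.standard_normal((32, 100))       # 100 fixed answers (illustrative size)
ocr_scores = np.array([0.2, 50.0, 0.1])        # 3 OCR tokens; token 1 dominates

logits = answer_logits(features, W_vocab, ocr_scores)
assert logits.shape == (103,)
assert logits.argmax() == 101                  # the model "copies" OCR token 1
```

Indices past the vocabulary size index into the dynamically detected OCR tokens, so the answer space changes per image, exactly the property that lets the model answer with text it has read.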


Language Models are Few-Shot Learners
  • Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.
  • This paper by Brown et al. from OpenAI in 2020 introduces Generative Pretrained Transformer (GPT)-3 and shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
  • Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
  • GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
  • At the same time, they also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
  • They also present broader societal impacts of their findings and of GPT-3 in general.
Longformer: The Long-Document Transformer
  • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
  • This paper by Beltagy et al. from Allen AI in 2020 seeks to address this limitation, by introducing the Longformer with an attention mechanism that scales linearly with sequence length (commonly called Sliding Window Attention in the field), making it easy to process documents of thousands of tokens or longer.
  • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
  • The figure below from the paper compares the full self-attention pattern and the configuration of attention patterns in Longformer.

  • Following prior work on long-sequence transformers, they evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8.
  • In contrast to most prior work, they also pretrain Longformer and finetune it on a variety of downstream tasks.
  • Their pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. They finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
  • The figure below from the paper illustrates the runtime and memory of full self-attention and different implementations of Longformer’s self-attention; Longformer-loop is nonvectorized, Longformer-chunk is vectorized, and Longformer-cuda is a custom cuda kernel implementations. Longformer’s memory usage scales linearly with the sequence length, unlike the full self-attention mechanism that runs out of memory for long sequences on current GPUs. Different implementations vary in speed, with the vectorized Longformer-chunk being the fastest.
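The sliding-window-plus-global attention pattern is easy to visualize as a boolean mask; a sketch (window size and global positions are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len, window, global_idx=()):
    """Longformer-style attention mask: each token attends to a local window of
    `window` tokens on each side; positions in `global_idx` attend (and are
    attended to) globally, e.g. the [CLS] token for classification."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window   # banded local attention
    for g in global_idx:
        mask[g, :] = True                              # global token attends everywhere
        mask[:, g] = True                              # every token attends to it
    return mask

m = sliding_window_mask(seq_len=512, window=4, global_idx=(0,))
assert m[100, 104] and not m[100, 105]                 # inside vs. outside the window
assert m[0].all() and m[:, 0].all()                    # position 0 is global
# Each token attends to O(window) positions, not O(seq_len): linear total cost.
assert m[256].sum() == 2 * 4 + 1 + 1                   # window + self + the global token
```

Counting the `True` entries per row shows why memory scales linearly with sequence length: the count is fixed by the window size, not by how long the document is.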

Big Bird: Transformers for Longer Sequences
  • The primary limitation of Transformer-based models is the quadratic complexity (mainly in terms of memory, but also computation) on the sequence length due to their full attention mechanism. BigBird by Zaheer et al. from Google, published in NeurIPS 2020, remedied this by proposing a sparse attention mechanism that reduces this quadratic complexity to linear.
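BigBird's sparse pattern combines a local window, a few global tokens, and random connections; a mask sketch under assumed sizes (not the paper's blocked implementation):

```python
import numpy as np

def bigbird_mask(seq_len, window, n_random, global_idx=(0,), seed=0):
    """BigBird-style sparse attention mask: sliding window + global tokens +
    a handful of random connections per row. Per-row cost is O(window + n_random),
    so total cost grows linearly with sequence length."""
    rng = np.random.default_rng(seed)
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window        # sliding-window band
    for g in global_idx:
        mask[g, :] = mask[:, g] = True                      # global attention
    for row in range(seq_len):                              # random connections
        mask[row, rng.choice(seq_len, size=n_random, replace=False)] = True
    return mask

m = bigbird_mask(seq_len=256, window=3, n_random=2)
assert m[0].all() and m[:, 0].all()                         # global token at position 0
assert m[128].sum() <= (2 * 3 + 1) + 2 + 1                  # window + random + global
```

The random links are what distinguish this pattern from Longformer's: they keep the attention graph's diameter small, so information can still flow between distant tokens in a few layers.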
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Although measuring held-out test-set accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Further, ML systems can run to completion without throwing any errors (indicating functional correctness) but can still produce incorrect outputs (indicating behavioral issues). Thus, it is important to test the behavioral aspects of your model to make sure it works as you expected.
  • This paper by Ribeiro et al. from Microsoft, UW and UCI in 2020 introduces CheckList, a model-agnostic and task-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList tests individual capabilities of the model using three different test types.
  • Checklist includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. They illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models.
  • Tests created with CheckList can be applied to any model, making it easy to incorporate in current benchmarks or evaluation pipelines. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model that has “solved” existing benchmarks on three different tasks. They incorporated three distinct types of tests:
    • Minimum Functionality Test (MFT): A Minimum Functionality Test (MFT) uses simple examples to make sure the model can perform a specific task well. For example, they might want to test the performance of a sentiment model when dealing with negations.
    • Invariance Test: Besides testing the functionality of a model, they might also want to test if the model prediction stays the same when trivial parts of inputs are slightly perturbed. These tests are called Invariance Tests (IV).
    • Directional Expectation Test: In the Invariance Test, they expect the outputs after the perturbation to be the same. However, sometimes they might expect the output after perturbation to change; that is when Directional Expectation Tests come in handy.
  • In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • Code.
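  • As a rough sketch (not the authors' code; the real `checklist` library offers a much richer API with templates and test suites), an MFT and an Invariance Test might look like the following, with `toy_sentiment` as a hypothetical stand-in for any model under test:

```python
# Minimal sketch of CheckList-style behavioral tests.
# `toy_sentiment` is a hypothetical model under test, not a real system.

def toy_sentiment(text: str) -> str:
    """A deliberately naive sentiment model: counts cue words."""
    words = text.lower().replace(".", "").split()
    score = sum(w in ("great", "good", "love") for w in words)
    score -= sum(w in ("not", "never", "terrible") for w in words)
    return "positive" if score > 0 else "negative"

def mft_negation(model) -> bool:
    """Minimum Functionality Test: simple negated inputs must get the right label."""
    cases = {"The food is not good.": "negative", "I love this film.": "positive"}
    return all(model(x) == y for x, y in cases.items())

def inv_name_swap(model) -> bool:
    """Invariance Test: swapping a person's name should not change the prediction."""
    pairs = [("Alice thought the movie was great.",
              "Bob thought the movie was great.")]
    return all(model(a) == model(b) for a, b in pairs)
```

A Directional Expectation Test would follow the same pattern but assert that the prediction moves in a specific direction after the perturbation.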
The Curious Case of Neural Text Degeneration
  • Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
  • This paper by Holtzman et al. from Choi’s lab at UW in ICLR 2020 provided a deep analysis into the properties of the most common decoding methods for open-ended language generation. It reveals surprising distributional differences between human text and machine text.
  • In addition, they find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. They show that likelihood maximizing decoding causes repetition and overly generic language usage, while sampling methods without truncation risk sampling from the low-confidence tail of a model’s predicted distribution. Their findings motivate Nucleus (or top-p) Sampling, a simple but effective method that captures the region of confidence of language models effectively to draw the best out of neural generation.
  • By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
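  • The core of nucleus sampling can be sketched in a few lines (an illustrative implementation, not the paper's code):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Nucleus (top-p) sampling: sample from the smallest set of tokens whose
    cumulative probability reaches p, with that "nucleus" renormalized."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # most probable tokens first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1       # smallest nucleus with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With `p = 0.7` and token probabilities `[0.5, 0.3, 0.15, 0.05]`, only the first two tokens form the nucleus, so the low-probability tail can never be sampled.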
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  • Pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
  • This paper by Clark et al. in 2020 from Manning’s lab at Stanford proposes a more sample-efficient pre-training task called replaced token detection, a new self-supervised task for language representation learning, as an alternative to BERT’s masked language modeling (MLM). Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, they train a discriminative text encoder that predicts, for every input token, whether it is an original token or a replacement produced by the small generator network.
  • Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
  • As a result, compared to MLM, their pre-training objective is more compute-efficient and results in better performance on downstream tasks. The contextual representations learned by their approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
  • The gains are particularly strong for small models; for example, they train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Their approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
  • Since ELECTRA works well even when using relatively small amounts of compute, the authors hope this will make developing and applying pre-trained text encoders more accessible to researchers and practitioners with less access to computing resources.
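  • The key efficiency argument, that the loss is defined over every input position rather than only the masked subset, can be sketched as a per-token binary cross-entropy (an illustrative sketch, not ELECTRA's actual implementation):

```python
import numpy as np

def rtd_loss(disc_logits: np.ndarray, is_replaced: np.ndarray) -> float:
    """Replaced-token-detection loss, sketched: binary cross-entropy asking,
    for *every* position, "was this token replaced by the generator?".
    Unlike MLM, the signal covers all tokens, not just the ~15% masked out.
    disc_logits: (seq_len,) discriminator scores; is_replaced: (seq_len,) in {0, 1}."""
    probs = 1.0 / (1.0 + np.exp(-disc_logits))      # sigmoid
    eps = 1e-12
    bce = -(is_replaced * np.log(probs + eps)
            + (1.0 - is_replaced) * np.log(1.0 - probs + eps))
    return float(bce.mean())
```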
TinyBERT: Distilling BERT for Natural Language Understanding
  • Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices.
  • This paper by Jiao et al. from Huazhong University of Science and Technology, Wuhan National Lab for Optoelectronics, and Huawei Noah’s Ark Lab in EMNLP 2020 proposes a novel Transformer distillation method to accelerate inference and reduce model size while maintaining accuracy, specially designed for knowledge distillation (KD) of Transformer-based models. They also propose a two-stage framework for TinyBERT.
  • By leveraging this new KD method, the abundant knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT.
  • Then, they introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
  • TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference.
  • Extensive experiments show that TinyBERT achieves competitive performance while significantly reducing the model size and inference time of BERTBASE, which provides an effective way to deploy BERT-based NLP models on edge devices. Specifically, TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERTBASE.
  • Code.
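  • A per-layer slice of the Transformer distillation objective can be sketched as follows (illustrative shapes and plain MSE only; the paper's full objective also distills embeddings and prediction logits):

```python
import numpy as np

def tinybert_layer_loss(att_s, att_t, hid_s, hid_t, W):
    """TinyBERT-style layer distillation, sketched: MSE between student and
    teacher attention matrices, plus MSE between student hidden states
    (projected by a learnable matrix W) and teacher hidden states.
    att_s/att_t: (heads, seq, seq); hid_s: (seq, d_student);
    hid_t: (seq, d_teacher); W: (d_student, d_teacher)."""
    attn_loss = np.mean((att_s - att_t) ** 2)
    hidden_loss = np.mean((hid_s @ W - hid_t) ** 2)
    return float(attn_loss + hidden_loss)
```

The projection `W` is needed because the 4-layer student typically uses a smaller hidden size than its teacher.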
MPNet: Masked and Permuted Pre-training for Language Understanding
  • BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning.
  • This paper by Song et al. from Nanjing University and Microsoft Research in NeurIPS 2020 proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
  • MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input so the model sees a full sentence, thus reducing the position discrepancy (vs. PLM in XLNet).
  • They pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of downstream tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.
  • Code with pre-trained models.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • This paper by Raffel et al. from Google in JMLR delves into the domain of transfer learning in natural language processing (NLP). Published in 2020, it introduces a unified framework, named the Text-to-Text Transfer Transformer (T5), which reformulates all NLP tasks into a text-to-text format. This approach enables the use of a single model, loss function, and set of hyperparameters across various tasks such as translation, question answering, classification, summarization, and sentiment analysis.
  • The paper’s primary goal is not to propose new methods, but to offer a comprehensive view of the existing landscape in transfer learning for NLP. It includes a survey, exploration, and empirical comparison of existing techniques. The team scales up their models to up to 11 billion parameters to assess the current limits and achieve state-of-the-art results on numerous benchmarks.
  • The T5 model is built upon the Transformer architecture, which has become prevalent in recent NLP research. This architecture, originally designed for machine translation, has been effectively applied to various NLP tasks. The T5 model, in particular, employs an encoder-decoder structure with each component being of similar size and configuration to a BERTBASE stack, amounting to approximately 220 million parameters in total.
  • The following figure from the paper shows a diagram of the text-to-text framework. Every task they consider, including translation, question answering, and classification, is cast as feeding the model text as input and training it to generate some target text. This allows the same model, loss function, hyperparameters, etc. to be used across the diverse set of tasks. It also provides a standard testbed for the methods included in their empirical survey. “T5” refers to the model, which they dub the “Text-to-Text Transfer Transformer”.

  • For the training process, T5 employs a denoising objective for pre-training, where a portion of the input tokens is randomly masked, and the model is trained to predict these missing tokens. This pre-training is conducted using unlabeled data, leveraging a dataset named “Colossal Clean Crawled Corpus” (C4), which is an extensive collection of clean and natural English text extracted and processed from the Common Crawl archive. The following figure from the paper shows the schematic of the objective T5 uses in their baseline model. In this example, the sentence “Thank you for inviting me to your party last week.” is processed. The words “for”, “inviting” and “last” (marked with an \(\times\)) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since “for” and “inviting” occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.

  • All tasks are formulated as text-to-text tasks, enabling standard maximum likelihood training, namely, through teacher forcing and a cross-entropy loss. For optimization, AdaFactor is utilized. At test time, greedy decoding is employed, which involves selecting the highest-probability logit at each timestep. Each model is pre-trained for \(2^{19} = 524,288\) steps on C4 before fine-tuning, with a maximum sequence length of 512 and a batch size of 128 sequences. Whenever feasible, multiple sequences are “packed” into each batch entry (hence, colloquially called the “T5 packing trick”), so that batches roughly contain \(2^{16} = 65,536\) tokens. The total batch size and number of steps correspond to pre-training on \(2^{35} \approx 34B\) tokens, considerably less than the roughly 137B tokens used by BERT or the approximately 2.2T tokens used by RoBERTa. Utilizing only \(2^{35}\) tokens achieves a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. Notably, \(2^{35}\) tokens cover only a fraction of the entire C4 dataset, ensuring no data repetition during pre-training.
  • An “inverse square root” learning rate schedule is applied during pre-training, defined as \(\frac{1}{\sqrt{\max(n, k)}}\), where \(n\) is the current training iteration and \(k\) is the number of warm-up steps, set to \(10^4\) in all experiments. This establishes a constant learning rate of 0.01 for the first \(10^4\) steps, then decays the learning rate as \(1/\sqrt{n}\) until pre-training concludes. Although experimenting with a triangular learning rate schedule yielded slightly better results, it necessitates prior knowledge of the total number of training steps. Given the variable number of training steps in some experiments, the more flexible inverse square root schedule is preferred.
  • Models are fine-tuned for \(2^{18} = 262,144\) steps on all tasks, balancing the needs of high-resource tasks, which benefit from more fine-tuning, against the propensity of low-resource tasks to quickly overfit. During fine-tuning, batches of 128 length-512 sequences (\(2^{16}\) tokens per batch) are continued, with a constant learning rate of 0.001. Checkpoints are saved every 5,000 steps, and results are reported on the checkpoint with the highest validation performance. For models fine-tuned on multiple tasks, the best checkpoint for each task is selected independently. Except for the experiments described in Section 3.7, results are reported on the validation set to avoid model selection bias on the test set.
  • The model’s training uses the maximum likelihood objective with teacher forcing and a cross-entropy loss. SentencePiece is used for encoding text into WordPiece tokens, with a vocabulary size of 32,000 wordpieces, encompassing English, German, French, and Romanian languages.
  • In terms of architectural variants, the paper examines different attention mask patterns used in Transformer models. For instance, the encoder in T5 uses a fully-visible attention mask, enabling the self-attention mechanism to consider the entire input sequence. In contrast, the decoder employs a causal masking pattern, preventing each output element from depending on future input elements.
  • The methodology and findings in this paper provide valuable insights into the practical application of transfer learning in NLP, particularly in how large-scale pre-trained models can be effectively adapted to a wide range of language tasks.
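  • The inverse square root schedule described above is a one-liner (a sketch following the formula \(\frac{1}{\sqrt{\max(n, k)}}\) from the paper):

```python
def t5_lr(step: int, warmup: int = 10_000) -> float:
    """T5's "inverse square root" schedule: lr = 1 / sqrt(max(n, k)).
    Constant at 1/sqrt(10^4) = 0.01 during warm-up, then decaying as 1/sqrt(n)."""
    return 1.0 / max(step, warmup) ** 0.5
```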
Scaling Laws for Neural Language Models
  • This paper by Kaplan et al. from OpenAI studies empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range.
  • Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow them to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
  • The following figure from the paper shows that language modeling performance improves smoothly as they increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

  • In particular, they propose that 10x more compute should be spent on a 5.5x larger model trained on 1.8x more tokens (vs. Chinchilla’s finding that 10x more compute should be spent on a 3.2x larger model trained on 3.2x more tokens).
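  • As a back-of-the-envelope sketch of this allocation rule: treating compute as roughly \(C \sim N \cdot D\), growing compute by a factor \(c\) means growing the model by \(c^a\) and the data by \(c^{1-a}\). The exponent below is an approximation chosen for illustration (not a value quoted from the paper) that roughly reproduces the 5.5x/1.8x split:

```python
def allocate_compute(compute_mult: float, model_exp: float = 0.73) -> tuple:
    """Kaplan-style compute allocation, sketched. With C ~ N * D, a compute
    multiplier c splits into a model multiplier c**a and a data multiplier
    c**(1 - a); a ~= 0.73 here is an illustrative approximation."""
    return compute_mult ** model_exp, compute_mult ** (1.0 - model_exp)
```

For `allocate_compute(10.0)` this yields roughly a 5.4x larger model on roughly 1.9x more tokens, close to the paper's rule of thumb.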
Unsupervised Cross-lingual Representation Learning at Scale
  • This paper by Conneau et al. from Facebook AI in ACL 2020 shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
  • They train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.
  • Their model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER.
  • XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models.
  • They also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale.
  • Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks.
  • Facebook AI post.
SpanBERT: Improving Pre-training by Representing and Predicting Spans
  • This paper by Joshi et al. from UW, Princeton University, Allen Institute of Artificial Intelligence, and FAIR in TACL 2020 presents SpanBERT, a pre-training method that is designed to better represent and predict spans of text.
  • Their approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
  • SpanBERT consistently outperforms BERT and their better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution.
  • In particular, with the same training data and model size as BERT-large, SpanBERT obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively.
  • They also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.
  • The following figure from the paper offers an illustration of SpanBERT training. The span an American football game is masked. The span boundary objective (SBO) uses the output representations of the boundary tokens, \(\mathbf{x}_4\) and \(\mathbf{x}_9\) (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token football (in pink), which, as marked by the position embedding \(\mathbf{p}_3\), is the third token from \(\mathbf{x}_4\).

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
  • Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
  • This paper by Aghajanyan et al. from Facebook AI argues that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon.
  • They empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, they can tune a RoBERTa model to achieve 90% of the full-parameter performance levels on MRPC.
  • Furthermore, they empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness.
  • Lastly, they connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
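  • The low-dimensional reparameterization at the heart of the paper can be sketched as follows (illustrative sizes; a real setup would use a structured random projection such as the Fastfood transform for memory efficiency):

```python
import numpy as np

def reparam(theta_0: np.ndarray, P: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Intrinsic-dimension fine-tuning, sketched: only the low-dimensional
    vector `d` is trained; a frozen random projection P maps it back into the
    full parameter space around the frozen pre-trained weights theta_0."""
    return theta_0 + P @ d

# Hypothetical sizes: a 10,000-parameter "model" tuned via 200 parameters.
rng = np.random.default_rng(0)
D_full, d_int = 10_000, 200
theta_0 = rng.standard_normal(D_full)
P = rng.standard_normal((D_full, d_int)) / np.sqrt(d_int)
d = np.zeros(d_int)            # at initialization, theta equals theta_0
```

Gradient descent then updates only `d`, so the effective number of trainable parameters is `d_int`, not `D_full`.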
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
  • This paper by Xiong et al. from Microsoft presents a novel approach to improve dense text retrieval (DR) efficiency and effectiveness.
  • The paper identifies a primary bottleneck in dense retrieval training: the use of uninformative negatives, which leads to slow learning convergence. These negatives are locally sampled in batches and yield diminishing gradient norms and large stochastic gradient variances.
  • To address this, the authors propose Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE). ANCE selects hard training negatives globally from the entire corpus using an asynchronously updated ANN index. This method aligns the distribution of negative samples in training with irrelevant documents in testing.
  • ANCE is implemented using an asynchronously updated ANN index. This involves maintaining an ‘Inferencer’ that computes document encodings in parallel using a recent checkpoint of the DR model and refreshes the ANN index, keeping up with model training.
  • The following figure from the paper shows the asynchronous training of ANCE. The Trainer learns the representation using negatives from the ANN index. The Inferencer uses a recent checkpoint to update the representation of documents in the corpus and once finished, refreshes the ANN index with most up-to-date encodings.

  • The effectiveness of ANCE was demonstrated in three text retrieval scenarios: standard web search, OpenQA (Open Domain Question Answering), and a commercial search engine’s retrieval system. ANCE showed significant improvements over traditional methods, nearly matching the accuracy of a BERT-based cascade IR pipeline while being 100x more efficient.
  • The authors empirically validated that the gradient norms on ANCE sampled negatives are much bigger than local negatives, hence improving the convergence of dense retrieval models.
  • The paper also includes extensive experimental methodologies, evaluation results, and discussions on the convergence of dense retrieval training, highlighting the empirical analyses and theoretical foundations that underpin ANCE.
  • Overall, this paper presents a significant advancement in dense text retrieval by addressing the critical issue of ineffective negative sampling and demonstrating the efficiency and effectiveness of ANCE in various retrieval scenarios.
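  • The global hard-negative selection at the core of ANCE can be sketched with brute-force search (illustrative only; the real system queries an asynchronously refreshed ANN index rather than scoring the whole corpus):

```python
import numpy as np

def ance_hard_negatives(q_emb, doc_embs, positive_ids, k=5):
    """Global hard-negative selection in the spirit of ANCE: return the k
    documents most similar to the query that are NOT labeled relevant.
    These replace the uninformative, locally sampled in-batch negatives."""
    scores = doc_embs @ q_emb                 # dot-product relevance scores
    ranked = np.argsort(scores)[::-1]         # most similar documents first
    negatives = [int(i) for i in ranked if int(i) not in positive_ids]
    return negatives[:k]
```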
Document Ranking with a Pretrained Sequence-to-Sequence Model
  • This paper by Nogueira et al. from the University of Waterloo presents MonoT5, a novel approach using a pretrained sequence-to-sequence model, specifically T5, for document ranking tasks.
  • This approach involves training the model to generate relevance labels as “target words,” interpreting the logits of these words as relevance probabilities, a significant shift from conventional methods.
  • The model demonstrates strong performance on the MS MARCO passage ranking task, matching or surpassing previous models, and even outperforming state-of-the-art models in zero-shot transfer on the TREC 2004 Robust Track.
  • Particularly notable is its superior performance in data-poor scenarios with few training examples, leveraging latent knowledge from pretraining.
  • Probing experiments, varying target words, offer insights into how the model utilizes latent knowledge for relevance predictions, highlighting its innovative approach to document ranking.
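  • The scoring rule, interpreting the logits of the target words as a relevance probability, amounts to a two-way softmax (a sketch of the idea, not the authors' code):

```python
import math

def monot5_relevance(true_logit: float, false_logit: float) -> float:
    """MonoT5-style scoring, sketched: the relevance probability of a
    (query, document) pair is a softmax over the decoder logits of the
    target words "true" and "false"."""
    m = max(true_logit, false_logit)          # subtract max for stability
    e_t = math.exp(true_logit - m)
    e_f = math.exp(false_logit - m)
    return e_t / (e_t + e_f)
```

Documents are then ranked by this probability at retrieval time.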
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
  • This paper by Khattab and Zaharia from Stanford University introduces ColBERT, a novel ranking model for efficient and effective passage search, adapting deep Language Models (LMs), specifically BERT, for retrieval tasks.
  • ColBERT employs a late interaction mechanism, which encodes queries and documents separately using BERT, then applies a cheap but powerful interaction step to evaluate fine-grained similarity. This allows for pre-computing document representations offline, significantly speeding up query processing.
  • The late interaction aspect of ColBERT is unique as it separates the encoding of queries and documents, allowing for efficient pre-computation and storage of document representations. This architecture contrasts with traditional methods that intertwine query and document processing. By independently encoding queries and documents, ColBERT facilitates rapid retrieval while maintaining fine-grained similarity comparisons through its MaxSim operator. This design significantly enhances query processing speed, making it highly practical for large-scale applications.
  • The MaxSim operator in ColBERT is a key component of its late interaction architecture. It operates after the independent encoding of queries and documents. The MaxSim operator computes the maximum similarity score between each query term and every term in a document. This approach allows for a fine-grained similarity measurement, as it captures the best match for each query term within the document. The use of MaxSim contributes to the efficiency and effectiveness of ColBERT in large-scale information retrieval tasks, providing a balance between computational efficiency and retrieval accuracy.
  • The model demonstrates competitive effectiveness with existing BERT-based models, achieving considerably faster execution and requiring substantially fewer FLOPs per query.
  • It leverages vector-similarity indexes for end-to-end retrieval from large document collections, improving recall over traditional models.
  • The following figure from the paper shows schematic diagrams illustrating query–document matching paradigms in neural IR. The figure contrasts existing approaches (sub-figures (a), (b), and (c)) with the proposed late interaction paradigm (sub-figure (d)).

  • The following figure from the paper shows the general architecture of ColBERT given a query \(q\) and a document \(d\).

  • ColBERT’s architecture consists of separate BERT-based query and document encoders and a MaxSim operator for late interaction. The encoders produce \(m\)-dimensional embeddings, reducing computational load and space requirements.
  • The paper also details the evaluation of ColBERT on MS MARCO and TREC CAR datasets, highlighting its robustness and scalability in various retrieval scenarios.
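  • The MaxSim late-interaction step reduces to two array operations (an illustrative sketch of the scoring rule, not the production implementation):

```python
import numpy as np

def colbert_score(Q: np.ndarray, D: np.ndarray) -> float:
    """ColBERT late interaction via MaxSim, sketched: for each query token
    embedding, take the maximum similarity over all document token embeddings,
    then sum over query tokens. Q: (n_q, dim), D: (n_d, dim); rows are assumed
    L2-normalized so the dot product is cosine similarity."""
    sim = Q @ D.T                    # (n_q, n_d) token-level similarity matrix
    return float(sim.max(axis=1).sum())
```

Because `D` can be pre-computed offline for every document, only `Q` and this cheap max-then-sum step are needed at query time.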
REALM: Retrieval-Augmented Language Model Pre-Training
  • This paper by Guu et al. from Google Research introduces Retrieval-Augmented Language Model Pre-Training (REALM), a novel framework that augments language model pre-training with a latent knowledge retriever, enabling the model to access a vast external knowledge corpus like Wikipedia during pre-training, fine-tuning, and inference phases. REALM uniquely addresses the challenge of integrating explicit, external knowledge into language models in a modular and interpretable way, differing from traditional models that implicitly store knowledge within their parameters.
  • The authors propose an unsupervised approach to pre-train the knowledge retriever using masked language modeling as a learning signal, which is novel in allowing backpropagation through a retrieval step over millions of documents. The retrieval process is optimized to improve the language model’s perplexity, rewarding helpful retrievals and penalizing uninformative ones. This method also poses a significant computational challenge, which is tackled by structuring the retriever to facilitate caching and asynchronous updates.
  • The following figure from the paper shows the overall framework of REALM. Left: Unsupervised pre-training. The knowledge retriever and knowledge-augmented encoder are jointly pre-trained on the unsupervised language modeling task. Right: Supervised fine-tuning. After the parameters of the retriever (\(\theta\)) and encoder (\(\phi\)) have been pre-trained, they are then fine-tuned on a task of primary interest, using supervised examples.

  • REALM is evaluated on the Open-domain Question Answering (Open-QA) task against state-of-the-art models that either store knowledge implicitly or use non-learned heuristics for retrieval. The results demonstrate that REALM significantly outperforms all previous methods by 4-16% absolute accuracy across three popular Open-QA benchmarks, highlighting the effectiveness of retrieval-augmented pre-training.
  • The implementation includes several innovative strategies to enhance performance, such as salient span masking, incorporating a null document to model the absence of necessary external knowledge, and preventing trivial retrievals during pre-training. These strategies are crucial for focusing the learning process on examples that require world knowledge and ensuring meaningful retriever training.
  • The paper also discusses the potential for REALM to generalize to structured knowledge, multilingual settings, and multi-modal inputs, suggesting a broad applicability of retrieval-augmented models in capturing and utilizing external knowledge across various domains and tasks.
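  • REALM's treatment of the retrieved document as a latent variable boils down to a marginalization over the top-k retrievals (a sketch of the probabilistic structure, not the training code):

```python
import numpy as np

def realm_predict(p_z_given_x: np.ndarray, p_y_given_xz: np.ndarray) -> np.ndarray:
    """REALM's marginalization, sketched: p(y | x) = sum_z p(y | x, z) p(z | x),
    with z restricted to the top-k retrieved documents.
    p_z_given_x: (k,) retriever distribution; p_y_given_xz: (k, n_answers)."""
    return p_z_given_x @ p_y_given_xz
```

Because the retriever distribution appears inside this sum, gradients from the language modeling loss flow back into the retriever, which is what makes the retrieval step trainable.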
Linformer: Self-Attention with Linear Complexity
  • This paper by Wang et al. from Facebook AI proposes a novel approach to optimizing the self-attention mechanism in Transformer models, reducing its complexity from quadratic to linear with respect to sequence length. This method, named Linformer, maintains competitive performance with standard Transformer models while significantly enhancing efficiency in both time and memory usage.
  • Linformer introduces a low-rank approximation of the self-attention mechanism. By empirically and theoretically demonstrating that the self-attention matrix is of low rank, the authors propose a decomposition of the original scaled dot-product attention into multiple smaller attentions via linear projections. This factorization effectively reduces both the space and time complexity of self-attention from \(O(n^2)\) to \(O(n)\), addressing the scalability issues of traditional Transformers.
  • The model architecture involves projecting key and value matrices into lower-dimensional spaces before computing the attention, which retains the model’s effectiveness while reducing computational demands. The approach includes options for parameter sharing across projections, which can further reduce the number of trainable parameters without significantly impacting performance.
  • In summary, here’s how Linformer achieves linear-time attention:

    1. Low-Rank Approximation: The core idea behind Linformer is the observation that self-attention can be approximated by a low-rank matrix. This implies that the complex relationships captured by self-attention in Transformers do not necessarily require a full rank matrix, allowing for a more efficient representation.

    2. Reduced Complexity: While standard self-attention mechanisms in Transformers have a time and space complexity of \(O(n^2)\) with respect to the sequence length (n), Linformer reduces this complexity to \(O(n)\). This significant reduction is both in terms of time and space, making it much more efficient for processing longer sequences.

    3. Mechanism of Linear Self-Attention: Linformer achieves this by decomposing the scaled dot-product attention into multiple smaller attentions through linear projections. Specifically, it introduces two linear projection matrices \(E_i\) and \(F_i\) that are applied when computing the key and value matrices. By first projecting the original \(n \times d\)-dimensional key and value matrices down to \(k \times d\) (with \(k \ll n\)), Linformer effectively reduces the complexity of the attention mechanism.

    4. Combination of Operations: The combination of these operations forms a low-rank factorization of the original attention matrix. Essentially, Linformer simplifies the computational process by approximating the full attention mechanism with a series of smaller, more manageable operations that collectively capture the essential characteristics of the original full-rank attention.

  • The figure below from the paper shows: (left and bottom-right) the architecture and an example of the proposed multihead linear self-attention; (top right) inference time vs. sequence length for the various Linformer models.

  • Experimental validation shows that Linformer achieves similar or better performance compared to the original Transformer on standard NLP tasks such as sentiment analysis and question answering, using datasets like GLUE and IMDB reviews. Notably, the model offers considerable improvements in training and inference speeds, especially beneficial for longer sequences.
  • Additionally, various strategies for enhancing the efficiency of Linformer are tested, including different levels of parameter sharing and the use of non-uniform projected dimensions tailored to the specific demands of different layers within the model.
  • The authors suggest that the reduced computational requirements of Linformer not only make high-performance models more accessible and cost-effective but also open the door to environmentally friendlier AI practices due to decreased energy consumption.
  • In summary, Linformer proposes a more efficient self-attention mechanism for Transformers by leveraging the low-rank nature of self-attention matrices. This approach significantly reduces the computational burden, especially for long sequences, by lowering the complexity of attention calculations from quadratic to linear in terms of both time and space. This makes Linformer an attractive choice for tasks involving large datasets or long sequence inputs, where traditional Transformers might be less feasible due to their higher computational demands.
BLEURT: Learning Robust Metrics for Text Generation
  • This paper by Sellam et al. from Google Research, published in ACL 2020, introduces BLEURT, a learned evaluation metric based on BERT. BLEURT is designed to better model human judgments for text generation tasks, addressing the limitations of traditional metrics like BLEU and ROUGE which often correlate poorly with human evaluations. This is achieved through a novel pre-training scheme that leverages millions of synthetic examples to enhance generalization capabilities, particularly when training data is scarce or out-of-distribution.
  • The authors detail the fine-tuning process for BLEURT using a BERT-based model. This involves initial unsupervised pre-training of BERT to learn contextualized text representations, followed by supervised fine-tuning with a linear layer on top of the BERT output to predict quality ratings. The fine-tuning uses regression loss based on human ratings.
  • A significant aspect of their approach is the creation of synthetic training data. They generate numerous synthetic sentence pairs by perturbing sentences from Wikipedia using techniques like mask-filling with BERT, backtranslation, and random word dropout. This synthetic data is used to train BLEURT under various pre-training tasks, each designed to capture different lexical and semantic discrepancies that are crucial for the robustness of the metric.
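One of the perturbations mentioned above, word dropout, is easy to sketch. This is a toy illustration (the dropout rate and template here are assumptions, not the paper's exact settings): it produces a degraded candidate to pair with its reference sentence.

```python
import random

def word_dropout(sentence, p, rng):
    """Word-dropout perturbation (one of BLEURT's synthetic-pair
    generators, sketched): randomly delete words from a reference
    sentence to create a degraded candidate."""
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else words[0]

ref = "the quick brown fox jumps over the lazy dog"
candidate = word_dropout(ref, 0.3, random.Random(0))
pair = (ref, candidate)   # one synthetic (reference, candidate) example
```

Millions of such pairs, scored by cheap signals (BLEU, ROUGE, backtranslation likelihood, etc.), form the pre-training corpus before fine-tuning on human ratings.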
  • The table below from the paper shows their pre-training signals.

  • BLEURT was evaluated across different domains and setups, showing state-of-the-art performance on the WMT Metrics Shared Task for the years 2017 to 2019 and the WebNLG 2017 Challenge. The paper reports significant improvements in metric performance when pre-trained on synthetic data, highlighting BLEURT’s ability to handle quality and domain drifts effectively.
  • The robustness of BLEURT is further demonstrated through ablation studies, which assess the impact of each pre-training task on the model’s performance. The results validate the importance of comprehensive pre-training in achieving high correlation with human judgments, even under challenging conditions involving data skewness and distribution shifts.
Query-Key Normalization for Transformers
  • This paper by Henry et al. from Cyndx Technologies proposes QKNORM, a normalization technique designed to improve the Transformer model’s performance on low-resource language translation tasks. The primary innovation of QKNORM is its modification of the attention mechanism, specifically targeting the softmax function to prevent arbitrary saturation while maintaining the expressivity of the model.
  • The authors introduce QKNORM as a way to modify the attention mechanism in Transformers. Instead of the traditional scaled dot product attention, QKNORM applies L2 normalization to the query (Q) and key (K) matrices along the head dimension before their multiplication. This results in the dot product being replaced by cosine similarity, which is then scaled up by a learnable parameter rather than divided by the square root of the embedding dimension. This change aims to keep the input to the softmax function within an appropriate range, thus avoiding saturation issues.
  • Implementation Details:
    1. Layer Normalization:
      • The proposed QKNORM technique builds upon existing normalization strategies. It combines FIXNORM (unit length normalization for word embeddings), PRENORM (applying layer normalization to the input of each sublayer), and vanilla LAYERNORM.
      • The new QKNORM method applies L2 normalization along the head dimension of Q and K, resulting in cosine similarity instead of dot products for the attention mechanism.
    2. Learnable Scaling Parameter:
      • The learnable parameter \(g\) is introduced to scale the cosine similarities. It is initialized based on the length of the sequences in the training data using the heuristic \(g_0 = \log_2(L^2 - L)\), where \(L\) is the 97.5th percentile sequence length.
    3. Attention Mechanism Modification:
      • Traditional attention is modified from \(\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\) to \(\text{softmax}(g \cdot \hat{Q}\hat{K}^T)V\), where \(\hat{Q}\) and \(\hat{K}\) are the L2-normalized versions of \(Q\) and \(K\).
    4. Experimental Setup:
      • The experiments were conducted on five low-resource language pairs from the TED Talks corpus and IWSLT’15, translating Arabic, Slovak, and Galician to English, and English to Hebrew and Vietnamese.
      • Training used tokenization scripts from Qi et al. (2018) and the Moses toolkit for BLEU scoring. The experiments followed the implementation from Nguyen and Salazar (2019), adjusting for tokenization differences.
    5. Results:
      • QKNORM showed an average improvement of 0.928 BLEU over the baseline, with significant gains across all five language pairs. For example, the IWSLT’15 en→vi task achieved a test BLEU score of 33.24, compared to 32.79 for the baseline.
      • Ablation studies confirmed the importance of each component, particularly the learnable scaling factor \(g\), which had the most substantial impact on performance.
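The modified attention and the initialization heuristic above can be sketched as follows (single head, toy data; the sequence-length value is illustrative). Because cosine similarities are bounded in \([-1, 1]\), the softmax input is bounded by \(g\), which is what prevents saturation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def qknorm_attention(Q, K, V, g):
    """QKNorm attention: cosine-similarity logits scaled by a learnable
    scalar g, replacing softmax(Q K^T / sqrt(d))."""
    Qh, Kh = l2_normalize(Q), l2_normalize(K)
    return softmax(g * (Qh @ Kh.T)) @ V

# The paper's init heuristic, with L the 97.5th-percentile sequence length.
L = 64
g0 = np.log2(L**2 - L)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
out = qknorm_attention(Q, K, V, g0)
```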
  • The figure below from the paper shows scaled dot product attention: self-attention heatmaps for 4 heads from one encoder layer, displaying more “concentrated” attention, consistent with the conjecture that unnormalized dot products in \(QK^T\) saturate the softmax and limit the attention patterns that can be learned.

  • The figure below from the paper shows Query-Key Normalized Attention. Self-attention heatmaps of the same 4 heads in the above figure. QKNORM enables more diffuse attention patterns.

  • QKNORM effectively enhances Transformer performance on low-resource translation tasks by stabilizing the input to the softmax function, allowing for more diffuse attention patterns and preventing the saturation issues inherent to traditional dot product attention. Future work will explore applying QKNORM in high-resource settings and further analyze its impact on attention head learning dynamics.


Towards a Unified View of Parameter-Efficient Transfer Learning
  • Fine-tuning large pre-trained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connections among the various methods are poorly understood.
  • This paper by He et al. from Neubig’s lab at CMU in ICLR 2022 breaks down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections between them. Specifically, they re-frame them as modifications to specific hidden states in pre-trained models, and define a set of design dimensions along which different methods vary, such as the function to compute the modification and the position to apply the modification.
  • Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, they utilize the unified view to identify important design choices in previous methods. Furthermore, their unified framework enables the transfer of design elements across different approaches, and as a result they are able to instantiate new parameter-efficient fine-tuning methods that tune less parameters than previous methods while being more effective, achieving comparable results to fine-tuning all parameters on all four tasks.
  • The below figure from the paper offers a graphical illustration of existing methods and the proposed variants. “PLM module” represents a certain sublayer of the PLM (e.g., attention or FFN) that is frozen. “Scaled PA” denotes scaled parallel adapter.
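The scaled parallel adapter (“scaled PA”) variant can be sketched as a bottleneck MLP running in parallel with a frozen sublayer, with its output scaled and added to the sublayer's hidden state. This is a minimal NumPy sketch under that unified “modify the hidden state” view (toy shapes; the zero init on the up-projection is a common choice, assumed here):

```python
import numpy as np

def scaled_parallel_adapter(x, sublayer_out, W_down, W_up, s):
    """Scaled parallel adapter sketch: a trainable bottleneck MLP runs
    in parallel with the frozen PLM sublayer; its output, scaled by s,
    is added to the sublayer's hidden state."""
    delta = np.maximum(x @ W_down, 0.0) @ W_up   # bottleneck with ReLU
    return sublayer_out + s * delta

rng = np.random.default_rng(0)
d, r = 32, 4                        # hidden size, bottleneck size
x = rng.standard_normal((5, d))     # sublayer input
sub = rng.standard_normal((5, d))   # frozen sublayer's output
W_down = rng.standard_normal((d, r)) * 0.1
W_up = np.zeros((r, d))             # zero init: adapter starts as a no-op
h = scaled_parallel_adapter(x, sub, W_down, W_up, s=4.0)
```

Only `W_down` and `W_up` are trained, which is what makes the method parameter-efficient; with `W_up` initialized to zero, training starts from the unmodified pre-trained model.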

BinaryBERT: Pushing the Limit of BERT Quantization
  • The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution.
  • This paper by Bai et al. from CUHK and Huawei Noah’s Ark Lab in 2021 proposes BinaryBERT, which pushes BERT quantization to the limit by weight binarization.
  • They find that a binary BERT is harder to train directly than a ternary counterpart due to its steep and complex loss landscape. Therefore, they propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network, followed by fine-tuning for further refinement.
  • The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting.
  • Their approach also supports adaptive splitting that can tailor the size of BinaryBERT based on the edge device constraints.
  • Empirical results show that BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
Towards Zero-Label Language Learning
  • This paper by Wang et al. from Google in 2021 explores “zero-label” learning in NLP, whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. They show that language models (LMs) are also few-shot generators or example creators (rather than just few-shot learners as in the GPT-3 paper) in that they can be used to generate high-quality synthetic data in a fully unsupervised manner. In other words, they propose that labeled-data generation is easy with prompting, that LMs are great few-shot data generators, and that classic fine-tuning » zero/few-shot prompting.
  • At the core of their framework is a novel approach for better leveraging the powerful pretrained LMs. Specifically, inspired by the recent success of few-shot inference on GPT-3, they present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations.
  • Their method enables zero-label learning as they train task-specific models solely on the synthetic data, yet achieve results better than or comparable to strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, their approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
  • The paper illustrates a promising direction for future transfer learning research in NLP.
  • Key takeaways:
    • Old idea (from OpenAI’s GPT3 paper):
      • Treat LMs as few-shot learners.
      • Create prompts with <sample, label> pair(s).
      • Ask the model to infer the label for a new sample.
      • The emphasis is on the inference.
    • New idea (from Google’s zero-label paper):
      • Treat LMs as few-shot generators (rather than few-shot learners).
      • Create prompts with <sample, label> pair(s).
      • Ask the model to generate more for the same label.
      • The emphasis is on the labelled data generation (rather than inference).
    • Learnings:
      • Old idea created a new wave of prompt programming, i.e. no need for conventional task specific fine-tuning.
      • However, prompting can solve only lower-order tasks, e.g., classification and NLI. Even for lower-order tasks it is not practical, because you cannot build a human-in-the-loop system to continually improve the model.
      • The new idea is about generating more data and going down the conventional route.
      • This paper confirms all of the above by introducing UDG using LMs, even for complex higher-order tasks, and empirically shows that classical fine-tuning with more data works better.
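The generation-style prompt can be sketched as a simple template. This is an illustrative template, not the paper's exact prompt format: the few-shot pairs condition the LM, and the final block asks it to complete a new sample for the target label.

```python
def udg_prompt(target_label, few_shot_pairs):
    """Sketch (assumed template) of a UDG-style prompt: show a few
    <sample, label> pairs, then ask the LM to *generate* a new sample
    for the target label, instead of inferring a label."""
    blocks = ["Sample: %s\nLabel: %s" % (s, l) for s, l in few_shot_pairs]
    blocks.append("Label: %s\nSample:" % target_label)   # LM completes this
    return "\n\n".join(blocks)

prompt = udg_prompt("positive", [("great movie!", "positive"),
                                 ("terrible plot.", "negative")])
```

The LM's completions become synthetic labeled examples, which are then used for conventional task-specific fine-tuning.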
  • The diagram below from Prithvi Da summarizes the proposed approach.

Improving Language Models by Retrieving from Trillions of Tokens
  • This paper by Borgeaud et al. from DeepMind in 2021 proposes Retrieval-Enhanced Transformer (RETRO) which enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. With a 2 trillion token database, RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters.
  • After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
  • The figure below from the paper shows the Retro architecture. Left: simplified version where a sequence of length \(n = 12\) is split into \(l = 3\) chunks of size \(m = 4\). For each chunk, we retrieve \(k = 2\) neighbors of \(r = 5\) tokens each. The retrieval pathway is shown on top. Right: Details of the interactions in the CCA operator. Causality is maintained as neighbors of the first chunk only affect the last token of the first chunk and tokens from the second chunk.

  • On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. They also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.
  • RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline pre-trained transformer models can be rapidly fine-tuned (“RETROfit with retrieval”) to obtain nearly the same performance as if trained from scratch.
  • They demonstrate at an unprecedented scale that improving semi-parametric language models through explicit memory can provide an orthogonal, more efficient approach than raw parameter scaling as they seek to build more powerful language models.
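The chunking and retrieval steps can be sketched with toy data, using the figure's numbers (\(n = 12\), \(m = 4\), \(k = 2\)). The cosine-similarity lookup here is a stand-in for the frozen-BERT retriever over the real 2-trillion-token database:

```python
import numpy as np

def split_into_chunks(tokens, m):
    """Split a token sequence into contiguous size-m chunks, the
    retrieval granularity RETRO uses."""
    return [tokens[i:i + m] for i in range(0, len(tokens), m)]

def retrieve_neighbors(chunk_emb, db_embs, k):
    """Return indices of the k most cosine-similar database chunks
    (a toy stand-in for the frozen-BERT nearest-neighbor retriever)."""
    sims = db_embs @ chunk_emb / (
        np.linalg.norm(db_embs, axis=1) * np.linalg.norm(chunk_emb) + 1e-8)
    return np.argsort(-sims)[:k]

chunks = split_into_chunks(list(range(12)), 4)   # n=12, m=4 -> l=3 chunks
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 64))              # toy embedding database
neighbors = retrieve_neighbors(rng.standard_normal(64), db, k=2)
```

In the full model, the retrieved neighbor chunks are encoded and attended to via the chunked cross-attention (CCA) operator while preserving causality.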
  • Related: The Illustrated Retrieval Transformer by Jay Alammar.
WebGPT: Browser-assisted question-answering with human feedback
  • This paper by Nakano et al. from OpenAI in 2021 proposes WebGPT, which is a fine-tuned version of GPT-3 to more accurately answer open-ended questions using a text-based web browser. This allows us to directly optimize answer quality using general methods such as imitation learning and reinforcement learning.
  • Their prototype copies how humans research answers to questions online: it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy.
  • By setting up the task so that it can be performed by humans, they are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
  • They train and evaluate their models on ELI5, a dataset of questions asked by Reddit users. Their best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of their human demonstrators, and 69% of the time to the highest-voted answer from Reddit. While their best model outperforms humans on ELI5, it still struggles with out-of-distribution questions.
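The rejection-sampling (best-of-\(n\)) step can be sketched in a few lines. The policy and reward functions below are toy stand-ins, not the paper's actual models:

```python
def best_of_n(question, policy, reward_model, n=4):
    """Rejection-sampling sketch: draw n candidate answers from the
    fine-tuned policy and keep the one the preference reward model
    scores highest."""
    candidates = [policy(question, i) for i in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins (assumptions for illustration only).
answers = ["short answer", "detailed answer with cited sources", "maybe"]
policy = lambda q, i: answers[i % len(answers)]
reward = lambda a: len(a)   # toy reward: prefers longer answers
best = best_of_n("why is the sky blue?", policy, reward, n=3)
```

At inference time this trades extra compute (sampling \(n\) answers) for answer quality, with no further training of the policy.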
The Power of Scale for Parameter-Efficient Prompt Tuning
  • This paper by Lester et al. introduces a simple yet effective method called prompt tuning, which learns soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.
  • Also, prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model.
  • The authors show that prompt tuning outperforms few-shot learning by a large margin, and becomes more competitive with scale.
  • This is an interesting approach that can help to effectively use a single frozen model for multi-task serving.
  • Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pretrained model. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, their tuned prompts would only require 20,480 parameters per task (a reduction of over five orders of magnitude), assuming a prompt length of 5 tokens.
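The mechanism and the parameter count above can be sketched directly (the embedding width of 4096 for T5-XXL is an assumption used for the arithmetic):

```python
import numpy as np

def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Prompt tuning sketch: trainable soft-prompt vectors are prepended
    to the frozen model's input embeddings; only the prompt is updated
    by backpropagation."""
    return np.concatenate([soft_prompt, input_embeddings], axis=0)

prompt_len, embed_dim = 5, 4096            # 4096 assumed for T5-XXL
soft_prompt = np.zeros((prompt_len, embed_dim))   # the only trainable part
inputs = np.zeros((20, embed_dim))                # frozen input embeddings
combined = prepend_soft_prompt(soft_prompt, inputs)
print(soft_prompt.size)   # 20480 trainable parameters per task
```

Since the backbone is shared and frozen, examples from different tasks can be batched together at inference, each carrying its own 5-token prompt.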

Prefix-Tuning: Optimizing Continuous Prompts for Generation
  • Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task.
  • This paper by Li and Liang from Stanford proposes prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix).
  • Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
  • Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
  • The figure below from the paper shows that fine-tuning (top) updates all Transformer parameters (the red Transformer box) and requires storing a full model copy for each task. They propose prefix-tuning (bottom), which freezes the Transformer parameters and only optimizes the prefix (the red prefix blocks). Consequently, prefix-tuning only needs to store the prefix for each task, making it modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.

  • They apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. They find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training. A potential hypothesis is that training fewer parameters helped reduce overfitting on smaller target datasets.
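The “virtual tokens” idea can be sketched as trainable key/value vectors prepended inside attention (single head, unbatched, toy shapes; in the actual method a prefix is added at every transformer layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(Q, K, V, P_k, P_v):
    """Prefix-tuning sketch: trainable prefix key/value vectors P_k, P_v
    are prepended so every token can attend to them as virtual tokens.
    Q, K, V come from the frozen LM; only P_k and P_v are trained."""
    K_ext = np.concatenate([P_k, K], axis=0)
    V_ext = np.concatenate([P_v, V], axis=0)
    d = Q.shape[-1]
    return softmax(Q @ K_ext.T / np.sqrt(d)) @ V_ext

rng = np.random.default_rng(0)
n, d, p = 10, 8, 3                      # sequence length, dim, prefix length
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
P_k, P_v = (rng.standard_normal((p, d)) for _ in range(2))
out = attention_with_prefix(Q, K, V, P_k, P_v)
```

Unlike prompt tuning, the prefix lives in the hidden states of every layer rather than only at the input, which gives it more direct control over the frozen model's computation.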
LoRA: Low-Rank Adaptation of Large Language Models
  • An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.
  • Powerful models with billions of parameters, such as GPT-3, are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains. LoRA proposes to freeze pre-trained model weights and inject trainable layers (rank-decomposition matrices) in each transformer block. This greatly reduces the number of trainable parameters and GPU memory requirements since gradients don’t need to be computed for most model weights. The researchers found that by focusing on the Transformer attention blocks of large-language models, fine-tuning quality with LoRA was on par with full model fine-tuning while being much faster and requiring less compute.
  • This paper by Hu et al. from Microsoft in 2021 proposes Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
  • LoRA is a technique where weight updates are designed to be the product of two low-rank matrices. It was inspired by Aghajanyan et al. which showed that, when adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates \(\Delta W\) during adaption also have low intrinsic rank.
  • The figure below from the paper shows LoRA’s reparametrization. They only train \(A\) and \(B\).

  • Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000x and the GPU memory requirement by 3x. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
  • Similar to prefix tuning, they found that LoRA outperformed several baselines including full fine-tuning. Again, the hypothesis is that LoRA, thanks to its reduced rank, provides implicit regularization. In contrast, full fine-tuning, which updates all weights, could be prone to overfitting.
  • They also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA.
  • They release a package that facilitates the integration of LoRA with PyTorch models and provide their implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2.
  • Code
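The low-rank update can be sketched in a few lines (toy shapes; the zero init on \(B\) follows the paper, so training starts from the unmodified pre-trained weights):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA sketch: y = x W + alpha * x A B, where the pre-trained weight
    W is frozen and only the low-rank factors A and B are trained, so
    the weight update Delta W = A B has rank at most r."""
    return x @ W + alpha * (x @ A) @ B

d_in, d_out, r = 64, 64, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))      # frozen pre-trained weight
A = rng.standard_normal((d_in, r)) * 0.01   # small random init
B = np.zeros((r, d_out))                    # zero init: Delta W starts at 0
x = rng.standard_normal((8, d_in))
y = lora_forward(x, W, A, B)
# With B = 0, the adapted model exactly matches the frozen base model.
```

After training, \(AB\) can be merged into \(W\) once, which is why LoRA adds no inference latency compared to adapters.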
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
  • Prevailing methods for mapping large generative language models to supervised tasks may fail to sufficiently probe models’ novel capabilities. Using GPT-3 as a case study, the authors show that 0-shot prompts can significantly outperform few-shot prompts. They suggest that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning. This analysis motivates rethinking the role of prompts in controlling and evaluating powerful language models.
  • This paper by Reynolds and McDonell discusses methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language. They explore techniques for exploiting the capacity of narratives and cultural anchors to encode nuanced intentions and techniques for encouraging deconstruction of a problem into components before producing a verdict.
  • Informed by this more encompassing theory of prompt programming, they also introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks. Finally, they discuss how these more general methods of interacting with language models can be incorporated into existing and future benchmarks and practical applications.
Muppet: Massive Multi-task Representations with Pre-Finetuning
  • This paper by Aghajanyan et al. from Meta AI proposes pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning.
  • Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. Pre-finetuning is performed on around 50 classification, summarization, question answering, and common sense reasoning tasks.
  • After pre-training, they pre-finetune the model on each of the aforementioned tasks by attaching a task-specific head (an MLP) to the output of the [CLS] token for each task, and use the output of this head as the model output for that pre-finetuning task. During training, the overall loss function is a convex combination of the seven task-specific losses.
  • They show, in particular, that standard multi-tasking schemes can be unstable and often fail to learn high quality representations. However, they introduce a new training scheme which uses loss scaling and task-heterogeneous batches so that gradient steps are more evenly balanced across multiple different competing tasks, greatly improving training stability and overall performance.
  • Accumulating gradients across tasks (i.e., the concept of “heterogeneous batches”) is important since the model is trying to optimize not a single objective but several potentially competing objectives to create a unified representation across several tasks during model training. During gradient descent, moving along the gradient of a single task may not be the optimal direction for the model to move to learn a single unified representation across tasks. To overcome this, they ensure each batch their model optimizes consists of several tasks. Each worker samples a random batch from its set of tasks and computes a gradient, which is accumulated for the final update. Empirically, they use 64 GPUs for pre-finetuning, resulting in each batch consisting of gradients across 64 sampled tasks. This strategy allows their model to arrive at a better representation for end-task finetuning.
  • As pre-finetuning optimizes several different types of tasks and datasets, each having its own output spaces, loss scaling becomes essential to ensure stable training. They attempted various forms of loss-scaling throughout initial experimentation, but the most effective was the novel method as follows.
    • Let us denote \(\mathcal{L}_i\left(x_i, y_i ; \theta\right)\) as the loss for datapoint \(i\) for a model parameterized by \(\theta\). Remember that the loss depends on the type of task (the commonsense loss is different from binary classification). Furthermore, let \(n: \mathbb{N} \rightarrow \mathbb{N}\) be a function which for each data-point returns the number of predictions \(\mathcal{L}\) operates over. For example, for binary classification, \(n\) would return two, while for generation, \(n\) would return the size of the vocabulary (since we average across the loss per token generated). They scale each data-point loss so that, if the class distribution were uniformly distributed along with their models’ predictions, all of their losses would have equivalent values.
    \[\mathcal{L}_i^{\text {scaled }}\left(x_i, y_i ; \theta\right)=\frac{\mathcal{L}_i\left(x_i, y_i ; \theta\right)}{\log n(i)}\]
    • They found that this static scaling worked incredibly well, outperforming other loss scaling methods in early experimentation.
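The scaling rule above is easy to verify numerically: a uniform predictor over \(n(i)\) classes has cross-entropy \(\log n(i)\), so after dividing by \(\log n(i)\) every task's “chance-level” loss equals 1, regardless of output-space size.

```python
import math

def scaled_loss(loss, num_predictions):
    """Muppet's static loss scaling: divide the data-point loss by the
    log of its output-space size so losses are comparable across tasks."""
    return loss / math.log(num_predictions)

# Chance-level cross-entropy is log(n); after scaling it is 1.0 for every
# task, whether binary classification (n=2) or generation (n=vocab size).
for n in (2, 10, 50_000):
    print(scaled_loss(math.log(n), n))
```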
  • They show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, MRC, etc.), while also significantly improving sample efficiency during fine-tuning. They also show that large-scale multi-tasking is crucial: pre-finetuning can hurt performance when few tasks are used, up until a critical point (usually above 15), after which performance improves linearly in the number of tasks.
  • The figure below from the paper shows a plot of RoBERTa’s evaluation accuracy on five datasets: RTE, BoolQ, RACE, SQuAD, and MNLI, across various scales of multi-task learning measured in the number of datasets. They notice that performance initially degrades until a critical point is reached in the number of datasets used by the MTL framework, for all but one dataset. Past this critical point, pre-finetuning improves over the original RoBERTa model.

Synthesizer: Rethinking Self-Attention in Transformer Models
  • The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required?
  • This paper by Tay et al. from Google in ICML 2021 investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models.
  • Via extensive experiments, they find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all.
  • To this end, they propose Synthesizer, a model that learns synthetic attention weights without token-token interactions.
  • The figure below from the paper shows the proposed Synthesizer model architecture.

  • In their experiments, they first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks.
  • When composed with dot product attention, they find that Synthesizers consistently outperform Transformers. Moreover, they conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that the simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, they show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
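Two of the Synthesizer variants can be sketched directly (single head, unbatched, toy shapes): the Random variant learns the attention logits as free parameters, and the Dense variant predicts each token's row of logits from that token alone, so neither computes query-key dot products.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def random_synthesizer(V, R):
    """Random Synthesizer sketch: the (n, n) attention logits R are
    trainable parameters, independent of the input tokens."""
    return softmax(R) @ V

def dense_synthesizer(X, W1, b1, W2, b2, V):
    """Dense Synthesizer sketch: each token predicts its own row of
    attention logits with a two-layer MLP (no token-token products)."""
    logits = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # (n, n)
    return softmax(logits) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
R = rng.standard_normal((n, n))                 # trainable, input-independent
W1, b1 = rng.standard_normal((d, d)), np.zeros(d)
W2, b2 = rng.standard_normal((d, n)), np.zeros(n)
out_r = random_synthesizer(V, R)
out_d = dense_synthesizer(X, W1, b1, W2, b2, V)
```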
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
  • This paper by Wang et al. from Salesforce Research Asia and NTU Singapore in EMNLP 2021 proposes CodeT5, a pre-trained encoder-decoder model for code understanding and generation tasks.
  • It builds on the T5 architecture and proposes two novel pre-training objectives:
    • Identifier tagging: Predict whether a token is an identifier.
    • Masked identifier prediction: Recover masked identifiers using sentinel tokens.
  • These objectives enable CodeT5 to leverage the semantic information from identifiers.
  • It also proposes a bimodal dual generation task using code-comment pairs to improve Natural Languages (NL)-Programming Languages (PL) alignment.
  • CodeT5 significantly outperforms prior work on 6 tasks across 14 datasets in CodeXGLUE.
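The masked identifier prediction objective can be sketched as follows. This is an illustrative sketch (the sentinel naming follows T5's `<extra_id_*>` convention; the exact target format is an assumption): every occurrence of the same identifier is replaced by the same sentinel, and the target recovers the mapping.

```python
def mask_identifiers(tokens, identifiers):
    """Sketch of CodeT5-style masked identifier prediction: replace each
    unique identifier with one sentinel token (all occurrences share the
    sentinel) and build the target that recovers the identifiers."""
    sentinel_of = {}
    masked = []
    for t in tokens:
        if t in identifiers:
            if t not in sentinel_of:
                sentinel_of[t] = "<extra_id_%d>" % len(sentinel_of)
            masked.append(sentinel_of[t])
        else:
            masked.append(t)
    target = " ".join("%s %s" % (s, ident) for ident, s in sentinel_of.items())
    return masked, target

tokens = "def add ( a , b ) : return a + b".split()
masked, target = mask_identifiers(tokens, {"add", "a", "b"})
```

Because all occurrences of an identifier share one sentinel, the task resembles deobfuscation and pushes the model to rely on the surrounding code semantics rather than surface repetition.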
  • The figure below from the paper illustrates CodeT5 for code-related understanding and generation tasks.

  • The figure below from the paper illustrates the pre-training tasks of CodeT5. They first alternately train span prediction, identifier prediction, and identifier tagging on both unimodal and bimodal data, and then leverage the bimodal data for dual generation training.

  • Ablations show the identifier-aware pre-training and bimodal dual generation are effective. Identifier masking helps capture semantics while span masking focuses on syntax.
  • Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL, allowing for multi-task learning.
  • Overall, CodeT5 incorporates code structure via a novel identifier-aware pre-training and demonstrates strong performance on a diverse set of code intelligence tasks.
  • Code.
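  • The two identifier-aware objectives can be sketched in a few lines of Python; the token stream, identifier flags, and T5-style `<extra_id_k>` sentinels below are illustrative assumptions, not the paper's preprocessing code.

```python
# Toy token stream with identifier flags (1 = identifier); predicting
# these flags is the identifier-tagging objective.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
is_ident = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]

# Masked identifier prediction: replace identifiers with sentinel tokens
# (the same identifier always gets the same sentinel) and train the
# decoder to recover the sentinel-to-identifier mapping.
sentinel_of = {}
masked, target = [], []
for tok, flag in zip(tokens, is_ident):
    if flag:
        if tok not in sentinel_of:
            sentinel_of[tok] = f"<extra_id_{len(sentinel_of)}>"
            target += [sentinel_of[tok], tok]
        masked.append(sentinel_of[tok])
    else:
        masked.append(tok)
```

Reusing one sentinel per unique identifier is what forces the model to track identifier semantics rather than surface positions.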
Extracting Training Data from Large Language Models
  • This paper by Carlini et al. from Google, Stanford, UC Berkeley, Northeastern University, OpenAI, Harvard, and Apple, examines the vulnerability of large language models (LMs) like GPT-2 to training data extraction attacks. These attacks can retrieve individual training examples, including sensitive and personal information, by querying the language model.
  • The core of the paper revolves around a methodology for extracting verbatim sequences from an LM’s training set, even in cases where these examples are not evidently distinguishable from test examples in terms of loss. This approach involves generating a diverse set of high-likelihood samples from the model and then sorting each sample using various metrics to estimate the likelihood of each sample being a memorized instance.
  • For their experiments, the authors focused on the GPT-2 model, utilizing its different sizes (with the largest having 1.5 billion parameters). The model was trained on data scraped from the public Internet, specifically from websites linked on Reddit, amounting to about 40GB of text data.
  • The researchers began with a baseline attack that generates text using a simple method and then infers which outputs contain memorized content based on the model’s confidence in the generated sequences. They identified weaknesses in this initial approach, such as low diversity of outputs and a high rate of false positives.
  • To improve the attack, they employed better sampling methods, including sampling with a decaying temperature to increase output diversity and reduce the likelihood of stepping off memorized sequences. This improved approach involved generating 200,000 samples using three different strategies: top-n sampling, temperature-based sampling, and Internet-conditioned sampling.
  • The evaluation of these methods revealed that 604 unique memorized training examples were identified from 1800 candidates, representing a 33.5% aggregate true positive rate, with the best variant achieving a 67% rate. The memorized content varied, including canonical text from news, forums, wikis, religious texts, as well as unique data like 128-bit UUIDs, resolving URLs, and personal information.
  • The figure below from the paper illustrates the extraction attack. Given query access to a neural network language model, they extract an individual person’s name, email address, phone number, fax number, and physical address. The example in this figure shows information that is all accurate so they’ve redacted it to protect privacy.

  • The research highlighted that the most effective way to identify memorized content was sampling conditioned on Internet text, although all generation schemes revealed significant memorized content. The comparison-based metrics for membership inference were more effective than directly looking at LM perplexity.
  • A notable finding was the extraction of personally identifiable information (PII), with examples containing individual names, phone numbers, addresses, and social media accounts, indicating a considerable privacy risk associated with large LMs trained on public data.
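  • One of the comparison-based membership metrics, comparing zlib compression entropy against the model's perplexity, can be sketched as follows; the `lm_nll_per_token` values here are hypothetical stand-ins for the attacked model's actual likelihoods.

```python
import math
import zlib

def zlib_bits(text):
    # Bits after zlib compression: a cheap reference "model" of how
    # surprising the text is, independent of the attacked LM.
    return 8 * len(zlib.compress(text.encode()))

def membership_score(text, lm_nll_per_token):
    # A sample the LM finds far easier than the compressor suggests
    # (low perplexity, high zlib entropy) is a memorization candidate.
    return zlib_bits(text) / math.exp(lm_nll_per_token)

sample = "John Doe, 555-0100, 42 Example Street"
likely = membership_score(sample, lm_nll_per_token=0.3)   # LM very confident
unlikely = membership_score(sample, lm_nll_per_token=3.0)  # LM unsure
```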
Large Dual Encoders Are Generalizable Retrievers
  • This paper by Ni et al. from Google Research presents a study on the scalability of dual encoder models for retrieval tasks.
  • The team challenges the belief that the simple dot-product bottleneck in dual encoders limits out-of-domain generalization. They scale up the model size of dual encoders while keeping the bottleneck embedding size fixed.
  • Their approach, Generalizable T5-based dense Retrievers (GTR), significantly outperforms existing sparse and dense retrievers on the BEIR dataset for a variety of retrieval tasks, especially in out-of-domain generalization.
  • GTR utilizes dual encoders by leveraging the encoder part of T5. For effectively using the power of large models, they collect roughly two billion community question-answer pairs as generic pre-training data. By combining pre-training using generic training data and fine-tuning using MS Marco, they are able to train large-scale dual encoder retrieval models.
  • A major finding is that GTR models are data-efficient, requiring only 10% of MS Marco supervised data for optimal out-of-domain performance. The study includes a multi-stage training approach using a combination of web-mined corpus and high-quality search datasets for pre-training and fine-tuning respectively.
  • The figure below from the paper shows the architecture of the Generalizable T5-based dense Retrievers. The research question asked is: can scaling up dual encoder model size improve retrieval performance while keeping the bottleneck layers fixed? Only the encoder is taken from the pre-trained T5 models, and the question tower and document tower of the dual encoder share parameters.

  • Results show that scaling up model size improves retrieval performance across all evaluated tasks, suggesting that larger models can better capture semantic nuances for effective retrieval.
  • The paper also discusses the data efficiency of large-scale models, demonstrating that GTR models can achieve comparable or superior performance with reduced training data.
  • Additional insights are provided through ablation studies, which highlight the importance of both the model scale and the nature of the training dataset.
  • The paper presents a significant advancement in the field of information retrieval, showcasing the potential of large-scale dual encoder models in improving generalizability and efficiency in retrieval tasks.
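  • The dual-encoder scoring scheme with its fixed-size dot-product bottleneck can be sketched as below; the random-projection "encoder" with mean pooling is a toy stand-in for the shared-parameter T5 encoder tower.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 32, 16

# Toy shared "encoder": embedding lookup + mean pooling into a
# fixed-size bottleneck embedding, used for both questions and documents.
W = rng.normal(size=(vocab, d))

def encode(token_ids):
    e = W[np.array(token_ids)].mean(axis=0)
    return e / np.linalg.norm(e)   # unit-normalized bottleneck embedding

query = encode([3, 7, 11])
docs = [encode([3, 7, 11]), encode([20, 25, 9])]
scores = [float(query @ doc) for doc in docs]  # dot-product retrieval
best = int(np.argmax(scores))
```

The paper's finding is that keeping this bottleneck fixed while scaling the encoder still improves out-of-domain retrieval.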
Text Generation by Learning from Demonstrations
  • This paper by Pang and He from NYU in ICLR 2021, introduces Generation by Off-policy Learning from Demonstrations (GOLD), an approach to text generation using offline reinforcement learning (RL) with expert demonstrations.
  • GOLD addresses issues in text generation caused by the mismatch between training objectives (maximum likelihood estimation) and evaluation metrics (quality), as well as exposure bias (discrepancy between training and inference).
  • The proposed algorithm upweights confident tokens and downweights less confident ones in references during training, which helps to avoid optimization issues faced by previous RL methods reliant on online data collection.
  • The authors validate GOLD’s effectiveness in various tasks including news summarization, question generation, and machine translation, showing it outperforms traditional maximum likelihood estimation (MLE) and policy gradient methods in both automatic and human evaluations.
  • GOLD-trained models demonstrate less sensitivity to decoding algorithms and alleviate exposure bias issues, maintaining output quality across varying generation lengths.
  • The paper also discusses the algorithm’s ability to focus on “easy” tokens for learning, leading to high-precision models at the expense of lower recall.
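  • The token-weighting idea can be sketched as follows; this is a simplified per-token approximation of GOLD's importance-weighted objective, not the paper's exact loss.

```python
import numpy as np

def mle_loss(log_probs):
    # Standard maximum-likelihood loss: every reference token counts equally.
    return -np.mean(log_probs)

def gold_loss(log_probs):
    # GOLD-style weighting (simplified): each reference token's loss is
    # scaled by the model's own (detached) probability of that token,
    # upweighting confident tokens and downweighting unconfident ones.
    weights = np.exp(log_probs)    # pi(y_t), treated as constants
    return -np.mean(weights * log_probs)

# Hypothetical per-token log-probs: two confident tokens, one the model doubts.
lp = np.log(np.array([0.9, 0.8, 0.05]))
```

Under MLE the doubtful token dominates the loss; under the weighted loss its influence is heavily damped, which is the high-precision/lower-recall trade-off the paper discusses.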
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
  • This paper by Izacard and Grave from Facebook AI Research, ENS, PSL University, and Inria, explores the integration of passage retrieval with generative models to improve open domain question answering (QA).
  • The paper introduces a novel approach, combining passage retrieval and generative sequence-to-sequence models, significantly enhancing performance on benchmarks like Natural Questions and TriviaQA.
  • The method involves two key steps: firstly, retrieving text passages that potentially contain evidence using techniques like BM25 and Dense Passage Retrieval (DPR), and secondly, processing these passages with a generative model.
  • The figure below from the paper shows a simple approach to open domain question answering. First, it retrieves support text passages from an external source of knowledge such as Wikipedia. Then, a generative encoder-decoder model produces the answer, conditioned on the question and the retrieved passages. This approach scales well with the number of retrieved passages, as the performance keeps improving when retrieving up to one hundred passages.

  • The figure below from the paper shows the architecture of the Fusion-in-Decoder method.

  • A key finding is the positive correlation between the number of retrieved passages and performance improvement, indicating the generative model’s effectiveness in aggregating and combining evidence from multiple sources.
  • The generative model is based on a sequence-to-sequence network, such as T5 or BART, pretrained on unsupervised data. It processes each retrieved passage independently in the encoder but jointly in the decoder, a method referred to as Fusion-in-Decoder.
  • Technically, the model utilizes special tokens for question, title, and context to process the passages. The approach differs from other models by independently processing passages in the encoder, allowing scalability with a large number of contexts, while jointly processing in the decoder enhances evidence aggregation.
  • Empirically, the paper demonstrates that the Fusion-in-Decoder model outperforms existing methods, particularly in scenarios requiring the aggregation of evidence from multiple passages.
  • For implementation, the authors used pretrained T5 models, varying in size (base and large), and fine-tuned them on each dataset independently using Adam optimization, with specific training configurations like a constant learning rate, dropout rate, and batch size.
  • The approach shows significant performance gains in accuracy compared to traditional methods, especially when processing a large number of passages, indicating the potential of generative models combined with passage retrieval in open domain QA.
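  • A minimal sketch of the Fusion-in-Decoder encoding step, with a toy encoder standing in for T5; only the independent-encode / joint-concatenate structure is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def toy_encoder(text):
    # Stand-in for the T5 encoder: one random hidden state per
    # whitespace-separated token.
    return rng.normal(size=(len(text.split()), d))

question = "who wrote hamlet"
passages = [("Hamlet", "a tragedy by William Shakespeare"),
            ("Macbeth", "another Shakespeare play")]

# Each passage is encoded INDEPENDENTLY, prefixed with the special
# question/title/context tokens...
encoded = [toy_encoder(f"question: {question} title: {title} context: {ctx}")
           for title, ctx in passages]

# ...then all encoder states are concatenated, so the decoder attends
# over every passage JOINTLY (the "fusion in the decoder").
fused = np.concatenate(encoded, axis=0)
```

Because encoding cost is linear in the number of passages while only the decoder sees them jointly, the approach scales to one hundred retrieved passages.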
A General Language Assistant as a Laboratory for Alignment
  • This paper by Askell et al. from Anthropic introduces a comprehensive study towards aligning general-purpose, text-based AI systems with human values, focusing on making AI helpful, honest, and harmless (HHH). Given the capabilities of large language models, the authors investigate various alignment techniques and their evaluations to ensure these models adhere to human preferences without compromising performance.
  • The authors begin by examining naive prompting as a baseline for alignment, finding that the benefits from such interventions increase with model size and generalize across multiple alignment evaluations. Prompting was shown to impose negligible performance costs (‘alignment taxes’) on large models. The paper also explores the scaling trends of several training objectives relevant to alignment, including imitation learning, binary discrimination, and ranked preference modeling. The results indicate that ranked preference modeling significantly outperforms imitation learning and scales more favorably with model size, while binary discrimination performs similarly to imitation learning.
  • A key innovation discussed is ‘preference model pre-training’ (PMP), which aims to improve the sample efficiency of fine-tuning models on human preferences. This involves pre-training on large public datasets that encode human preferences, such as Stack Exchange, Reddit, and Wikipedia edits, before fine-tuning on smaller, more specific datasets. The findings suggest that PMP substantially enhances sample efficiency and often improves asymptotic performance when fine-tuning on human feedback datasets.
  • Implementation Details:
    • Prompts and Context Distillation: The authors utilize a prompt composed of 14 fictional conversations to induce the HHH criteria in models. They introduce ‘context distillation,’ a method where the model is fine-tuned using the KL divergence between the model’s predictions and the distribution conditioned on the prompt context. This technique effectively transfers the prompt’s conditioning into the model.
    • Training Objectives:
      • Imitation Learning: Models are trained to imitate ‘good’ behavior using supervised learning on sequences labeled as correct or desirable.
      • Binary Discrimination: Models distinguish between ‘correct’ and ‘incorrect’ behavior by training on pairs of correct and incorrect samples.
      • Ranked Preference Modeling: Models are trained to assign higher scores to better samples from ranked datasets using pairwise comparisons, a more complex but effective approach for capturing preferences.
    • Preference Model Pre-Training (PMP): The training pipeline includes a PMP stage where models are pre-trained on binary discriminations sourced from Stack Exchange, Reddit, and Wikipedia edits. This stage significantly enhances sample efficiency during subsequent fine-tuning on smaller datasets.
  • Results:
    • Prompting: Simple prompting significantly improves model performance on alignment evaluations, including HHH criteria and toxicity reduction. Prompting and context distillation both decrease toxicity in generated text as model size increases.
    • Scaling Trends: Ranked preference modeling outperforms imitation learning, especially on tasks with ranked data like summarization and HellaSwag. Binary discrimination shows little improvement over imitation learning.
    • Sample Efficiency: PMP dramatically increases the sample efficiency of fine-tuning, with larger models benefiting more from PMP than smaller ones. Binary discrimination during PMP is found to transfer better than ranked preference modeling.
  • The figure below from the paper shows: (Left) Simple prompting significantly improves performance and scaling on our HHH alignment evaluations (y-axis measures accuracy at choosing better responses on our HHH evaluations). (Right) Prompts impose little or no ‘alignment tax’ on large models, even on complex evaluations like function synthesis. Here we have evaluated our python code models on the HumanEval codex dataset at temperature T = 0.6 and top P = 0.95.

  • The study demonstrates that simple alignment techniques like prompting can lead to meaningful improvements in AI behavior, while more sophisticated methods like preference modeling and PMP offer scalable and efficient solutions for aligning large language models with human values.
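  • The ranked preference modeling objective can be sketched as a pairwise log-sigmoid loss over scalar scores; this is a common formulation, and the paper's exact details may differ.

```python
import math

def preference_loss(score_better, score_worse):
    # Pairwise ranked-preference objective: push the preference model to
    # assign a higher scalar score to the better sample. The loss is
    # -log(sigmoid(score_better - score_worse)).
    margin = score_better - score_worse
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; it shrinks as the model separates the pair correctly, which is how ranked comparisons are reduced to a simple scalar training signal.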


Formal Mathematics Statement Curriculum Learning
  • This paper by Polu et al. from OpenAI in 2022 proposes a neural theorem prover using GPT-f that can successfully solve a curriculum of increasingly difficult problems out of a set of formal statements of sufficiently varied difficulty, including many high-school Math Olympiad problems. The prover uses a language model to find proofs of formal statements.
  • They explore the use of expert iteration in the context of language modeling applied to formal mathematics. They show that at the same compute budget, expert iteration (by which they mean proof search interleaved with learning) dramatically outperforms proof search only. They also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.
  • Finally, by applying this expert iteration to a manually curated set of problem statements, they achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
  • Their results suggest that the lack of self-play in the formal mathematics setup can be effectively compensated for by automatically as well as manually curated sets of formal statements, which are much cheaper to formalize than full proofs. The statement curriculum learning methodology presented in this work can help accelerate progress in automated reasoning, especially if scaled with automated generation and curation of formal statements in the future.
  • OpenAI link.
Survey of Hallucination in Natural Language Generation
  • While natural language generation (NLG) has improved rapidly in recent years thanks to the development of deep learning technologies such as Transformer-based language models, NLG based on large language models (LLMs) often produces false statements disconnected from reality, because such models are not grounded in reality and lack the commonsense built from experiencing the real world. Such hallucinated text causes the performance of text generation to fall short of user expectations in many real-world scenarios.
  • This paper by Ji et al. from Pascale Fung’s group at Hong Kong University of Science and Technology in 2022 reviews studies in evaluation and mitigation methods of hallucinations that have been presented in various tasks.
  • They provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation.
Transformer Quality in Linear Time
  • This paper by Hua et al. from Cornell University and Google Brain in 2022 revisits the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences by presenting FLASH - a novel efficient modification of Transformer architecture. This is achieved by designing a performant layer (gated linear unit) and by combining it with an accelerator-efficient approximation strategy (mixed chunk attention).
  • Existing efficient attention methods often cause a significant quality drop compared to full self-attention, and at the same time can be difficult to implement in a way that fully leverages hardware accelerators. The authors introduce GAU (gated attention unit; a generalization of GLU - the gated linear unit), which allows for a better and more efficient approximation of multi-head attention than many other efficient attention methods by using a weaker single-head attention with minimal quality loss.
  • Next, complementary to this new layer, they propose mixed chunk attention - an efficient linear approximation method that combines the benefits from partial and linear attention mechanisms, which is accelerator-friendly and highly competitive in quality. The method works on chunks of tokens and leverages local (within chunk) and global (between chunks) attention spans.
  • The resulting model, named FLASH, when deployed on bidirectional and auto-regressive language modeling tasks, outperforms three baselines: vanilla Transformer, Performer and Combiner in terms of quality and efficiency. FLASH matches the quality (perplexity) of fully-augmented Transformers over both short (512) and long (8K) context lengths, while being substantially faster to train than the state-of-the-art - achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling. The differences are particularly pronounced for larger context sizes (4096-8192).
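  • A toy numpy sketch of the gated attention unit's structure (per-dim scale/offset terms and normalization omitted; the single-head relu² attention and the gating follow the paper's description, while all dimensions and initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, e, s = 8, 16, 32, 8   # toy seq len, model dim, expanded dim, head dim
X = rng.normal(size=(n, d))

silu = lambda x: x / (1 + np.exp(-x))

U = silu(X @ rng.normal(size=(d, e)))        # gate branch
V = silu(X @ rng.normal(size=(d, e)))        # value branch
Z = silu(X @ rng.normal(size=(d, s)))        # shared low-dim projection
Q, K = Z, Z                                  # cheap queries/keys from Z
A = np.maximum(Q @ K.T / n, 0.0) ** 2        # single-head relu^2 attention
O = (U * (A @ V)) @ rng.normal(size=(e, d))  # gating, then output projection
```

The gating (elementwise product with U) is what lets GAU get away with a single weak attention head instead of multi-head softmax attention.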
Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as arithmetic reasoning, math word problems, symbolic manipulation, and commonsense reasoning.
  • This paper by Wei et al. from Google in 2022 explores the ability of language models to generate a coherent chain of thought – a series of short sentences that mimic the reasoning process a person might have when responding to a question. The idea is strikingly simple: instead of being terse while prompting, show the model a few examples of a multi-step reasoning process (the like of which a human would use). Couple this with LLMs (the larger the better) and magic happens! Check out the below image from the paper.

  • They have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. The superb results you can elicit via this method are an emergent property of model scale (surprise surprise) - bigger models benefit more from this, and the conclusion holds across models (LaMDA, GPT, PaLM).
  • Interestingly enough, the more complex the task of interest is (in the sense of requiring multi-step reasoning approach), the bigger the boost from the chain of thought prompting!
  • In order to make sure that the performance boost comes from this multi-step approach and not simply because of e.g. more compute, the authors have done a couple of ablations: (i) outputting a terse equation instead of a multi-step reasoning description, (ii) outputting the answer and only then the chain of thought, etc. None of these experiments yielded good results.
  • The method also proved to be fairly robust (always outperforms standard prompting) to the choice of exact few shot exemplars. Despite different annotators, different styles, etc. the method is always better than standard prompting.
  • Through experiments on arithmetic, symbolic, and commonsense reasoning, they find that chain of thought processing is an emergent property of model scale that can be induced via prompting and can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
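  • A chain-of-thought few-shot prompt looks like the sketch below; the worked exemplar is the paper's well-known tennis-ball example, while the exact formatting is an assumption.

```python
# One worked exemplar with an explicit reasoning chain, followed by the
# test question the model should continue from.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
```

Standard prompting would place only the final answer after the exemplar's "A:"; chain-of-thought prompting supplies the intermediate reasoning, which the model then imitates.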
PaLM: Scaling Language Modeling with Pathways
  • This paper by Chowdhery et al. from Google in 2022 introduces Pathways Language Model (PaLM), a single 540 billion parameter dense Transformer language model, trained on 780B tokens of high-quality, diverse text, that generalizes across domains and tasks while being highly efficient. PaLM pushes the boundaries of scale for few-shot language understanding and generation.
  • Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application.
  • To further their understanding of the impact of scale on few-shot learning, they trained a 540-billion parameter, densely activated, Transformer language model, which they call Pathways Language Model PaLM. They trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. They demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
  • On a number of these tasks, PaLM 540B achieves breakthrough few-shot performance on language, reasoning, and code tasks, achieving state-of-the-art results on 28 out of the 29 most widely evaluated English NLP tasks when compared to the best finetuned per-task result from any previous large language model. Their evaluation suite consists of multi-step reasoning tasks, and comparisons to average human performance on the recently released BIG-bench benchmark.
  • Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference. Their few-shot results match or exceed the finetuned state of the art across a number of different arithmetic and commonsense reasoning tasks. The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting, where the model is explicitly prompted to generate a natural language logical inference chain before making its prediction. They present a number of intriguing examples where PaLM was able to write explicit logical inference chains to both explain jokes and answer complex questions about scenarios. On BIG-bench, a recently developed benchmark containing 150+ challenging new language tasks, PaLM 5-shot achieves higher performance than the average performance score of humans who were asked to complete the same tasks. Additional state-of-the-art performance is demonstrated on source code understanding/generation, multilingual NLP, and machine translation.
  • From these results, they draw a number of conclusions.
    • First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued. When they compare results from PaLM 540B to their own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that they have not yet reached the apex of the scaling curve. However, a number of BIG-bench tasks showed discontinuous improvements from model scale: the improvements from 8B to 62B are very modest, but then increase steeply when scaling to 540B. This suggests that certain capabilities of language models only emerge when trained at sufficient scale, and there are additional capabilities that could emerge from future generations of models.
    • Second, the breakthrough performance on reasoning tasks has critical implications. It is obvious that a model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction. However, these results go far beyond that, demonstrating that prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves. In other words, the model’s generation (rather than just understanding) capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
  • Finally, although they achieved their goal of further pushing the boundaries of scale for few-shot language modeling, there are still many open questions about the ideal network architecture and training scheme for future generations of models. PaLM is only the first step in their vision towards establishing Pathways as the future of ML scaling at Google and beyond. To that end, they chose to demonstrate this scaling capability on a well-studied, well-established recipe: a dense, decoder-only, full-attention Transformer model, which is trained to perform autoregressive language modeling. However, their wider goal is to explore a diverse array of novel architectural choices and training schemes, and combine the most promising systems with the extreme scaling capabilities of Pathways.
  • They believe that PaLM demonstrates a strong foundation in their ultimate goal of developing a large-scale, modularized system that will have broad generalization capabilities across multiple modalities.
  • They additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale.
  • Finally, they discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
  • Google AI blog.

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
  • When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models.
  • This paper by Lu et al. in 2022 demonstrates that few-shot prompts suffer from order sensitivity, in that for the same prompt the order in which samples are provided can make the difference between state-of-the-art and random performance – essentially some permutations are “fantastic” and some not.
  • They analyze this phenomenon in detail, establishing that the problem is prevalent across tasks, model sizes (even the largest current models), prompt templates, and numbers of training samples; that it is not tied to a specific subset of samples; and that a good permutation for one model does not transfer to another.
  • While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, to alleviate this problem, they introduce a novel probing method that exploits the generative nature of language models to construct an artificial development set. They identify performant permutations for prompts using entropy-based statistics over this set, which yields a 13% relative improvement for GPT-family models across eleven established text classification tasks.
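  • The entropy-based selection idea can be sketched as follows: score each candidate ordering by the entropy of its predicted-label distribution over the probing set, and discard degenerate orderings that collapse onto a single label. This is a GlobalE-style statistic in simplified form; the labels below are illustrative.

```python
import numpy as np

def global_entropy(pred_labels, n_classes):
    # Entropy of the distribution of predicted labels over the probing
    # set: orderings that push the model toward one label score low.
    counts = np.bincount(pred_labels, minlength=n_classes)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

balanced = global_entropy(np.array([0, 1, 0, 1]), 2)    # healthy ordering
degenerate = global_entropy(np.array([1, 1, 1, 1]), 2)  # collapsed ordering
```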
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that they understand the present and near-future capabilities and limitations of language models.
  • This paper by Srivastava et al. from Google in 2022 addresses this challenge by introducing the Beyond the Imitation Game benchmark (BIG-bench), a benchmark that can measure progress well beyond the current state-of-the-art. BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions.
  • Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. They evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters.
  • In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
  • Code.
Training Compute-Optimal Large Language Models
  • Previous work in training LLMs offered a heuristic that given a 10x increase in computational budget, model size should increase 5.5x, and the number of tokens should only increase 1.8x.
  • This paper by Hoffmann et al. from DeepMind in 2022 challenges that assumption and shows that model size and data size should be scaled in tandem! Thus collecting high-quality datasets will play a key role in further scaling of LLMs. They investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
  • They find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
  • The following image from the paper overlays the predictions from three different approaches, along with projections from Kaplan et al. (2020). They find that all three methods predict that current large models should be substantially smaller and therefore trained much longer than is currently done.

  • They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data.
  • Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
  • This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
  • In particular, they propose that 10x more compute should be spent on a 3.2x larger model and 3.2x more tokens (vs. OpenAI’s Scaling Laws paper, which suggests 10x more compute should be spent on a 5.5x larger model and 1.8x more tokens).
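As a rough illustration, the compute-optimal split can be sketched using the common approximation C ≈ 6ND (training FLOPs ≈ 6 × parameters × tokens) with a fixed tokens-per-parameter ratio; the ratio of 20 below is a rule-of-thumb reading of Chinchilla itself (70B parameters, 1.4T tokens), not a value the paper prescribes exactly:

```python
import math

def compute_optimal(budget_flops, tokens_per_param=20.0):
    """Split a FLOPs budget C ~= 6*N*D into parameters N and tokens D,
    scaling both equally (D/N held fixed). tokens_per_param ~= 20 is a
    rule-of-thumb ratio, not an exact number from the paper."""
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 10x larger budget gives ~3.16x (sqrt(10)) more params AND more tokens,
# in contrast to Kaplan et al.'s 5.5x params / 1.8x tokens split.
n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(1e22)
print(n2 / n1, d2 / d1)  # → ~3.16 for both
```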
Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
  • The recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, each new large language model pushes forward the state-of-the-art performance on natural language tasks. Alongside these natural language abilities, there has been significant interest in understanding whether such models, trained on enormous amounts of data, exhibit reasoning capabilities, and hence in developing benchmarks for various reasoning tasks; preliminary results from testing LLMs on such benchmarks seem mostly positive. However, the current benchmarks are relatively simplistic, and performance on them cannot be used as evidence to support the often-outlandish claims made about LLMs’ reasoning capabilities. As of right now, these benchmarks represent only a very limited set of simple reasoning tasks, and more sophisticated reasoning problems are needed to measure the true limits of such LLM-based systems.
  • This paper by Valmeekam et al. from ASU in 2022 proposes an extensible assessment framework motivated by the above gaps in current benchmarks to test the abilities of LLMs on a central aspect of human intelligence, which is reasoning about actions and change.
  • They provide multiple test cases that are more involved than any of the previously established reasoning benchmarks, and each test case evaluates a certain aspect of reasoning about actions and change. Their initial results show that even on simple common-sense planning tasks, the base version of GPT-3 (Davinci) displays dismal performance.
OPT: Open Pre-trained Transformer Language Models
  • Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study.
  • This paper by Zhang et al. from Facebook AI introduces Open Pre-trained Transformers (OPT), a collection of auto-regressive/decoder-only pre-trained transformer-based language models ranging in size from 125M to 175B parameters, which they aim to fully and responsibly share with interested researchers.
  • Their goal is to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency.
  • They show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. They also release their logbook detailing the infrastructure challenges they faced, along with code for experimenting with all of the released models.
  • They believe that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.
  • Code.
Diffusion-LM Improves Controllable Text Generation
  • The following paper summary has been contributed by Zhibo Zhang.
  • Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure).
  • This paper by Li et al. from Stanford in NeurIPS 2022 seeks to address this challenge by introducing Diffusion-LM, a novel non-autoregressive language model based on continuous diffusion for controllable text generation, which enables new forms of complex fine-grained control tasks.
  • Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks.
  • Considering the discrete nature of text, the authors add an extra step on top of the Markov chain of standard diffusion models. As shown in the illustration figure, in the forward diffusion process, this extra step (embedding) is responsible for converting text into numerical embeddings. In the reverse process, this extra step (rounding) maps the continuous vectors back into text.
  • In order for the model to generate a vector that closely aligns with a word embedding in the reverse process, the authors did a re-parameterization such that the model directly predicts the word embedding state of the Markov chain at each term of the loss function.
  • In order to make the text generation process controllable, under a particular control objective, the conditional inference at each state of the Markov chain is decomposed into two parts:
    • The Markov transition probability between the latent variables of two consecutive time steps, which is used as fluency regularization.
    • The probability of the control objective given the latent variable of the current time step, which is used for controlling the text generation.
  • Empirically, the authors validated that Diffusion-LM significantly outperforms prior work, almost doubling the control success rate of the PPLM (Dathathri et al., 2020) and FUDGE (Yang et al., 2021) baselines on fine-grained control tasks including Semantic Content, Parts-of-speech, Syntax Tree, Syntax Spans, and Length.

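The gradient-based control described above can be sketched as a toy update on a continuous latent: at each reverse-diffusion step, the latent is nudged by the gradient of a fluency term (staying near the denoiser's prediction) plus a control term. The quadratic (Gaussian) surrogates and all names below are illustrative, not the paper's implementation:

```python
import numpy as np

def guided_step(x, denoiser_mean, control_target, step_size=0.1, fluency_weight=1.0):
    """One sketch of a control-guided reverse-diffusion update:
    x <- x + lr * grad[ log p(x | x_t) + log p(control | x) ],
    with both log-probabilities approximated as quadratics (Gaussians)."""
    grad_fluency = -(x - denoiser_mean)    # pulls toward the denoiser's mean
    grad_control = -(x - control_target)   # pulls toward the control objective
    return x + step_size * (fluency_weight * grad_fluency + grad_control)

x = np.zeros(4)
mean = np.ones(4)            # stand-in for the denoiser's predicted mean
target = np.full(4, 2.0)     # stand-in for a latent satisfying the control
for _ in range(200):
    x = guided_step(x, mean, target)
print(x)  # settles between the fluency mean and the control target
```

With equal weights the update converges to the midpoint of the two pulls, illustrating the fluency-vs-control trade-off the paper regularizes.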
DeepPERF: A Deep Learning-Based Approach For Improving Software Performance
  • Performance bugs may not cause system failure and may depend on user input, so detecting them can be challenging. They also tend to be harder to fix than non-performance bugs.
  • In recent years, a variety of performance bug detection approaches have emerged to help developers identify performance issues. However, a majority of existing performance bug detection approaches focus on specific types of performance problems and rely on expert-written algorithms or pre-defined set of rules to detect and fix issues. Building rule-based analyzers is a non-trivial task, as it requires achieving the right balance between precision and recall. Once developed, maintaining these rules can also be costly.
  • Transformer-based approaches have been shown to achieve state-of-the-art performance, not only in various NLP problems, but also in a variety of software engineering tasks such as code-completion, documentation generation, unit test generation, bug detection, etc. In this paper, the authors present an approach called DeepPERF that uses a large transformer model to suggest changes at application source code level to improve its performance. The authors first pretrain the model using masked language modelling (MLM) tasks on English text and source code taken from open source repositories on GitHub, followed by finetuning on millions of performance commits made by .NET developers.
  • This paper by Garg et al. from Microsoft in 2022 shows that their approach is able to recommend patches that provide a wide range of performance optimizations in C# applications. Most suggested changes involve modifications to high-level constructs like API/Data Structure usages or other algorithmic changes, often spanning multiple methods, which cannot be optimized away automatically by the C# compiler and could, therefore, lead to slow-downs on the user’s side.
  • Their evaluation shows that the model can generate the same performance improvement suggestion as the developer fix in ∼53% of the cases, getting ∼34% of them verbatim in their expert-verified dataset of performance changes made by C# developers. Additionally, the authors evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that the model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations.
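The MLM pretraining step can be sketched as BERT-style token corruption; the 80/10/10 replacement recipe below is the common convention and an assumption about this paper's exact setup:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption: select ~15% of tokens as prediction
    targets; of those, 80% become [MASK], 10% a random token, 10% are
    kept unchanged. (A sketch of the pretraining objective, not
    DeepPERF's exact recipe.)"""
    rng = random.Random(seed)
    vocab = list(set(tokens))
    out, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            targets.append(tok)               # model must predict this token
            r = rng.random()
            if r < 0.8:
                out.append("[MASK]")
            elif r < 0.9:
                out.append(rng.choice(vocab))  # random replacement
            else:
                out.append(tok)                # kept unchanged
        else:
            targets.append(None)               # not a prediction target
            out.append(tok)
    return out, targets

toks = "var total = items . Where ( x => x . Active ) . Count ( ) ;".split()
corrupted, targets = mask_tokens(toks)
print(sum(t is not None for t in targets), "tokens selected as targets")
```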
No Language Left Behind: Scaling Human-Centered Machine Translation
  • Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages.
  • This paper by Costa-jussà et al. from Meta AI in 2022 explores what it takes to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind. In No Language Left Behind, they take on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers.
  • Furthermore, they created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, they developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages.
  • They propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, they evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Their model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.
  • Facebook AI article; Code
  • They tackle three major tasks:
    • Automatic dataset construction for low-resource languages: They’ve solved this by investing in a teacher-student training procedure, making it possible to 1) extend LASER’s language coverage to 200 languages, and 2) produce a massive amount of data, even for low resource languages.
      • Specifically, to scale one model to hundreds of languages, as the first step, they built an appropriate data set. Meta created an initial model able to detect languages automatically, which they call their language identification system.
      • It then uses another language model based on Transformers to find sentence pairs for all the scraped data. These two models are only used to build the 200 paired-languages datasets they need to train the final language translation model, NLLB200.
    • Modeling 200 languages: They’ve developed a Sparse Mixture-of-Experts model that has a shared and specialized capacity, so low-resource languages without much data can be automatically routed to the shared capacity. When combined with better regularization systems, this avoids overfitting. Further, they used self-supervised learning and large-scale data augmentation through multiple types of back translation.
      • Specifically, the multi-language translation model is a Transformer based encoder-decoder architecture. This implies NLLB200 takes a text sentence, encodes it and then decodes it to produce a new text sentence, a translated version of the input.
      • What’s new is the modifications they’ve made to the model to scale up to so many different languages instead of being limited to one. The first modification is adding a variable identifying the source language of the input, taken from the language detector we just discussed; this helps the encoder do a better job for the current input language. Then, they do the same thing with the decoder, telling it which language to translate into. Note that this conditioned encoding scheme is very similar to CLIP, which encodes images and text similarly. Here, in ideal conditions, the model will encode a sentence similarly whatever the language.
      • They use Sparsely Gated Mixture of Experts models to achieve a more optimal trade-off between cross-lingual transfer and interference and improve performance for low-resource languages. Sparsely Gated Mixture of Experts are basically regular models but only activate a subset of model parameters per input instead of involving most if not all parameters every time. You can easily see how this is the perfect kind of model for this application. The Mixture of Experts is simply an extra step added in the Transformer architecture for both the encoder and decoder, replacing the feed-forward network sublayer with \(N\) feed-forward networks, each with input and output projections, and the Transformer model automatically learns which subnetwork to use for each language during training.
    • Evaluating translation quality: They extended the coverage of FLORES, a human-translated evaluation benchmark, by 2x to now cover 200 languages. Through automatic metrics and human evaluation support, they’re able to extensively quantify the quality of their translations.
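A minimal NumPy sketch of the Sparsely Gated MoE sublayer described above, with top-1 routing so only one expert FFN runs per token (NLLB-200 itself uses top-2 gating with load-balancing terms, omitted here; all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4

# One FFN weight pair per expert, plus a gating matrix.
W_in = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
W_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(tokens):
    """Replace the Transformer FFN sublayer with N expert FFNs:
    each token is routed to (and computed by) only its top-1 expert."""
    logits = tokens @ W_gate                       # (n_tokens, n_experts)
    expert_ids = logits.argmax(axis=-1)            # top-1 routing decision
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates = gates / gates.sum(-1, keepdims=True)   # softmax gate values
    out = np.empty_like(tokens)
    for i, (tok, e) in enumerate(zip(tokens, expert_ids)):
        h = np.maximum(tok @ W_in[e], 0.0)         # chosen expert's FFN (ReLU)
        out[i] = gates[i, e] * (h @ W_out[e])
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_forward(tokens).shape)  # → (5, 8)
```

Only a subset of parameters is activated per token, which is why capacity can grow to many languages without a proportional compute increase.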
Efficient Few-Shot Learning Without Prompts
  • Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are subject to high variability from manually crafted prompts, and typically require billion-parameter language models to achieve high accuracy.
  • This paper by Tunstall et al. from Hugging Face, Cohere, TU Darmstadt, and Intel Labs in 2022 addresses these shortcomings by proposing SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of text pairs, in a contrastive Siamese manner.
  • The resulting model is then used to generate rich text embeddings, which are used to train a classification head. Compared to other few-shot learning methods, SetFit has several unique features:
    • No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that’s suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
    • Fast to train: SetFit doesn’t require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
    • Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
    • Achieves high accuracy: SetFit achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. This is accomplished with orders of magnitude less parameters than existing techniques.
  • Their experiments show that SetFit obtains comparable results with PEFT and PET techniques, while being an order of magnitude faster to train. They also show that SetFit can be applied in multilingual settings by simply switching the ST body.
  • Code.
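The contrastive pair-generation step SetFit performs before fine-tuning the Sentence Transformer can be sketched as follows (a simplification; the library's actual sampling strategy differs in detail):

```python
import itertools
import random

def make_contrastive_pairs(examples, n_pairs=16, seed=0):
    """Build (text_a, text_b, label) pairs from a few labeled examples:
    label 1 if the two texts share a class (positive pair), 0 otherwise
    (negative pair). These pairs drive the Siamese contrastive
    fine-tuning of the Sentence Transformer body."""
    rng = random.Random(seed)
    pairs = [(t1, t2, 1 if c1 == c2 else 0)
             for (t1, c1), (t2, c2) in itertools.combinations(examples, 2)]
    rng.shuffle(pairs)
    return pairs[:n_pairs]

few_shot = [("great movie", "pos"), ("loved it", "pos"),
            ("terrible film", "neg"), ("waste of time", "neg")]
pairs = make_contrastive_pairs(few_shot)
print(len(pairs))  # → 6
```

The resulting embeddings from the fine-tuned ST are then fed to a simple classification head (e.g. logistic regression), which is the second stage of the pipeline.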

Large language models are different
  • The following summary has been contributed by Zhibo Zhang.
  • Large language models have shown promising capability in various natural language tasks in recent years. This presentation by Wei from Google Brain in 2022 covers some recent works on large language models.
  • The motivation behind large language models is clear: It is ideal to have pre-trained models that can easily generalize to different downstream tasks rather than training a new model for each different task that will require a new dataset. In addition, pre-trained large language models only require a few labeled examples to learn from when it comes to a new task.
  • Training a large language model and doing inference on it typically contains the following components:
    • Pre-train a language model on a massive amount of data. The model size nowadays is huge, such as GPT-3 which contains 175 billion parameters (Brown et al., 2020) and PaLM which contains 540 billion parameters (Chowdhery et al., 2022). An important property of large language models is the emergent ability. That is, the performance of the model grows from near-random to well above-random after the model size reaches a certain threshold (Wei et al., 2022).
    • Perform in-context learning with a few examples. This step is typically done through prompting techniques, where a few example natural language tasks are provided in the form of input-label pairs, and the machine is expected to generalize the learning outcome to predict the label for an unseen input. Notice that the term “learning” here does not involve any optimization step of the model parameters.
  • Researchers have been trying to understand the properties of prompting. In particular, Zhao et al., 2021 discuss three major biases introduced by the natural language prompts during in-context learning:
    • The majority label bias: the predictions largely depend on the majority label in the prompts.
    • The recency bias: the labels near the end of the prompts affect the predictions more.
    • Common token bias: the predictions are more likely to be high frequency words in the n-gram model.
  • To mitigate these biases, the authors propose applying an affine transformation to calibrate the probabilistic output of the model for each specific prediction, a method named contextual calibration.
  • Min et al., 2022 pointed out that whether the demonstration prompts have the correct labels or not does not significantly affect the prediction. The input text distribution, the label space and the input-label pairing format have a larger impact on the predictions.
  • The speaker also mentioned other prompting techniques, such as chain-of-thought prompting (Wei et al., 2022).
  • In addition to prompting, Wei et al., 2021 show that fine-tuning language models on different datasets through instructions can improve model performance when no demonstrations are given for downstream tasks.
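Of the techniques above, Zhao et al.'s contextual calibration can be sketched directly: the model's output on a content-free input (e.g. "N/A") estimates the prompt's bias, and an affine correction with W = diag(1/p_cf) removes it:

```python
import numpy as np

def contextual_calibration(label_probs, content_free_probs):
    """Calibrate a label distribution using the model's output on a
    content-free input, per Zhao et al.: q ∝ W @ p with W = diag(1/p_cf),
    renormalized to sum to one."""
    W = np.diag(1.0 / np.asarray(content_free_probs))
    q = W @ np.asarray(label_probs)
    return q / q.sum()

# If the prompt biases the model toward label 0 even on a content-free
# input, calibration removes exactly that bias:
p_cf = np.array([0.7, 0.3])  # biased output on "N/A"
p = np.array([0.7, 0.3])     # same biased output on a real input
print(contextual_calibration(p, p_cf))  # → [0.5 0.5]
```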
Solving Quantitative Reasoning Problems with Language Models
  • The following paper summary has been contributed by Zhibo Zhang.
  • Solving Quantitative Reasoning Problems with Language Models by Lewkowycz et al. from Google Research in NeurIPS 2022 introduces Minerva, a language model based on PaLM (Chowdhery et al., 2022) to solve quantitative reasoning problems.
  • Specifically, the authors used the pre-trained PaLM models with 8 billion, 62 billion, and 540 billion parameters, respectively, and fine-tuned them on a technical training dataset composed of web pages of mathematical content, arXiv papers, and general natural language data.
  • At the inference stage, the authors utilized the following techniques to boost the performance of the model:
    • Selecting the most common answer based on a total of \(k\) sampled solutions.
    • Prompting the model with 4 examples when evaluating on the MATH dataset (Hendrycks et al., 2021) and with 5 examples when evaluating on the STEM (science, technology, engineering and mathematics) subset of the MMLU dataset (Hendrycks et al., 2021).
    • Chain-of-thought prompting when evaluating on the GSM8k dataset (Cobbe et al., 2021) and the subset of the MMLU dataset.
  • Empirically, under the same model scale, Minerva consistently outperformed the PaLM model on the evaluation datasets according to the paper. In addition, Minerva with 62 billion parameters and 540 billion parameters outperformed both OpenAI davinci-002 and published state-of-the-art on the MATH dataset.
  • Through additional validation, the authors concluded that there is little evidence that memorization contributes to the model performance.
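The first inference-time technique above, selecting the most common answer among \(k\) sampled solutions (majority voting), is straightforward to sketch:

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Pick the most common final answer among k sampled solutions,
    as in Minerva's maj@k answer selection (a sketch: ties resolve
    to whichever answer Counter encounters first)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Five sampled chains of reasoning may disagree on the final answer;
# the plurality answer is returned.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # → 42
```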
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
  • The following paper summary has been contributed by Zhibo Zhang.
  • AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning by Yang et al. from Sun Yat-sen University and Meta AI in NeurIPS 2022 proposes AD-DROP, an attribution-based dropout mechanism for self-attention modules.
  • The authors propose to generate attribution scores based on existing input gradient explanation methods. In particular, the attribution scores are generated for the attention map of each attention head with respect to the output logit of the Transformer for a particular class.
  • Following the above attribution methods, the authors empirically observed that dropping neurons with low attribution scores will lead to a larger degree of overfitting compared to random dropping, and dropping neurons with high attribution scores increases training loss but alleviates the overfitting problem.
  • Based on the above empirical finding, the authors proposed AD-DROP, as indicated in the illustration figure (below) from the paper: the attribution matrices are generated for the self-attention maps based on the logits from the forward pass. The mask matrices (that contain information about which position to drop) are then produced relying on the attribution scores and sampling. As a last step, an element-wise addition operation between the mask matrices and the original self-attention maps is done to produce the masked self-attention maps, which are then used to perform the forward propagation.
  • In addition, the authors proposed a cross-tuning algorithm to alternatively perform optimization without dropout (at odd number epochs) and optimization with AD-DROP (at even number epochs) during the training process.
  • The authors conducted experiments on eight tasks of the GLUE benchmark (Wang et al., 2019) using BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models as the base, observing that AD-DROP had the best average performance compared to several other regularization methods.

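A sketch of the mask construction under the observations above: candidate positions are the high-attribution ones, a fraction of which is randomly dropped via a large negative additive mask (the candidate and drop rates below are illustrative hyperparameters, not the paper's values):

```python
import numpy as np

def ad_drop_mask(attribution, drop_rate=0.3, candidate_rate=0.5, rng=None):
    """Sketch of attribution-driven dropout: take the top `candidate_rate`
    fraction of attention positions by attribution score, randomly drop
    `drop_rate` of those candidates, and return an additive mask
    (0 = keep, -1e9 = drop) for the attention map."""
    rng = rng or np.random.default_rng(0)
    flat = attribution.ravel()
    n_cand = max(1, int(candidate_rate * flat.size))
    candidates = np.argsort(flat)[-n_cand:]        # high-attribution positions
    n_drop = max(1, int(drop_rate * n_cand))
    dropped = rng.choice(candidates, size=n_drop, replace=False)
    mask = np.zeros_like(flat)
    mask[dropped] = -1e9
    return mask.reshape(attribution.shape)

attr = np.random.default_rng(1).random((4, 4))     # toy attribution scores
mask = ad_drop_mask(attr)
print(int((mask < 0).sum()))  # → 2 positions dropped
```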
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
  • The following paper summary has been contributed by Zhibo Zhang.
  • In-Context Learning has been an effective strategy in adapting a pre-trained large language model to a new task by showing the model with a few input-label pairs through prompts. How In-Context Learning works has been an active research topic that people try to understand.
  • This paper by Dai et al. from Peking University, Tsinghua University and Microsoft Research in 2022 proposes that In-Context Learning can be understood as performing implicit fine-tuning on the Transformer models.
  • In particular, Aizerman et al. and Irie et al. pointed out that linear attention is a dual form of linear layers with gradient descent optimization.
  • Based on the above finding and through relaxing the standard attention into linear attention, the authors demonstrate that it is possible to express the attention outcome as a linear expression of any new input query, where the weight matrix can be decomposed into two parts: the part based on the pre-trained model and the updates of the former part due to prompt demonstrations.
  • Empirically, the authors compared models produced by fine-tuning and by In-Context Learning on 6 datasets, observing similarities between the two in terms of prediction capability, updates of the attention output (where the pre-trained model is used as the baseline when calculating the updates), and attention maps.
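The linear-attention duality underlying this view can be checked numerically in a toy single-head setting (the shapes and the simplified attention form below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W0 = rng.standard_normal((d, d))   # stands in for the pre-trained weights
K = rng.standard_normal((6, d))    # keys of the in-context demonstrations
V = rng.standard_normal((6, d))    # values of the in-context demonstrations
q = rng.standard_normal(d)         # the new test query

# Linear (softmax-free) attention over the demonstrations,
# plus the pre-trained path:
attn_out = W0 @ q + V.T @ (K @ q)

# Dual form: a linear layer whose weights received an "implicit
# gradient update" delta_W = sum_i v_i k_i^T from the demonstrations.
delta_W = V.T @ K
dual_out = (W0 + delta_W) @ q

print(np.allclose(attn_out, dual_out))  # → True
```

The demonstrations thus act like a rank-limited weight update applied at inference time, which is the sense in which in-context learning resembles implicit fine-tuning.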
Finetuned language models are zero-shot learners
  • This paper by Wei et al. from Google, published at ICLR 2022, investigates a method called “instruction tuning”, a simple method for improving the zero-shot learning abilities of language models. The authors show that instruction tuning – finetuning language models on a collection of datasets described via instructions – substantially improves zeroshot performance on unseen tasks.
  • The authors modify a 137B parameter pretrained language model (LaMDA-PT), instruction tuning it on over 60 NLP datasets described via natural language instructions. This results in the Finetuned LAnguage Net (FLAN) model, evaluated on unseen task types, which notably surpasses zero-shot 175B GPT-3 on 20 out of 25 datasets and outperforms few-shot GPT-3 on several key datasets. This process is illustrated below with a couple of examples:

  • The model architecture used is LaMDA-PT, a decoder-only transformer language model, pretrained on a diverse collection of web documents, dialog data, and Wikipedia. The instruction tuning involves mixing all datasets and randomly sampling from each, with limitations on training example numbers per dataset and a specific mixing scheme.
  • FLAN’s evaluation encompassed a variety of tasks: natural language inference (NLI), reading comprehension, closed-book QA, translation, commonsense reasoning, coreference resolution, and struct-to-text. The evaluation strategy involved grouping datasets into task clusters, holding out each cluster during instruction tuning, and then evaluating performance on these held-out tasks. This method showed that instruction tuning significantly improves the base model’s performance on most datasets.
  • In more detail, FLAN demonstrated superior performance in NLI tasks by rephrasing queries into more natural questions. For reading comprehension, it excelled in tasks like MultiRC, OBQA, and BoolQ. In closed-book QA, FLAN outperformed GPT-3 across all four datasets. In translation tasks, FLAN showed strong results, especially in translating into English, though it was weaker in translating from English into other languages.
  • The study revealed some limitations of instruction tuning. It didn’t improve performance on tasks closely aligned with the original language modeling pre-training objective, like commonsense reasoning or coreference resolution tasks framed as sentence completions.
  • The figure below from the paper shows how instruction tuning, a simple method that merges elements of the pretraining-finetuning and prompting paradigms, can effectively improve a model’s performance on unseen tasks, especially when the model scale is substantial. This finding highlights the potential of using labeled data in instruction tuning to enhance the capabilities of large language models across various tasks, using supervision via finetuning to improve a language model’s responses to inference-time text interactions and shifting the focus towards more generalist models. Their empirical results demonstrate promising abilities of language models to perform tasks described purely via instructions.

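The dataset-mixing step described above can be sketched as examples-proportional mixing with a per-dataset cap, so no single large dataset dominates the instruction-tuning mixture (the cap value below is illustrative, not FLAN's exact setting):

```python
def mixing_weights(dataset_sizes, cap=30_000):
    """Examples-proportional mixing with a per-dataset cap: each
    dataset's sampling weight is proportional to min(size, cap),
    limiting the influence of very large datasets."""
    capped = {name: min(n, cap) for name, n in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

# A 400k-example dataset is capped, so it samples no more often
# than a 30k-example one:
sizes = {"nli_big": 400_000, "qa_small": 10_000, "mt_mid": 30_000}
w = mixing_weights(sizes)
print(w)  # nli_big and mt_mid each get 3/7, qa_small gets 1/7
```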
Learning to summarize from human feedback
  • As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about – summary quality.
  • This paper by Stiennon et al. from OpenAI in 2020 introduces Reinforcement Learning from Human Feedback (RLHF), a framework that shows that it is possible to significantly improve summary quality by training a model to optimize for human preferences.
  • They collect a large, high-quality dataset of human preferences/comparisons between summaries, train a reward model via supervised learning to predict the human-preferred summary, and use that model as a reward function (“reward model”) to fine-tune large pretrained models (they use GPT-3) using a summarization policy obtained using reinforcement learning. Specifically, they train a policy via reinforcement learning (RL) to maximize the score given by the reward model; the policy generates a token of text at each ‘time step’, and is updated using the proximal policy optimization (PPO) algorithm based on the reward model’s reward given to the entire generated summary. They can then gather more human data using samples from the resulting policy, and repeat the process.
  • Empirically, RLHF tends to perform better than supervised fine-tuning. This is because supervised fine-tuning uses a token-level loss (that can be summed or averaged over the text passage), and RLHF takes the entire text passage, as a whole, into account.
  • They apply the method to a version of the TL;DR dataset of Reddit posts and find that their models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone.
  • Their models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning.
  • They conduct extensive analyses to understand their human feedback dataset and fine-tuned models. They establish that their reward model generalizes to new datasets, and that optimizing their reward model results in better summaries than optimizing ROUGE according to humans.
  • The key takeaway here is to pay closer attention to how the training loss affects the model behavior that is actually desired.
  • The graph below from the paper shows the fraction of the time humans prefer summaries from variations of the trained models over the human-generated reference summaries on the TL;DR dataset.

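The reward model in this pipeline is trained on pairwise human comparisons; a minimal sketch of that loss, which scores the human-preferred summary above the rejected one (the scalar rewards below are illustrative):

```python
import math

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise comparison loss for reward-model training:
    -log sigmoid(r_preferred - r_rejected). Minimizing it pushes the
    reward model to score the human-preferred summary higher; the
    trained reward model then becomes the scalar objective the RL
    policy maximizes."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# The loss shrinks when the model already ranks the preferred summary
# higher, and grows when the ordering is wrong:
print(reward_model_loss(2.0, 0.0))   # small loss: correct ordering
print(reward_model_loss(0.0, 2.0))   # large loss: wrong ordering
```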
Training language models to follow instructions with human feedback
  • Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
  • This paper by Ouyang et al. from OpenAI in 2022 introduces InstructGPT, a model that aligns language models with user intent on a wide range of tasks by fine-tuning with human feedback.
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which they use to fine-tune GPT-3 using supervised fine-tuning (SFT). This process is referred to as “instruction tuning” by other papers such as Wei et al. (2022).
  • They then collect a dataset of rankings of model outputs, which they use to further fine-tune this supervised model using Reinforcement Learning from Human Feedback (RLHF).
  • In human evaluations on their prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
  • Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, their results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
  • It is important to note that ChatGPT is trained using the same methods as InstructGPT (using SFT followed by RLHF), but is fine-tuned from a model in the GPT-3.5 series.
  • Furthermore, the fine-tuning process proposed in the paper isn’t without its challenges. First, we need a significant volume of demonstration data. For instance, in the InstructGPT paper, they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF. Second, fine-tuning comes with an alignment tax, or “negative transfer” – the process can lead to lower performance on certain critical tasks. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. A potential workaround is to have several smaller, specialized models that excel at narrow tasks.
  • The figure below from the paper illustrates the three steps of training InstructGPT: (1) SFT, (2) reward model training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train the respective model in the diagram. In Step 2, boxes A-D are samples from the SFT model that get ranked by labelers.
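  • During the PPO stage, the effective reward is the reward-model score minus a KL penalty that keeps the fine-tuned policy close to the SFT reference model. A minimal per-sequence sketch (`beta` is an illustrative coefficient, not the paper’s value):

```python
def rlhf_reward(rm_score, logprob_policy, logprob_ref, beta=0.02):
    # Reward-model score minus a KL penalty toward the SFT reference;
    # the penalty discourages the policy from drifting far from the SFT model.
    return rm_score - beta * (logprob_policy - logprob_ref)

# When the policy assigns its sampled output a much higher log-probability
# than the SFT reference does, the KL term eats into the reward.
r = rlhf_reward(rm_score=1.0, logprob_policy=-10.0, logprob_ref=-12.0, beta=0.5)
```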

Constitutional AI: Harmlessness from AI Feedback
  • As AI systems become more capable, we would like to enlist their help to supervise other AIs.
  • This paper by Bai et al. from Anthropic in 2022 experiments with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so they refer to the method as ‘Constitutional AI’.
  • The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase they sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, they sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
  • They then train with RL using the preference model as the reward signal, i.e. they use ‘RL from AI Feedback’ (RLAIF). As a result they are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
  • The figure below from the paper shows the basic steps of their Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

  • The graph below shows harmlessness versus helpfulness Elo scores (higher is better, only differences are meaningful) computed from crowdworkers’ model comparisons for all 52B RL runs. Points further to the right are later steps in RL training. The Helpful and HH models were trained with human feedback as in [Bai et al., 2022], and exhibit a tradeoff between helpfulness and harmlessness. The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness. The crowdworkers evaluating these models were instructed to prefer less evasive responses when both responses were equally harmless; this is why the human feedback-trained Helpful and HH models do not differ more in their harmlessness scores.

RoFormer: Enhanced Transformer with Rotary Position Embedding
  • Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence.
  • This paper by Su et al. from Zhuiyi Technology Co., Ltd. in 2022 first investigates various methods to integrate positional information into the learning process of transformer-based language models. Then, they propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
  • RoPE thus takes relative positions into account: attention scores are affected by the distance between two tokens (rather than by their absolute indices), and the effect decays with distance. The larger the distance between two tokens, the smaller the effect. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.
  • The following figure from the paper shows the implementation of Rotary Position Embedding (RoPE).

  • Also, RoPE is multiplicative in nature; so instead of “shifting” the word embedding by addition (similar to the “bias” term in neural networks, which has a shifting effect), it “scales” the effect due to rotation.
  • You can use RoPE with “linear attention” (a type of efficient attention which has \(O(N)\) complexity compared to \(O(N^2)\) in regular attention).
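  • The rotation can be sketched in a few lines of plain Python (a minimal sketch with illustrative names; real implementations operate on batched tensors). Each 2D pair of features is rotated by an angle proportional to the position, so the dot product between a rotated query and key depends only on their relative offset:

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotate each 2D feature pair by pos * base^(-i/d); the frequency
    # decays with the dimension index i, as in the paper.
    out, d = [], len(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 0.0, 1.0, 0.0]

# The attention score depends only on the relative offset m - n:
s1 = dot(rope(q, 5), rope(k, 3))   # positions (5, 3), offset 2
s2 = dot(rope(q, 9), rope(k, 7))   # positions (9, 7), offset 2
```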
  • Finally, they evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Their experiments show that it consistently outperforms its alternatives. Furthermore, they provide a theoretical analysis to explain some experimental results.
  • Weights docs.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
  • Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? The authors first show that extrapolation can be enabled by simply changing the position representation method, though they find that current methods do not allow for efficient extrapolation.
  • This paper by Press et al. from University of Washington and Facebook AI Research in ICLR 2022 introduces a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance.
  • They show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory.
  • ALiBi’s inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
  • The figure below from the paper shows that when computing attention scores for each head, their linearly biased attention method, ALiBi, adds a constant bias (right) to each attention score \(\left(\mathbf{q}_i \cdot \mathbf{k}_j\right.\), left). As in the unmodified attention sublayer, the softmax function is then applied to these scores, and the rest of the computation is unmodified. \(\mathbf{m}\) is a head-specific scalar that is set and not learned throughout training. They show that their method for setting \(m\) values generalizes to multiple text domains, models and training compute budgets. When using ALiBi, they do not add positional embeddings at the bottom of the network.
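  • The head-specific slopes and the distance penalty can be sketched as follows (a minimal sketch; the geometric-sequence slope recipe follows the paper for head counts that are powers of two):

```python
def alibi_slopes(n_heads):
    # Geometric sequence starting at 2^(-8/n_heads); for 8 heads this
    # gives 1/2, 1/4, ..., 1/256, as in the paper.
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    # Bias added to q·k scores before softmax: 0 on the diagonal,
    # -slope * (i - j) for each past key j <= i (causal attention).
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])   # bias matrix for the first head
```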

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model – with outrageous numbers of parameters – but a constant computational cost.
  • This paper by Fedus et al. from Google in JMLR 2022 introduces the Switch Transformer which seeks to address the lack of widespread adoption of MoE which has been hindered by complexity, communication costs, and training instability.
  • They simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Their proposed training techniques help wrangle the instabilities, and they show that large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
  • The guiding design principle for Switch Transformers is to maximize the parameter count of a Transformer model (Vaswani et al., 2017) in a simple and computationally efficient way. The benefit of scale was exhaustively studied in Kaplan et al. (2020), which uncovered power-law scaling with model size, data set size and computational budget. Importantly, this work advocates training large models on relatively small amounts of data as the computationally optimal approach. Heeding these results, they investigate a fourth axis: increase the parameter count while keeping the floating point operations (FLOPs) per example constant. Their hypothesis is that the parameter count, independent of total computation performed, is a separately important axis on which to scale. They achieve this by designing a sparsely activated model that efficiently uses hardware designed for dense matrix multiplications such as GPUs and TPUs. In their distributed training setup, their sparsely activated layers split unique weights on different devices. Therefore, the weights of the model increase with the number of devices, all while maintaining a manageable memory and computational footprint on each device.
  • Their switch routing proposal reimagines MoE. Shazeer et al. (2017) conjectured that routing to \(k > 1\) experts was necessary in order to have non-trivial gradients to the routing functions. The authors intuited that learning to route would not work without the ability to compare at least two experts. Ramachandran and Le (2018) went further to study the top-\(k\) decision and found that higher \(k\)-values in lower layers in the model were important for models with many routing layers. Contrary to these ideas, they instead use a simplified strategy where they route to only a single expert. They show this simplification preserves model quality, reduces routing computation and performs better. This \(k = 1\) routing strategy is later referred to as a Switch layer.
  • The following figure from the paper illustrates the Switch Transformer encoder block. They replace the dense feed forward network (FFN) layer present in the Transformer with a sparse Switch FFN layer (light blue). The layer operates independently on the tokens in the sequence. They diagram two tokens (\(x_1\) = “More” and \(x_2\) = “Parameters” below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted-line).
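  • Top-1 (“switch”) routing can be sketched as follows (toy experts and router weights for illustration; real implementations add a load-balancing loss and expert capacity limits):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def switch_route(token, router_weights, experts):
    # k = 1 routing: send the token to the single highest-probability
    # expert and scale its output by the router gate value.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    gate = probs[best]
    return [gate * y for y in experts[best](token)], best

# Toy setup: 2-dim tokens and four "experts" that just scale their input.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, expert_id = switch_route([2.0, 0.5], router_weights, experts)
```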

  • They design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where they measure gains over the mT5-Base version across all 101 languages.
  • Finally, they advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model.
  • Code.
Locating and Editing Factual Associations in GPT
  • This paper by Meng et al. from MIT CSAIL, Northeastern University, and Technion in NeurIPS 2022 analyzes the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations.
  • They first develop a causal intervention for identifying neuron activations that are decisive in a model’s factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. Specifically, they perform the following steps to locate factual retrieval:
    • To identify decisive computations, they introduce a method called Causal Tracing. By isolating the causal effect of individual states within the network while processing a factual statement, they can trace the path followed by information through the network.

    • Causal traces work by running a network multiple times, introducing corruptions to frustrate the computation, and then restoring individual states in order to identify the information that restores the results. Tracing can be used to test any individual state or combinations of states. They use carefully-designed traces to identify a specific small set of MLP module computations that mediate retrieval of factual associations.
    • Then they check this finding by asking: can the MLP module computations be altered to edit a model’s belief in a specific fact?
  • To test their hypothesis that these computations correspond to factual association recall, they modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). Specifically, they perform the following steps to edit factual storage:
    • To modify individual facts within a GPT model, they introduce a method called ROME, or Rank-One Model Editing. It treats an MLP module as a simple key-value store: for example, if the key encodes a subject and the value encodes knowledge about the subject, then the MLP can recall the association by retrieving the value corresponding to the key. ROME uses a rank-one modification of the MLP weights to directly write in a new key-value pair.

    • The figure above illustrates a single MLP module within a transformer. The D-dimensional vector at (b) acts as the key that represents a subject to know about, and the H-dimensional output at (c) acts as the value that encodes learned properties about the subject. ROME inserts a new association by making a rank-one change to the matrix (d) that maps from keys to values.
    • Note that ROME assumes a linear view of memory within a neural network rather than an individual-neuron view. This linear perspective sees individual memories as rank-one slices of parameter space. Experiments confirm this view: when a rank-one update is applied to an MLP module at the computational center identified by causal tracing, associations of individual facts can be updated in a way that is both specific and generalized.
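  • Under a simplified identity-covariance view, the rank-one edit that forces the MLP’s key-to-value matrix \(W\) to map a key \(k\) to a new value \(v\) can be sketched as below. This is a minimal sketch: the paper’s actual update solves a constrained least-squares problem using an estimated covariance of key activations.

```python
def rank_one_edit(W, k, v):
    # Minimal rank-one update so that W' @ k == v, while leaving directions
    # orthogonal to k untouched (identity key covariance assumed).
    kk = sum(x * x for x in k)
    Wk = [sum(W[i][j] * k[j] for j in range(len(k))) for i in range(len(W))]
    resid = [v[i] - Wk[i] for i in range(len(v))]
    return [[W[i][j] + resid[i] * k[j] / kk for j in range(len(k))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]        # key: encodes the subject
v = [3.0, 4.0]        # value: the new fact to store
W2 = rank_one_edit(W, k, v)
```

After the edit, `W2` maps `k` exactly to `v`, while the orthogonal direction `[0, 1]` is mapped unchanged, illustrating the specificity of the rank-one write.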
  • They find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, they also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another.
  • Their results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing.
  • Project page.
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
  • There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which have both financial and/or environmental impact.
  • This paper by Tay et al. from Google Research and DeepMind in ICLR 2022 presents scaling insights from pretraining and finetuning Transformers. While Kaplan et al. presents a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear if this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm.
  • The key findings of this paper are as follows:
    1. They show that aside from only the model size, model shape matters for downstream fine-tuning;
    2. Scaling protocols operate differently at different compute regions;
    3. Widely adopted T5-base and T5-large sizes are Pareto-inefficient.
  • To this end, they present improved scaling protocols whereby their redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model.
  • In terms of scaling recommendations, they recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally be more efficient compared to a large model. They generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers.
  • They publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
Holistic Evaluation of Language Models
  • Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
  • This technical report by Liang et al. from Stanford’s Center for Research on Foundation Models (CRFM) presents Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
  • First, HELM taxonomizes the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then they select a broad subset based on coverage and feasibility, noting what is missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).
  • The figure below from the paper shows the importance of the taxonomy to HELM. Previous language model benchmarks (e.g. SuperGLUE, EleutherAI LM Evaluation Harness, BIG-Bench) are collections of datasets, each with a standard task framing and canonical metric, usually accuracy (left). In comparison, in HELM we take a top-down approach of first explicitly stating what we want to evaluate (i.e. scenarios and metrics) by working through their underlying structure. Given this stated taxonomy, we make deliberate decisions on what subset we implement and evaluate, which makes explicit what we miss (e.g. coverage of languages beyond English).

  • Second, HELM adopts a multi-metric approach: They measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don’t fall by the wayside, and that trade-offs are clearly exposed. They also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation).
  • The figure below from the paper shows HELM’s multi-metric measurement for each use case. In comparison to most prior benchmarks of language technologies, which primarily center accuracy and often relegate other desiderata to their own bespoke datasets (if at all), in HELM we take a multi-metric approach. This foregrounds metrics beyond accuracy and allows one to study the tradeoffs between the metrics.

  • Third, HELM conducts a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation.
  • Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. HELM improves this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions.
  • HELM’s evaluation surfaces 25 top-level findings. For full transparency, they release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. They intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
  • Project page.
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
  • In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection.
  • This paper by Laban et al. from UC Berkeley and Microsoft in TACL 2022 proposes \(SummaC\) and revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level).
  • The figure below from the paper shows an example document with an inconsistent summary. When running each sentence pair \((D_i, S_j)\) through an NLI model, \(S_3\) is not entailed by any document sentence. However, when running the entire (document, summary) at once, the NLI model incorrectly predicts that the document highly entails the entire summary.

  • They provide a highly effective and light-weight method called \(SummaC_{Conv}\) that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences.
  • The figure below from the paper shows a diagram of the \(SummaC_{ZS}\) (top) and \(SummaC_{Conv}\) (bottom) models. Both models utilize the same NLI Pair Matrix (middle) but differ in their processing to obtain a score. \(SummaC_{ZS}\) is Zero-Shot, and does not have trained parameters. \(SummaC_{Conv}\) uses a convolutional layer trained on a binned version of the NLI Pair Matrix.
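  • The zero-shot aggregation over the NLI pair matrix can be sketched as follows (a minimal sketch; max over document sentences followed by a mean over summary sentences is one of the operator combinations studied in the paper):

```python
def summac_zs(pair_matrix):
    # pair_matrix[i][j]: NLI entailment score of summary sentence S_j
    # given document sentence D_i (rows: document, columns: summary).
    # For each summary sentence, take the max over document sentences,
    # then aggregate across summary sentences with the mean.
    per_summary = [max(col) for col in zip(*pair_matrix)]
    return sum(per_summary) / len(per_summary)

# Summary sentence S_3 is not entailed by any document sentence,
# which drags down the overall consistency score.
nli_scores = [
    [0.9, 0.1, 0.0],   # D_1 vs S_1..S_3
    [0.2, 0.8, 0.1],   # D_2 vs S_1..S_3
]
score = summac_zs(nli_scores)
```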

  • On their newly introduced benchmark called \(SummaC\) (Summary Consistency) consisting of six large inconsistency detection datasets, \(SummaC_{Conv}\) obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% point improvement compared to prior work.
  • Code.
InCoder: A Generative Model for Code Infilling and Synthesis
  • Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined.
  • This paper by Fried et al. in ICLR 2023 from FAIR, UW, UC Berkeley, TTI-Chicago, and CMU introduces InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling).
  • InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context.
  • InCoder is the first generative model that is able to directly perform zero-shot code infilling, which they evaluate on challenging tasks such as type inference, comment generation, and variable re-naming.
  • The figure below from the paper shows that during training time (top), InCoder’s causal masking objective samples one or more spans of code in training documents (in the upper left figure, a single span) and moves these spans to the end of the document, with their original location denoted by special mask sentinel tokens. An autoregressive language model is trained to produce these entire masked documents, allowing it to learn to generate insertion text conditioned on bidirectional context. At inference time (bottom), InCoder can perform a variety of code editing and infilling tasks in a zero-shot fashion by inserting mask tokens at desired locations and allowing the model to generate code to insert there. All examples shown are real outputs from the InCoder-6.7B model, with the regions inserted by the model highlighted in orange.
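  • The causal masking transformation described above can be sketched on a token list as follows (a minimal single-span sketch; the sentinel and end-of-mask token names are illustrative, not InCoder’s exact vocabulary):

```python
def causal_mask(tokens, span_start, span_end, sentinel="<MASK:0>", eom="<EOM>"):
    # Move the chosen span to the end of the document, leaving a sentinel
    # at its original location so the model sees bidirectional context
    # before generating the span autoregressively.
    span = tokens[span_start:span_end]
    masked = tokens[:span_start] + [sentinel] + tokens[span_end:]
    return masked + [sentinel] + span + [eom]

doc = ["def", "f", "(", "x", ")", ":", "return", "x"]
out = causal_mask(doc, 6, 8)   # mask the function body
```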

  • They find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale.
Large Language Models are Zero-Shot Reasoners
  • This paper by Kojima et al. from University of Tokyo and Google Brain in NeurIPS 2022 explores the zero-shot reasoning capabilities of large language models (LLMs) through a simple technique called Zero-shot Chain of Thought (Zero-shot-CoT).
  • Zero-shot-CoT adds the prompt “Let’s think step by step” before the answer to elicit multi-step reasoning from LLMs, without requiring any task-specific examples like prior work in Chain of Thought (CoT) prompting.
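  • The paper’s two-stage pipeline (first elicit reasoning, then extract the final answer) can be sketched as follows; `fake_llm` is a stub standing in for a real model API:

```python
def zero_shot_cot(question, llm):
    # Stage 1: elicit step-by-step reasoning with the fixed trigger prompt.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(reasoning_prompt)
    # Stage 2: append an answer-extraction prompt to pull out the answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return llm(answer_prompt)

# Illustrative stub LLM: returns reasoning for stage 1, the answer for stage 2.
def fake_llm(prompt):
    if prompt.endswith("step by step."):
        return " 3 + 4 = 7."
    return " 7"

answer = zero_shot_cot("What is 3 + 4?", fake_llm)
```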
  • The figure below from the paper shows example inputs and outputs of GPT-3 with (a) standard Few-shot ([Brown et al., 2020]), (b) Few-shot-CoT ([Wei et al., 2022]), (c) standard Zero-shot, and (d) Zero-shot-CoT. Similar to Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (blue text) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT, which uses step-by-step reasoning examples per task, Zero-shot-CoT does not need any examples and just uses the same prompt “Let’s think step by step” across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).

  • Experiments across 12 diverse reasoning tasks (arithmetic, symbolic, commonsense, logical) show Zero-shot-CoT substantially improves over standard zero-shot prompting. For example, on MultiArith accuracy increases from 17.7% to 78.7% for InstructGPT.
  • Zero-shot-CoT also shows improvements with scaling model size, akin to few-shot CoT prompting, suggesting the single prompt unlocks latent multi-step reasoning capabilities inside LLMs.
  • The simplicity and versatility of Zero-shot-CoT across tasks, compared to careful per-task prompt engineering in prior work, highlights the surprisingly broad cognitive capabilities hidden in LLMs.
  • The authors suggest Zero-shot-CoT serves as a strong zero-shot baseline and encourages further analysis of the multi-task, broad cognitive abilities of LLMs before crafting specialized prompts or datasets.
  • Code.
An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
  • This paper by Wu et al. from UCL, Harbin Institute of Technology, and University of Edinburgh in EMNLP 2022 proposes Efficient Memory-Augmented Transformer (EMAT), which augments Transformer models with an efficient key-value memory module to leverage external knowledge for knowledge-intensive NLP tasks like open-domain QA, outperforming baselines on knowledge-intensive NLP tasks.
  • EMAT encodes external knowledge (e.g. question-answer pairs from PAQ) into dense key and value representations to build the memory. Keys are encoded questions, values are encoded answers.
  • At inference time, EMAT produces a query to retrieve relevant keys and values from memory using fast maximum inner product search. The retrieved representations are integrated into the Transformer encoder to inform generation.
  • EMAT requires only a single inference pass through the Transformer, allowing memory access to run concurrently for efficiency.
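  • The key-value memory lookup at the core of EMAT can be illustrated with a brute-force stand-in for maximum inner product search (EMAT itself uses a fast approximate index over millions of PAQ entries; the vectors and values here are purely illustrative):

```python
def mips(query, keys):
    # Exact maximum inner product search: score every stored key against
    # the query and return the index of the best match.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return max(range(len(scores)), key=lambda i: scores[i])

keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # dense encodings of stored questions
values = ["answer A", "answer B", "answer C"]  # the paired answer representations
idx = mips([0.9, 0.1], keys)                   # encoded incoming query
retrieved = values[idx]
```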
  • The figure below from the paper shows the architecture of the proposed Efficient Key-Value Memory Augmented Transformers (EMAT): factual knowledge is stored in a key-value memory where keys and values correspond to questions and answers, respectively; during inference, the model retrieves information from the memory via MIPS and uses it to condition the generation process.

  • EMAT introduces pre-training objectives for learning informative key-value representations and an implicit strategy to integrate multiple memory slots.
  • Experiments on open-domain QA, dialogue, and long-form QA show EMAT significantly outperforms vanilla Transformers while retaining high throughput. It also outperforms retrieval-augmented models on some tasks while being much faster.
  • Ablations demonstrate the importance of the pre-training objectives. Qualitative analysis shows EMAT retrieves useful information but does not just copy from memory.
  • Main limitations are the need for weak supervision to train the retriever, and large memory requirements.
Unsupervised Dense Information Retrieval with Contrastive Learning
  • This paper by Izacard et al. and published in Transactions on Machine Learning Research (TMLR) in August 2022 presents a novel approach in the field of information retrieval.
  • The study focuses on overcoming the limitations of dense retrievers that utilize neural networks, which perform well on large training datasets but struggle with new applications lacking specific training data. Traditional methods like BM25, based on term-frequency, often outperform these dense retrievers in unsupervised settings.
  • The authors explore the application of contrastive learning for training unsupervised dense retrievers. This approach, inspired by successful applications in computer vision, is examined to see if it can match or exceed the performance of term-frequency methods like BM25.
    • Contrastive learning is an approach that relies on the fact that every document is, in some way, unique. This signal is the only information available in the absence of manual supervision. A contrastive loss is used to learn by discriminating between documents. This loss compares either positive (from the same document) or negative (from different documents) pairs of document representations. Formally, given a query \(q\) with an associated positive document \(k_{+}\), and a pool of negative documents \(\left(k_i\right)_{i=1 . . K}\), the contrastive InfoNCE loss is defined as:
    \[\mathcal{L}\left(q, k_{+}\right)=-\log \frac{\exp \left(s\left(q, k_{+}\right) / \tau\right)}{\exp \left(s\left(q, k_{+}\right) / \tau\right)+\sum_{i=1}^K \exp \left(s\left(q, k_i\right) / \tau\right)}\]
    • where \(\tau\) is a temperature parameter. This loss encourages positive pairs to have high scores and negative pairs to have low scores. Another interpretation of this loss function is the following: given the query representation \(q\), the goal is to recover, or retrieve, the representation \(k_{+}\) corresponding to the positive document, among all the negatives \(k_i\). In the following, we refer to the left-hand side representations in the score \(s\) as queries and the right-hand side representations as keys.
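  • The loss translates directly into a few lines of Python (a minimal sketch with a dot-product score and an illustrative temperature; real training uses batched in-batch negatives):

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.05):
    # Contrastive InfoNCE loss: push the score of the positive key up
    # relative to the negatives, scaled by temperature tau.
    def s(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos = math.exp(s(q, k_pos) / tau)
    denom = pos + sum(math.exp(s(q, k_neg) / tau) for k_neg in k_negs)
    return -math.log(pos / denom)

q = [0.6, 0.8]
# Aligned positive, orthogonal negative: loss is near zero.
loss_easy = info_nce(q, [0.6, 0.8], [[-0.8, 0.6]])
# Orthogonal positive, aligned negative: loss is large.
loss_hard = info_nce(q, [-0.8, 0.6], [[0.6, 0.8]])
```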
  • A critical ingredient for this training paradigm is to obtain positive pairs from a single text document, which is done as follows:
    • A crucial element of contrastive learning is how to build positive pairs from a single input. In computer vision, this step relies on applying two independent data augmentations to the same image, resulting in two “views” that form a positive pair. While we consider similar independent text transformations, we also explore dependent transformations designed to reduce the correlation between views.
    • Inverse Cloze Task is a data augmentation that generates two mutually exclusive views of a document, introduced in the context of retrieval by Lee et al. (2019). The first view is obtained by randomly sampling a span of tokens from a segment of text, while the complement of the span forms the second view. Specifically, given a sequence of text \(\left(w_1, \ldots, w_n\right)\), ICT samples a span \(\left(w_a, \ldots, w_b\right)\), where \(1 \leq a \leq b \leq n\), uses the tokens of the span as the query and the complement \(\left(w_1, \ldots, w_{a-1}, w_{b+1}, \ldots, w_n\right)\) as the key. In the original implementation by Lee et al. (2019) the span corresponds to a sentence, and is kept in the document 10% of the time to encourage lexical matching. The Inverse Cloze Task is closely related to the Cloze task which uses the span complement \(\left(w_1, \ldots, w_{a-1}, w_{b+1}, \ldots, w_n\right)\) as the query.
    • Independent cropping is a common independent data augmentation used for images where views are generated independently by cropping the input. In the context of text, cropping is equivalent to sampling a span of tokens. This strategy thus independently samples two spans from a document to form a positive pair. As opposed to the inverse Cloze task, in cropping both views of the example correspond to contiguous subsequences of the original data. A second difference between cropping and ICT is that independent random cropping is symmetric: both the queries and documents follow the same distribution. Independent cropping also leads to overlap between the two views of the data, hence encouraging the network to learn exact matches between the query and document, in a way that is similar to lexical matching methods like BM25. In practice, we can either fix the lengths of the spans for the query and the key, or sample them.
    • Additional data augmentation. Finally, we also consider additional data augmentations such as random word deletion, replacement or masking. We use these perturbations in addition to random cropping.
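    • The two positive-pair constructions above can be sketched over plain token lists (a simplified illustration; function names are my own, and Lee et al.’s 10% span-retention trick is omitted):

```python
import random

def inverse_cloze(tokens, span_len):
    """ICT: a sampled span becomes the query; its complement becomes the key."""
    a = random.randrange(0, len(tokens) - span_len + 1)
    return tokens[a:a + span_len], tokens[:a] + tokens[a + span_len:]

def independent_crop(tokens, span_len):
    """Independent cropping: two contiguous spans sampled independently,
    so both views follow the same distribution and may overlap."""
    def crop():
        a = random.randrange(0, len(tokens) - span_len + 1)
        return tokens[a:a + span_len]
    return crop(), crop()
```

    • Note the structural difference: ICT views are mutually exclusive by construction, while independent crops can share tokens, which encourages lexical matching.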
  • An important aspect of contrastive learning is to sample a large set of negatives. Most standard frameworks differ from each other in terms of how the negatives are handled, and we briefly describe two of them, in-batch negative sampling and MoCo, that we use in this work.
    • Negatives within a batch. A first solution is to generate the negatives by using the other examples from the same batch: each example in a batch is transformed twice to generate positive pairs, and we generate negatives by using the views from the other examples in the batch. We will refer to this technique as “in-batch negatives”. In that case, the gradient is back-propagated through the representations of both the queries and the keys. A downside of this approach is that it requires extremely large batch sizes to work well, with reported improvements in the context of information retrieval for up to 8192 negatives. This method has been widely used to train information retrieval models with supervised data and was also considered when using ICT to pre-train retrievers.
    • Negative pairs across batches. An alternative approach is to store representations from previous batches in a queue and use them as negative examples in the loss (Wu et al., 2018). This allows for smaller batch sizes but slightly changes the loss by making it asymmetric between “queries” (the views generated from the elements of the current batch) and “keys” (the elements stored in the queue). Gradient is only backpropagated through the “queries”, and the representations of the “keys” are considered fixed. In practice, the features stored in the queue from previous batches come from previous iterations of the network. This leads to a drop in performance when the network rapidly changes during training. Instead, He et al. (2020) proposed to generate representations of keys from a second network that is updated more slowly. This approach, called MoCo, considers two networks: one for the keys, parametrized by \(\theta_k\), and one for the queries, parametrized by \(\theta_q\). The parameters of the query network are updated with backpropagation and stochastic gradient descent, similarly to when using in-batch negatives, while the parameters of the key network, or momentum encoder, are updated from the parameters of the query network by using an exponential moving average: \(\theta_k \leftarrow m \theta_k+(1-m) \theta_q\), where \(m\) is the momentum parameter that takes its value in \([0,1]\).
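    • The MoCo-style bookkeeping described above can be sketched as follows (a toy illustration with parameters represented as flat float lists and keys as opaque objects; names are assumptions):

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update of the key (momentum) encoder from the query encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

def push_negatives(queue, new_keys, max_size):
    """Enqueue the current batch's key representations and evict the oldest
    ones, keeping the pool of cross-batch negatives at a bounded size."""
    queue.extend(new_keys)
    while len(queue) > max_size:
        queue.popleft()  # oldest keys are discarded first
    return queue
```

    • With \(m\) close to 1, the key encoder drifts slowly, so the keys stored in the queue stay approximately consistent with the current query encoder.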
  • They use a transformer network to embed both queries and documents. Alternatively, two different encoders can be used to encode queries and documents respectively as in DPR. Empirically, they observed that using the same encoder, such as in Xiong et al. (2020) and Reimers & Gurevych (2019), generally improves robustness in the context of zero-shot transfer or few-shot learning, while having no impact on other settings.
  • Significant contributions of the paper include demonstrating that contrastive learning can lead to competitive unsupervised retrievers. The model, named Contriever, shows promising results on the BEIR benchmark, outperforming BM25 on 11 out of 15 datasets for Recall@100. The model benefits from a few training examples and achieves better results than models transferred from large datasets like MS MARCO. Ablation studies highlighted that cropping is a more effective approach than the inverse Cloze task for building positive pairs.
  • The implementation details of Contriever reveal that it employs MoCo with random cropping for contrastive learning. This is a deviation from the Inverse Cloze Task (ICT) approach. The training data includes a mix of documents from Wikipedia and CCNet. The model shows strong performance against established benchmarks such as NaturalQuestions and TriviaQA, even in fully unsupervised settings without fine-tuning on MS MARCO or other annotated data.
  • The paper also delves into the realm of multilingual retrieval, a significant area where large labeled datasets are typically scarce, especially for lower-resource languages. The multilingual model, mContriever, demonstrates strong performance in both fully unsupervised settings and when fine-tuned on English data. This model is capable of effective cross-lingual retrieval, a significant advancement over traditional lexical matching methods.
  • In summary, the paper introduces Contriever, an unsupervised dense retriever trained using contrastive learning, which effectively handles tasks in information retrieval, including multilingual and cross-lingual retrieval. This approach marks a notable advancement in the field, particularly in settings where large annotated datasets are unavailable.
Implicit Relation Linking for Question Answering over Knowledge Graph
  • This paper by Zhao et al. from Nanjing University and Alibaba Group in ACL 2022 addresses the challenge of linking implicit relations in natural language to knowledge graphs for question answering systems.
  • The authors introduce ImRL, a novel method that links natural language relation phrases to relation paths in knowledge graphs. This approach is significant as it deals with the ambiguity of natural language and the incompleteness of knowledge graphs.
  • The figure below from the paper shows an example of RL to DBpedia. There is no explicit relation between dbr:Dragonaut:_The_Resonance and dbr:Japan. We expect to implicitly link the phrase “from” to an indirect relation path dbp:publisher \(\rightarrow\) dbo:country.

  • ImRL incorporates a unique path ranking model that aligns textual information in word embeddings with structural information in knowledge graph embeddings. This model is designed to capture the correlation between single relations and relation paths, effectively addressing relation phrases with vague meanings.
  • To enhance the model’s performance, the authors integrate external paraphrase dictionaries using a gated mechanism with attention. This feature injects prior knowledge into the model, aiding in the disambiguation of relation phrases.
  • The figure below from the paper shows an overview of ImRL. The method has two parts: (1) Path generation parses the input question and finds the relation path candidates in the KG, by entity linking, relation identification and candidate generation. (2) Path ranking encodes the relation phrase in the question and path candidates in the KG in the BERT embedding space and RotatE embedding space, utilizes a ranking model to rank those candidates, and takes the one with the highest similarity score as answer. It also leverages a gated mechanism with attention to inject prior knowledge from external dictionaries to help relation disambiguation.

  • The paper presents a comprehensive evaluation using two benchmark datasets and a newly created dataset, demonstrating that ImRL significantly outperforms existing state-of-the-art methods, particularly in scenarios involving implicit relation linking.
  • The authors’ experiments and results highlight ImRL’s effectiveness in dealing with the inherent challenges of knowledge-based question answering systems, such as handling incomplete knowledge graphs and interpreting ambiguous natural language expressions.
Galactica: A Large Language Model for Science
  • This paper by Taylor et al. from Meta AI introduces Galactica, a large language model designed for organizing scientific knowledge, trained on a diverse corpus including scientific papers, textbooks, and more.
  • The model stands out for its dataset design, incorporating scientific texts and task-specific datasets in pre-training, enhancing knowledge composition for various tasks.
  • Galactica utilizes specialized tokenization for various scientific formats, aiding in tasks like citation prediction, LaTeX equations, and protein sequences.
  • The figure below from the paper illustrates Galactica’s tokenization: Galactica trains on text sequences that represent scientific phenomena.

  • It shows superior performance in scientific tasks and knowledge probes, outperforming general models like GPT-3 in LaTeX equations and achieving state-of-the-art results in tasks like PubMedQA.
  • The model incorporates a unique working memory token (<work>) for complex reasoning, demonstrating strong performance in reasoning tasks and mathematical benchmarks.
  • The figure below from the paper shows model-machine symbiosis. They show an example answer with the <work> working memory token. It performs exact steps for rearranging the equation, and when it reaches a calculation that it cannot solve reliably in a forward-pass, it writes a program, which can then be offloaded to a classical computer.

  • The architecture involves a Transformer setup with modifications like GeLU activation, a 2048-length context window, and a 50k BPE vocabulary.
  • Training involved multiple model sizes up to 120B parameters, showing improved performance with repeated tokens, contrary to previous beliefs in the field.
  • Galactica’s potential as a scientific knowledge interface is highlighted, with applications in synthesizing and organizing scientific information.
MuRAG: Multimodal Retrieval-Augmented Generator
  • This paper by Chen et al. from Google Research proposes Multimodal Retrieval-Augmented Transformer (MuRAG), which looks to extend the retrieval process beyond text to include other modalities like images or structured data, which can then be used alongside textual information to inform the generation process.
  • MuRAG’s magic lies in its two-phase training approach: pre-training and fine-tuning, each carefully crafted to build the model’s ability to tap into a vast expanse of multimodal knowledge.
  • The key goal of MuRAG is to incorporate both visual and textual knowledge into language models to improve their capability for multimodal question answering.
  • MuRAG is distinct in its ability to access an external non-parametric multimodal memory (images and texts) to enhance language generation, addressing the limitations of text-only retrieval in previous models.
  • MuRAG has a dual-encoder architecture that combines a pre-trained visual transformer (ViT) and a text encoder (T5) to create a backbone encoder, enabling the encoding of image-text pairs, image-only, and text-only inputs into a unified/joint multimodal representation.
  • MuRAG is pre-trained on a mixture of image-text data (LAION, Conceptual Captions) and text-only data (PAQ, VQA). It uses a contrastive loss for retrieving relevant knowledge and a generation loss for answer prediction. It employs a two-stage training pipeline: initial training with small in-batch memory followed by training with a large global memory.
  • During the retriever stage, MuRAG takes a query \(q\) of any modality as input and retrieves from a memory \(\mathcal{M}\) of image-text pairs. Specifically, we apply the backbone encoder \(f_\theta\) to encode a query \(q\), and use maximum inner product search (MIPS) over all of the memory candidates \(m \in \mathcal{M}\) to find the top-\(k\) nearest neighbors \(\operatorname{Top}_K(\mathcal{M} \mid q)=\left[m_1, \cdots, m_k\right]\). Formally, we define \(\operatorname{Top}_K(\mathcal{M} \mid q)\) as follows:
\[\operatorname{Top}_K(\mathcal{M} \mid q)=\underset{m \in \mathcal{M}}{\operatorname{Top}_K}\; f_\theta(q)_{[\mathrm{CLS}]} \cdot f_\theta(m)_{[\mathrm{CLS}]}\]
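  • The retrieval step can be sketched as a brute-force maximum inner product search over the precomputed memory embeddings (a toy stand-in with my own names; production systems use an approximate MIPS index rather than an exhaustive scan):

```python
def top_k_mips(query_vec, memory_vecs, k):
    """Brute-force maximum inner product search over precomputed [CLS]
    memory embeddings; returns the indices of the top-k candidates."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Rank every memory entry by its inner product with the query embedding.
    ranked = sorted(range(len(memory_vecs)),
                    key=lambda i: dot(query_vec, memory_vecs[i]),
                    reverse=True)
    return ranked[:k]
```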
  • During the reader stage, the retrievals (the raw image patches) are combined with the query \(q\) as an augmented input \(\left[m_1, \cdots, m_k, q\right]\), which is fed to the backbone encoder \(f_\theta\) to produce retrieval augmented encoding. The decoder model \(g_\theta\) uses attention over this representation to generate textual outputs \(\mathbf{y}=y_1, \cdots, y_n\) token by token.

    \[p\left(y_i \mid y_{1: i-1}\right)=g_\theta\left(y_i \mid f_\theta\left(\operatorname{Top}_K(\mathcal{M} \mid q) ; q\right) ; y_{1: i-1}\right)\]
    • where \(y\) is decoded from a given vocabulary \(\mathcal{V}\).
  • The figure below from the paper (source) shows how the model taps into an external repository to retrieve a diverse range of knowledge encapsulated within both images and textual fragments. This multimodal information is then employed to enhance the generative process. The upper section outlines the setup for the pre-training phase, whereas the lower section specifies the framework for the fine-tuning phase.

  • The process can be summarized as follows:
    • For retrieval, MuRAG uses maximum inner product search to find the top-\(k\) most relevant image-text pairs from the memory given a question. The “memory” here refers to the external knowledge base that the model can retrieve information from. Specifically, the memory contains a large collection of image-text pairs that are encoded offline by the backbone encoder prior to training.
    • During training and inference, given a question, MuRAG’s retriever module will search through this memory to find the most relevant image-text pairs using maximum inner product search.
    • The memory serves as the knowledge source and can contain various types of multimodal data like images with captions, passages of text, tables, etc. that are related to the downstream task.
    • For example, when fine-tuning on the WebQA dataset, the memory contains 1.1 million image-text pairs extracted from Wikipedia that the model can retrieve from to answer questions.
    • So in summary, the memory is the large non-parametric external knowledge base encoded in a multimodal space that MuRAG learns to retrieve relevant knowledge from given a question, in order to augment its language generation capabilities. The memory provides the world knowledge to complement what is stored implicitly in the model’s parameters.
    • For reading, the retrieved multimodal context is combined with the question embedding and fed into the decoder to generate an answer.
  • MuRAG achieves state-of-the-art results on two multimodal QA datasets - WebQA and MultimodalQA, outperforming text-only methods by 10-20% accuracy. It demonstrates the value of incorporating both visual and textual knowledge.
  • Key limitations are the reliance on large-scale pre-training data, computational costs, and issues in visual reasoning like counting objects. But overall, MuRAG represents an important advance in building visually-grounded language models.
Distilling Knowledge from Reader to Retriever for Question Answering
  • This paper by Izacard and Grave from Facebook AI Research, École normale supérieure, and Inria, published at ICLR 2021, proposes a novel approach for training retriever systems in open-domain question answering tasks without strong supervision in the form of query-document pairs.
  • The authors introduce a method leveraging the attention scores of a sequence-to-sequence reader model to generate synthetic labels for training a retriever model. This approach, inspired by knowledge distillation, uses the reader model’s cross-attention activations over input documents as a relevance measure for training the retriever.
  • The methodology includes:
    1. Utilizing a dense bi-encoder for passage retrieval, encoding questions and passages into fixed-size representations.
    2. Distilling the reader’s cross-attention scores to the bi-encoder, employing various loss functions like KL-divergence, mean squared error, and max-margin loss.
    3. Implementing iterative training, where the reader and retriever are trained alternately, improving the retriever’s performance with each iteration.
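  • As an illustration of step 2, the KL-divergence variant can be sketched as follows (a toy sketch over raw score lists; the direction of the divergence and all names here are my assumptions, with the reader’s aggregated cross-attention treated as a fixed target):

```python
import math

def kl_distill_loss(retriever_scores, reader_attn):
    """KL-divergence distillation objective (toy sketch).

    The reader's aggregated cross-attention scores define a target
    distribution over retrieved passages; the retriever's score
    distribution is trained to match it.
    """
    def softmax(xs):
        m = max(xs)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(reader_attn)       # target from the reader (held fixed)
    q = softmax(retriever_scores)  # distribution being trained
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

  • The loss vanishes when the retriever already ranks passages the way the reader attends to them, and grows as the two distributions diverge.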
  • The authors demonstrate the effectiveness of this method on several question-answering benchmarks, achieving state-of-the-art results. The experiments used datasets like TriviaQA, NaturalQuestions, and NarrativeQA, and the iterative training process showed consistent improvement in retrieval accuracy and question answering performance.
  • The paper contributes significantly to the field by enabling the training of efficient retrievers without requiring annotated pairs of queries and documents, a common limitation in information retrieval tasks. The approach is particularly beneficial for tasks with limited or no supervision, like fact-checking or long-form question answering.
  • Code
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
  • In this paper by Lu et al. from the University of California Los Angeles, Arizona State University, and the Allen Institute for AI, presented at NeurIPS 2022, the authors propose a new approach for science question answering, leveraging multimodal reasoning through the generation of explanatory thought chains.
  • The core contribution is the development of the Science Question Answering (ScienceQA) dataset, consisting of 21,208 multimodal multiple-choice questions across diverse science topics, coupled with annotated lectures and explanations.
  • The figure below from the paper shows a sample of the ScienceQA dataset where a data example consists of multimodal question answering information and the grounded lecture and explanation. They study if QA models can generate a reasonable explanation to reveal the chain-of-thought reasoning.

  • The authors propose a method where language models are trained to generate lectures and explanations, forming a ‘Chain of Thought’ (CoT), to mimic human-like multi-hop reasoning processes. This approach aims to improve both the interpretability and performance of AI systems in science question answering.
  • Experimental results show that incorporating CoT into language models enhances question-answering performance. Specifically, a 1.20% improvement was observed in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA models. Further, the study explores the upper bound for model performance when feeding explanations directly into the input, resulting in an 18.96% improvement in few-shot GPT-3’s performance.
  • The study also underscores how language models, akin to humans, benefit from explanations, achieving comparable performance with just 40% of the training data.
  • The implementation details include fine-tuning UnifiedQA and prompting GPT-3 with the CoT framework. UnifiedQA was fine-tuned to generate a sequence consisting of an answer followed by lecture and explanation. GPT-3 was prompted using a CoT style, where the model generates the answer and then the lecture and explanation.
  • The paper’s findings suggest that explanations in the CoT format not only aid in better performance but also in learning efficiency, demonstrating the potential of this approach in educational and AI interpretability contexts.
  • Project page; Leaderboard
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
  • This paper by Ben-Zaken et al. from Yoav Goldberg’s group at Bar Ilan University and the Allen Institute for Artificial Intelligence introduces BitFit, a fine-tuning method for pre-trained BERT models. BitFit focuses on updating only the bias-terms of the model, which are a minimal fraction of the model’s parameters, effectively reducing the memory footprint and computational demands typically associated with full model fine-tuning.
  • BitFit’s methodology leverages the observation that fine-tuning often doesn’t require extensive retraining of all parameters. Instead, fine-tuning only the bias terms achieves competitive results compared to full model fine-tuning, especially with small to medium-sized datasets. In scenarios permitting slight performance degradation, the method can be constrained to adjust only two specific types of bias terms, representing just 0.04% of the total model parameters.
  • Implementation details include freezing the transformer-encoder’s main weights and training only the bias terms along with a task-specific classification layer. This approach allows the model to handle multiple tasks efficiently in a streaming fashion without requiring simultaneous access to all task datasets.
  • Experimental results on the GLUE benchmark show that BitFit is comparable or superior to full fine-tuning in several NLP tasks. It also outperforms other parameter-efficient methods like Diff-Pruning and Adapters in terms of the number of parameters modified, showcasing its effectiveness in achieving high performance with significantly fewer trainable parameters.
  • The findings underscore the potential of focusing fine-tuning efforts on a small subset of parameters, specifically bias terms, to maintain or even enhance performance while minimizing computational costs. This approach also prompts further exploration of the role of bias terms in neural networks and their impact on model behavior and task transferability.
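  • The parameter selection at the heart of BitFit can be sketched as a simple name filter (a hypothetical sketch; the exact parameter-name patterns are assumptions and depend on the model implementation):

```python
def bitfit_trainable_params(param_names):
    """Select the parameters BitFit leaves trainable: the bias terms plus
    a task-specific classification head; everything else stays frozen."""
    return [name for name in param_names
            if name.endswith(".bias") or name.startswith("classifier.")]
```

  • In a framework like PyTorch, the same filter would be applied to `named_parameters()` to set `requires_grad` on the selected subset.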
Recurrent Memory Transformer
  • This paper by Bulatov et al. from Moscow Institute of Physics and Technology and AIRI, presented at NeurIPS 2022, introduces the Recurrent Memory Transformer (RMT), a memory-augmented segment-level recurrent Transformer designed to improve the processing of long sequences. Transformer-based models are effective across multiple domains due to their self-attention mechanism, which combines information from all sequence elements into context-aware representations. However, these models have limitations, such as storing global and local information within single element-wise representations and the quadratic computational complexity of self-attention that limits input sequence length.
  • The proposed RMT addresses these limitations by incorporating memory tokens into the input or output sequences, allowing the model to store and process both local and global information efficiently. These memory tokens facilitate the passage of information between segments of long sequences through recurrence. Unlike previous memory-augmented models, RMT requires no modifications to the Transformer architecture itself, making it compatible with existing Transformer models.
  • Implementation details include adding special memory tokens to the sequence, which are used to read from and write to memory states. These memory tokens are positioned at the beginning and end of each segment, enabling seamless information transfer across segments. During training, gradients flow through the memory tokens from one segment to the previous ones, leveraging Backpropagation Through Time (BPTT) for effective learning.
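  • The segment-level recurrence can be sketched as follows (a toy sketch where `transformer` is a stand-in callable mapping a token list to an output list of equal length, and tokens are plain Python objects; all names are my own):

```python
def rmt_process(tokens, seg_len, memory, transformer):
    """Segment-level recurrence: memory tokens are prepended (read) and
    appended (write) to each segment, and the updated write-memory
    representations are carried over to the next segment."""
    n_mem = len(memory)
    outputs = []
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        out = transformer(memory + segment + memory)
        outputs.extend(out[n_mem:n_mem + len(segment)])  # sequence outputs
        memory = out[-n_mem:]  # write-memory passed to the next segment
    return outputs, memory
```

  • During training, gradients would flow backwards through `memory` across segment boundaries via BPTT; this sketch only shows the forward recurrence.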
  • The following figure from the paper shows the Recurrent Memory Transformer. Memory is added as tokens to the input sequence and memory output is passed to the next segment. During training gradients flow from the current segment through memory to the previous segment.

  • The following figure from the paper shows a comparison of Recurrent Memory Transformer (RMT) and Transformer-XL architectures. Recurrent Memory Transformer augments Transformer with global memory tokens and passes them to allow a segment-level recurrence. Special read/write memory tokens are added to the input sequence. Multiple memory tokens can be used in each read/write block. Updated representations of write memory are passed to the next segment. During training, RMT uses BPTT to propagate gradient to previous segments through memory tokens representation. Effective context length for recurrence with memory is not limited by the depth of a network which is the case for the cache of Transformer-XL.

  • Experimental results show that RMT performs on par with Transformer-XL in language modeling tasks when using smaller memory sizes and outperforms it in tasks requiring longer sequence processing. Specifically, RMT demonstrated superior performance on tasks requiring long-term dependency management, such as copy, reverse, associative retrieval, and quadratic equations. Additionally, combining RMT with Transformer-XL’s cache mechanism further improved language modeling performance.
  • The attention maps visualized in the experiments showed that RMT effectively separates memory operations from sequence token representations, avoiding the mixing issues observed in Transformer-XL. This separation contributes to RMT’s enhanced ability to handle long-term dependencies and general-purpose memory processing, making it a promising architecture for tasks involving algorithmic reasoning and other complex applications.
  • The RMT model’s efficiency in memory usage is notable, maintaining performance even with smaller memory sizes compared to Transformer-XL. The ability to augment any Transformer-based model with memory tokens without altering the underlying architecture underscores the practicality and versatility of RMT.
  • Overall, the Recurrent Memory Transformer introduces a robust method for extending the capabilities of Transformer models, particularly in scenarios requiring extensive sequence processing and memory management, while maintaining compatibility with existing architectures and improving computational efficiency. This makes RMT a promising solution for applications that necessitate learning long-term dependencies and efficient memory processing, such as algorithmic tasks and complex reasoning.


ReAct: Synergizing Reasoning and Acting in Language Models
  • While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g., chain-of-thought prompting) and acting (e.g., action plan generation) have primarily been studied as separate topics.
  • This paper by Yao et al. from Princeton and Google Brain in ICLR 2023 proposes ReAct, an approach that reasons and acts by exploring the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.
  • They apply ReAct to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components.
  • Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces.
  • On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.
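  • The interleaved Thought → Action → Observation loop can be sketched as follows (a hypothetical sketch: `llm` and `env` are stand-in callables, and the `Search[...]`/`Finish[...]` action strings mirror the paper’s HotpotQA examples, but none of this is the authors’ code):

```python
def react_loop(question, llm, env, max_steps=7):
    """Interleaved reasoning and acting (sketch).

    `llm` maps the trajectory so far to a (thought, action) pair; `env`
    maps an action string (e.g. a Wikipedia API call) to an observation.
    """
    trajectory = f"Question: {question}"
    for _ in range(max_steps):
        thought, action = llm(trajectory)
        trajectory += f"\nThought: {thought}\nAction: {action}"
        if action.startswith("Finish["):
            return action[len("Finish["):-1], trajectory  # final answer
        trajectory += f"\nObservation: {env(action)}"
    return None, trajectory  # step budget exhausted
```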
  • The figure below from the paper shows: (1) a comparison of four prompting methods, (a) standard, (b) Chain-of-Thought (CoT, Reason Only), (c) Act-only, and (d) ReAct (Reason+Act), solving a HotpotQA question; and (2) a comparison of (a) Act-only and (b) ReAct prompting solving an ALFWorld task. In both domains, the in-context examples are omitted from the prompt; only the task-solving trajectories generated by the model (Act, Thought) and the environment (Obs) are shown.

LLaMA: Open and Efficient Foundation Language Models
  • This paper by Touvron et al. from Meta AI in 2023 introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
  • They train LLaMA models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
  • In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. They release all their models to the research community.
  • Please refer to the LLaMA primer for an article on LLaMA.
Alpaca: A Strong, Replicable Instruction-Following Model
  • This blog post from Stanford introduces Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On their preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce.
Transformer models: an introduction and catalog
  • In the past few years, we have seen the meteoric appearance of dozens of models of the Transformer family, all of which have funny, but not self-explanatory, names. The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovations in Transformer models.
  • Spreadsheet tabulation of the paper.
  • The following plot from the paper shows the transformers family tree with prevalent models:

  • And, the plot below from the paper shows the timeline for prevalent transformer models:

  • Lastly, the plot below, again from the paper, shows the timeline vs. number of parameters for prevalent transformer models:

Learning to Compress Prompts with Gist Tokens
  • Prompting is now the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and re-encoding the same prompt is computationally inefficient.
  • Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task.
  • This paper by Mu et al. from Stanford in 2023 avoids this trade-off entirely by presenting gisting, which trains an LM to compress prompts into smaller sets of “gist” tokens which can be reused for compute efficiency.
  • Gist models can be easily trained as part of instruction finetuning via a restricted attention mask that encourages prompt compression.
  • On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, storage savings, and minimal loss in output quality.
  • The figure below from the paper shows prompting (top), which retains the multitask capabilities of LMs, but is computationally inefficient. Finetuning/distillation (middle) removes the dependence on prompts, but requires training a model for each task. Gisting (bottom) compresses prompts into a smaller set of gist tokens, saving compute while also generalizing to novel prompts during deployment.

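The restricted attention mask behind gisting can be sketched with a toy example. Assuming a token layout of [prompt | gist tokens | completion] (sizes here are hypothetical), positions after the gist tokens are blocked from attending to the raw prompt, so any information about the prompt must flow through the gist tokens. This is an illustrative sketch, not the paper's released code:

```python
import numpy as np

def gist_attention_mask(n_prompt, n_gist, n_completion):
    """Build a causal attention mask (True = attention allowed) where positions
    after the gist tokens cannot attend to the original prompt tokens."""
    n = n_prompt + n_gist + n_completion
    # Standard causal mask: each position attends to itself and earlier positions.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Positions after the gists are blinded to the raw prompt; the gist tokens
    # themselves still see the prompt, which is what lets them compress it.
    mask[n_prompt + n_gist:, :n_prompt] = False
    return mask

mask = gist_attention_mask(n_prompt=4, n_gist=2, n_completion=3)
# Completion rows can see the gist tokens and earlier completions, but not
# the prompt columns 0-3.
print(mask[7])
```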
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  • This paper by Zhang et al. from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and UCLA presents LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.
  • Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs.
  • Specifically, they adopt a set of learnable adaption prompts, and prepend them to the input text tokens at higher transformer layers. Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserving its pre-trained knowledge.
  • With efficient training, LLaMA-Adapter generates high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Furthermore, their approach can be simply extended to multi-modal input, e.g., images, for image-conditioned LLaMA, which achieves superior reasoning capacity on ScienceQA.
  • Code.
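The zero-gating idea can be sketched as follows: a learnable scalar gate, initialized to zero, scales the contribution of the adaption prompts, so training starts from the frozen model's exact behavior and the new cues are injected gradually. This is a simplified single-head sketch (the paper gates the prompt portion inside a shared softmax; here the two branches are softmaxed separately for clarity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def zero_init_attention(q, k_tokens, v_tokens, k_prompt, v_prompt, gate):
    """Attention over ordinary tokens plus learnable adaption prompts, with the
    prompt branch scaled by tanh(gate). With gate = 0 the prompts contribute
    nothing, preserving the frozen model's output exactly."""
    d = q.shape[-1]
    s_tok = q @ k_tokens.T / np.sqrt(d)   # scores on the original tokens
    s_pr = q @ k_prompt.T / np.sqrt(d)    # scores on the adaption prompts
    return softmax(s_tok) @ v_tokens + np.tanh(gate) * (softmax(s_pr) @ v_prompt)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8)); kt = rng.normal(size=(5, 8)); vt = rng.normal(size=(5, 8))
kp = rng.normal(size=(2, 8)); vp = rng.normal(size=(2, 8))
# At initialization (gate = 0), the output equals plain attention over tokens.
base = zero_init_attention(q, kt, vt, kp, vp, gate=0.0)
assert np.allclose(base, softmax(q @ kt.T / np.sqrt(8)) @ vt)
```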
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  • How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4.
  • This paper by Zhang et al. from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and Rutgers University presents LLaMA-Adapter V2, a parameter-efficient visual instruction model.
  • Specifically, they first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, they propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters.
  • This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
  • During inference, they incorporate additional expert models (e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, their LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions.
  • The figure below from the paper shows the training pipeline of LLaMA-Adapter V2. They introduce several strategies to enhance the capability of LLaMA-Adapter, which enable a parameter-efficient visual instruction model with superior multi-modal reasoning.

LIMA: Less Is More for Alignment
  • Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences.
  • This paper by Zhou et al. from Meta AI, Carnegie Mellon University, University of Southern California, and Tel Aviv University in 2023 measures the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.
  • They define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples. To that end, they collect a dataset of 1,000 prompts and responses, where the outputs (responses) are stylistically aligned with each other, but the inputs (prompts) are diverse. Specifically, they seek outputs in the style of a helpful AI assistant. They curate such examples from a variety of sources, primarily split into community Q&A forums and manually authored examples. They also collect a test set of 300 prompts and a development set of 50.
  • LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.
  • Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
  • The figure below from the paper shows (left) the human preference evaluation, comparing LIMA to 5 different baselines across 300 test prompts; (right) preference evaluation using GPT-4 as the annotator, given the same instructions provided to humans.

Language Is Not All You Need: Aligning Perception with Language Models
  • This paper by Huang et al. from Microsoft, introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) designed to perceive various modalities, learn in context (few-shot learning), and follow instructions (zero-shot learning). The model is trained from scratch on a web-scale multimodal corpus comprising interleaved text and images, image-caption pairs, and text data. KOSMOS-1 demonstrates remarkable performance in language understanding and generation, OCR-free NLP, perception-language tasks like multimodal dialogue and image captioning, and vision tasks such as image recognition with textual descriptions.
  • KOSMOS-1, a Transformer-based causal language model, auto-regressively generates texts and handles multimodal input via a Transformer decoder. The input format includes special tokens to indicate the beginning and end of sequences and encoded image embeddings.
  • The figure below from the paper shows that KOSMOS-1 is a multimodal large language model (MLLM) that is capable of perceiving multimodal input, following instructions, and performing in-context learning for not only language tasks but also multimodal tasks. In this work, we align vision with large language models (LLMs), advancing the trend of going from LLMs to MLLMs.

  • Technical details of the implementation include using MAGNETO, a Transformer variant, as the backbone architecture, and XPOS for relative position encoding. MAGNETO offers training stability and improved performance across modalities, while XPOS enhances long-context modeling and attention resolution.
  • The training involves web-scale multimodal corpora and focuses on next-token prediction to maximize log-likelihood of tokens. The data sources for training include The Pile, Common Crawl, LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions. The model also undergoes language-only instruction tuning using the Unnatural Instructions and FLANv2 datasets to align better with human instructions.
  • Evaluation of KOSMOS-1 covered a wide array of tasks:
    • Language tasks: language understanding, generation, and OCR-free text classification.
    • Cross-modal transfer and commonsense reasoning.
    • Nonverbal reasoning using Raven’s Progressive Matrices.
    • Perception-language tasks like image captioning and visual question answering.
    • Vision tasks, including zero-shot image classification.
  • In perception-language tasks, the model excels in image captioning and visual question answering. For image captioning, it was tested on MS COCO Caption and Flickr30k, achieving a CIDEr score of 67.1 on the Flickr30k dataset. In visual question answering, KOSMOS-1 showed higher accuracy and robustness on VQAv2 and VizWiz datasets compared to other models.
  • OCR-free language understanding involved understanding text within images without OCR. WebSRC dataset was used for evaluating web page question answering, where KOSMOS-1 showed the ability to benefit from the layout and style information of web pages in images.
  • Chain-of-thought prompting was also investigated, enabling KOSMOS-1 to generate a rationale first, then tackle complex question-answering and reasoning tasks. This approach showed better performance compared to standard prompting methods.
  • For zero-shot image classification on ImageNet, KOSMOS-1 significantly outperformed GIT in both constrained and unconstrained settings. The approach involved prompting the model with an image and a corresponding natural language query to predict the category name of the image.
  • Code
QLoRA: Efficient Finetuning of Quantized LLMs
  • This paper by Dettmers et al. from UW presents QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. Put simply, QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
  • QLoRA combines the principles of Low-Rank Adaptation (LoRA) with an innovative 4-bit NormalFloat (NF4) quantization scheme and a Double Quantization technique, making it possible to fine-tune models as large as 65B parameters on limited GPU memory while optimizing parameter efficiency and computational resource utilization.
  • At a top-level, QLoRA operates based on the following steps:
    • Quantize the pre-trained model to 4 bits and freeze it.
    • Attach small, trainable adapter layers (similar to LoRA).
    • Finetune only the adapter layers while using the frozen quantized model for context.
  • Key Components:
    1. Low-Rank Adaptation: QLoRA follows LoRA’s strategy of injecting trainable low-rank matrices into the architecture of pretrained LLMs, specifically targeting Transformer layers. This selective fine-tuning strategy focuses on optimizing these low-rank matrices rather than the entire model, reducing the number of trainable parameters and computational costs.
    2. Quantization: The distinguishing aspect of QLoRA lies in its quantization approach, which includes:
      • NF4 Quantization: This technique involves quantizing the model weights to 4-bit NormalFloat (NF4), efficiently compressing them to fit a specific distribution suitable for NF4 without complex algorithms.
      • Double Quantization: This secondary quantization further reduces memory overhead by quantizing the quantization constants themselves, using 8-bit floats with a block size of 256, achieving significant memory savings without affecting model performance.
  • Operation:
    • QLoRA employs a frozen, 4-bit quantized pretrained language model and fine-tunes it by backpropagating gradients into the low rank adapters. This method optimizes computation through low-bit quantization and reduces the number of parameters by using low-rank structures, striking a balance between efficiency and performance.
  • Their best model family, which they name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; (b) double quantization to reduce the average memory footprint by quantizing the quantization constants; and (c) paged optimizers to manage memory spikes.
  • They use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models).
  • Their results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. They provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, they find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT.
  • The figure below from the paper shows different finetuning methods and their memory requirements. QLORA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.

  • In the QLoRA approach, it is the original model’s weights that are quantized to 4-bit precision. The newly added Low-rank Adapter (LoRA) weights are not quantized; they remain at a higher precision and are fine-tuned during the training process. This strategy allows for efficient memory use while maintaining the performance of large language models during finetuning.
  • To learn more about QLoRA and how it works, the Weights blog post is highly recommended.
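The block-wise quantization idea at the heart of QLoRA can be sketched as follows. Note this sketch uses a simple uniform 4-bit grid for illustration; the actual NF4 data type places its 16 levels at the quantiles of a normal distribution, and the real implementation lives in the bitsandbytes library:

```python
import numpy as np

def blockwise_quantize(w, block_size=64, bits=4):
    """Block-wise absmax quantization sketch: each block of weights gets its own
    scaling constant, and values are rounded onto a small signed-integer grid."""
    w = w.reshape(-1, block_size)
    absmax = np.abs(w).max(axis=1, keepdims=True)   # one constant per block
    levels = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    q = np.round(w / absmax * levels).astype(np.int8)
    return q, absmax

def blockwise_dequantize(q, absmax, bits=4):
    levels = 2 ** (bits - 1) - 1
    return (q.astype(np.float32) / levels) * absmax

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
q, absmax = blockwise_quantize(w, block_size=64)
w_hat = blockwise_dequantize(q, absmax).reshape(-1)
# Double quantization would additionally quantize `absmax` (the per-block
# constants) to 8-bit floats in groups of 256 blocks, shrinking their overhead.
print(np.abs(w - w_hat).max())  # reconstruction error is small relative to scale
```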
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • Large-scale unsupervised language models (LMs) acquire extensive world knowledge and reasoning skills, but precisely controlling their behavior is challenging due to their unsupervised training nature. Traditionally, methods like Reinforcement Learning from Human Feedback (RLHF), discussed earlier in this article, are used to steer these models, involving two stages: training a reward model based on human preference labels and then fine-tuning the LM to align with these preferences using reinforcement learning (RL). However, RLHF is complex and often unstable: it requires first fitting a reward model and then training a policy to optimize that reward, a procedure prone to stability concerns.
  • This paper by Rafailov et al. from Stanford in 2023 introduces Direct Preference Optimization (DPO), a novel approach that simplifies and enhances this process. DPO leverages a mathematical relationship between optimal policies and reward functions, demonstrating that the constrained reward maximization problem in RLHF can be optimized more effectively with a single stage of policy training. DPO redefines the RLHF objective by showing that the reward can be rewritten purely as a function of policy probabilities, allowing the LM to implicitly define both the policy and the reward function. This innovation eliminates the need for a separate reward model and the complexities of RL.
  • This paper introduces a novel algorithm that gets rid of the two stages of RL, namely fitting a reward model, and training a policy to optimize the reward via sampling. The second stage is particularly hard to get right due to stability concerns, which DPO sidesteps entirely. The way it works is, given a dataset of the form <prompt, worse completion, better completion>, you train your LLM using a new loss function which essentially encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion, weighted by how strongly the implicit reward model prefers the better completion. This method obviates the need for an explicit reward model, as the LLM itself acts as a reward model. The key advantage is that it’s a straightforward loss function optimized using backpropagation.
  • The stability, performance, and computational efficiency of DPO are significant improvements over traditional methods. It eliminates the need for sampling from the LM during fine-tuning, fitting a separate reward model, or extensive hyperparameter tuning.
  • The figure below from the paper illustrates that DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.

  • Experiments demonstrate that DPO can fine-tune LMs to align with human preferences as effectively, if not more so, than traditional RLHF methods. It notably surpasses RLHF in controlling the sentiment of generations and enhances response quality in tasks like summarization and single-turn dialogue. Its implementation and training processes are substantially simpler.
  • In essence, DPO represents a groundbreaking shift in training language models to align with human preferences. It consolidates the two-stage process of RLHF into a single, efficient end-to-end policy learning approach. By reparameterizing the reward function and unifying policy learning and reward modeling into one streamlined optimization process, DPO offers a more efficient and lightweight method for training language models to match human preferences.
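The DPO objective described above can be sketched for a single preference pair. The `beta` value and example log-probabilities below are illustrative; a real trainer would compute the sequence log-probs from the policy and a frozen reference model in batches and backpropagate through the policy's log-probs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) pair, following the paper:
        L = -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    where logp_* are sequence log-probs under the policy and ref_* under the
    frozen reference model. The bracketed term is the implicit reward margin."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2) ~= 0.693.
print(dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0))
# ~= 0.513
```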
Deduplicating Training Data Makes Language Models Better
  • This paper by Lee et al. from Google Brain in 2022 finds that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
  • They develop two tools that allow them to deduplicate training datasets – for example, removing from C4 a single 61-word English sentence that is repeated over 60,000 times.
  • Deduplication allows them to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. They can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation.
  • Code.
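The near-duplicate criterion can be sketched with a brute-force n-gram Jaccard check. The paper itself scales this up with MinHash (to approximate Jaccard across millions of documents) and suffix arrays (for exact substring duplicates), so the following is only an illustration of the underlying similarity measure:

```python
def ngram_set(text, n=5):
    """Word n-grams of a document, used as the item set for Jaccard similarity."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.8, n=5):
    """Flag document pairs whose n-gram Jaccard similarity exceeds a threshold.
    Brute-force O(pairs) comparison; MinHash replaces this at scale."""
    sets = [ngram_set(d, n) for d in docs]
    return [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
            if jaccard(sets[i], sets[j]) >= threshold]

docs = [
    "the quick brown fox jumps over the lazy dog every single day",
    "the quick brown fox jumps over the lazy dog every single morning",
    "completely different text about language models and training data",
]
print(near_duplicates(docs, threshold=0.8, n=3))  # → [(0, 1)]
```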
Llama 2: Open Foundation and Fine-Tuned Chat Models
  • Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) from Meta AI ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Their models outperform open-source chat models on most benchmarks they tested, and based on their human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. They provide a detailed description of their approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on their work and contribute to the responsible development of LLMs.
  • Llama 2 is powered by Ghost Attention (GAtt), introduced in the paper, which improves multi-turn memory. From section 3.3 in the technical report:
    • “In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly, or to “act as” some public figure. When we provided such instructions to Llama 2-Chat, the subsequent response should always respect the constraint. However, our initial RLHF models tended to forget the initial instruction after a few turns of dialogue, as illustrated in the below figure, which shows that issues with multi-turn memory (left) can be improved with GAtt (right).

    • To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation (Bai et al., 2022) that hacks the fine-tuning data to help the attention focus in a multi-stage process. GAtt enables dialogue control over multiple turns, as illustrated in the figure above (right).
    • GAtt Method: Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages \(\left[u_1, a_1, \ldots, u_n, a_n\right]\), where \(u_n\) and \(a_n\) correspond to the user and assistant messages for turn \(n\), respectively. Then, we define an instruction, inst, that should be respected throughout the dialogue. For example, inst could be “act as.” We can then synthetically concatenate this instruction to all the user messages of the conversation.
    • Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and their sample. To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages.
    • For the training instructions, we created a few synthetic constraints to sample from: Hobbies (“You enjoy e.g. Tennis”), Language (“Speak in e.g. French”), or Public Figure (“Act as e.g. Napoleon”). To obtain the lists of hobbies and public figures, we asked Llama 2-Chat to generate it, avoiding a mismatch between the instruction and model knowledge (e.g., asking the model to act as someone it had not encountered during training). To make the instructions more complex and diverse, we construct the final instruction by randomly combining the above constraints. When constructing the final system message for the training data, we also modify the original instruction half of the time to be less verbose, e.g., “Always act as Napoleon from now”-> “Figure: Napoleon.” These steps produce an SFT dataset, on which we can fine-tune Llama 2-Chat.
    • GAtt Evaluation: We applied GAtt after RLHF V3. We report a quantitative analysis indicating that GAtt is consistent up to 20+ turns, until the maximum context length is reached (see Appendix A.3.5 in the paper). We tried to set constraints not present in the training of GAtt at inference time, for instance “Always answer with Haiku,” for which the model was found to remain consistent.
    • To illustrate how GAtt helped reshape attention during fine-tuning, we display the maximum attention activations of the model in Figure 10. The left-hand side of each figure corresponds to the system message (“Act as Oscar Wilde”). From the figure above, we can see that the GAtt-equipped model (right) maintains large attention activations with respect to the system message for a larger portion of the dialogue, as compared to the model without GAtt (left).
    • Despite its utility, the current implementation of GAtt is vanilla, and more development and iteration on this technique could likely further benefit the model. For instance, we could teach the model to change the system message during the conversation by integrating such data during fine-tuning.”
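The loss-masking step described in the GAtt method above (zeroing the loss on all tokens from previous turns) can be sketched as follows. The `-100` ignore-index convention and the turn-id bookkeeping are illustrative assumptions, not the paper's code:

```python
IGNORE_INDEX = -100  # conventional "no loss" label in causal-LM fine-tuning

def mask_previous_turns(labels, turn_ids, last_turn):
    """Keep the training loss only on tokens from the final (sampled) turn,
    replacing labels for all earlier turns with the ignore index, so the
    instruction-augmented earlier turns shape attention without being
    directly trained on."""
    return [tok if turn == last_turn else IGNORE_INDEX
            for tok, turn in zip(labels, turn_ids)]

labels   = [11, 12, 13, 21, 22, 31, 32, 33]
turn_ids = [ 0,  0,  0,  1,  1,  2,  2,  2]
print(mask_previous_turns(labels, turn_ids, last_turn=2))
# → [-100, -100, -100, -100, -100, 31, 32, 33]
```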
  • Another important aspect that is highlighted in the report is the effect of RLHF on Llama 2, and this graph from Meta’s paper shows how high-quality human preferences data (obtained from Surge AI) keeps on improving Llama 2 – without saturation.

  • They also call out the importance of supervised fine-tuning (SFT) data quality (in the “quality is all you need” section) – it’s not about volume, but diversity and quality.
  • From Linxi Fan’s notes:
    • Llama-2 likely cost $20M+ to train. Meta has done an incredible service to the community by releasing the model with a commercially-friendly license. AI researchers from big companies were wary of Llama-1 due to licensing issues, but now many of them will jump on the ship and contribute their firepower.
    • Meta’s team did a human study on 4K prompts to evaluate Llama-2’s helpfulness. They use “win rate” as a metric to compare models, in similar spirit as the Vicuna benchmark. 70B model roughly ties with GPT-3.5-0301, and performs noticeably stronger than Falcon, MPT, and Vicuna. These real human ratings should be trusted more than academic benchmarks, because they typically capture the “in-the-wild vibe” better.
    • Llama-2 is not yet at GPT-3.5 level, mainly because of its weak coding abilities. On “HumanEval” (standard coding benchmark), it isn’t nearly as good as StarCoder or many other models specifically designed for coding. That being said, I have little doubt that Llama-2 will improve significantly thanks to its open weights.
    • Meta’s team goes above and beyond on AI safety issues. In fact, almost half of the paper is talking about safety guardrails, red-teaming, and evaluations. A round of applause for such responsible efforts!
    • In prior works, there’s a thorny trade-off between helpfulness and safety. Meta mitigates this by training 2 separate reward models. They aren’t open-source yet, but would be extremely valuable to the community.
    • Llama-2 will dramatically boost multimodal AI and robotics research. These fields need more than just blackbox access to an API.
    • So far, we have to convert the complex sensory signals (video, audio, 3D perception) to text description and then feed to an LLM, which is awkward and leads to huge information loss. It’d be much more effective to graft sensory modules directly on a strong LLM backbone.
    • The whitepaper itself is a masterpiece. Unlike GPT-4’s paper that shared very little info, Llama-2 spelled out the entire recipe, including model details, training stages, hardware, data pipeline, and annotation process. For example, there’s a systematic analysis on the effect of RLHF with nice visualizations. Quote sec 5.1: “We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.”
  • The following figure from the paper shows the training of Llama 2-Chat: This process begins with the pretraining of Llama 2 using publicly available online sources. Following this, they create an initial version of Llama 2-Chat through the application of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.

  • Summary:
    • Llama 2 is available for free (including commercial license).
    • Llama 2 can be accessed via managed services in Azure and AWS.
    • Llama 2 is trained on 2 trillion tokens, with 4 variants ranging from 7B to 70B parameters.
    • Llama 2 is intended to be used in English, with almost 90% of the pre-training data being in English.
    • The commercial license specifies a number of harmful use cases that violate the license, including spam!
    • Llama 2 is very comparable to ChatGPT 3.5 on most benchmarks (notably, it beats ChatGPT in human evaluation on helpfulness: Win 36%; Tie 32%; Loss 32%) other than coding; looking at the data mix, coding data is still quite small (classified under the “unknown” language category).
    • Llama 2 outperforms all other open-source models including Falcon and MPT, and has three variants including 7B, 13B, and 70B; the 70B variant achieves top performance across the board.
    • Benchmarks were done both on standardized ones (like MMLU) and head to head competition against other models, including PaLM-2 Bison and ChatGPT 3.5.
    • A large portion of the paper focuses on RLHF improvements and objectives which is super neat.
    • Model toxicity and evaluation is another large focus, including evaluations like red-teaming which were found in the Claude 2 model card. Generally Llama 2 performed very well with fewer safety violations than ChatGPT in human evaluations.
    • The tokenizer is the same as Llama 1 which is interesting, but the context length is now 4k, double the original 2k!
    • There’s both a regular and chat variation, as has been the trend in recent papers.
    • Llama 2 offers better domain-specificity via fine-tuning at lower cost, and better guardrails.
    • Llama 2 is trained on 40% more data than Llama 1 and performs well against benchmarks.
    • In short: companies can create their own enterprise “ChatGPT” (without sharing any data with OpenAI).
  • Quantized Llama 2 weights are available for local inference here.

  • The following diagram summarizes the key graphs/tables of the Llama 2 paper:

  • The following infographic (source) presents an overview of Llama 2:

Retentive Network: A Successor to Transformer for Large Language Models
  • This paper by Sun et al. from Microsoft Research and Tsinghua University proposes a foundation architecture called Retentive Network (RetNet) to replace the Transformer as the default backbone for language modeling, simultaneously achieving training parallelism, low-cost inference, and good performance.
  • One of the main reasons why NLP research couldn’t progress beyond a certain point with RNNs and LSTMs was that they weren’t parallelizable, which kept people from training reasonably large models that could learn long-range dependencies. Transformers enabled parallelism, but suffer from quadratic computational complexity in sequence length. RetNet is a smart “mathematical makeover” for RNNs that makes them parallelizable, circumventing their biggest limitation while still enabling linear complexity.
  • The idea behind RetNet is to combine recurrence and parallelism in a way that is flexible and combines the best of both worlds. To achieve this, the researchers introduced a mechanism called “retention” that can be formulated in both a recurrent and a parallel way (or even both at the same time). They theoretically derive the connection between recurrence and attention. Then they propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent.
  • As a rule of thumb, the parallel representation enables training parallelism, while the recurrent representation enables low-cost \(O(1)\) inference, improving decoding throughput, latency, and GPU memory without sacrificing performance.
  • The chunkwise recurrent representation, a hybrid of the two, facilitates efficient long-sequence modeling with linear complexity: each chunk is encoded in parallel while the chunks are summarized recurrently.
  • RetNet shows favorable scaling laws compared to the transformer; they observe better perplexity for sizes north of 2B parameters!
  • They trained RetNet on 512 AMD MI200 GPUs.
  • The following figure from the paper illustrates the fact that RetNet makes the “impossible triangle” possible, which achieves training parallelism, good performance, and low inference cost simultaneously.

  • The following figure from the paper presents the dual form of RetNet. “GN” is short for GroupNorm.
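  • To make the parallel/recurrent duality concrete, the following is a minimal single-head NumPy sketch (an illustration under stated assumptions, not the paper’s implementation: xPos-style rotations, gating, and GroupNorm are omitted) showing that both formulations of retention compute the same output:

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: O = (Q K^T * D) V, with decay mask D[n, m] = gamma^(n-m) for n >= m."""
    T = Q.shape[0]
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, float(gamma) ** (n - m), 0.0)  # causal, exponentially decaying mask
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_t = gamma * S_{t-1} + k_t^T v_t, o_t = q_t S_t.
    The state S is fixed-size, giving O(1) cost per decoded token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)
        out.append(q @ S)
    return np.stack(out)
```

  The parallel form is what makes training efficient on accelerators; the recurrent form is what makes decoding cheap, and the two are algebraically identical.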

  • The following figure from the paper presents RetNet which achieves low-cost inference (i.e., GPU memory, throughput, and latency), training parallelism, and favorable scaling curves compared with Transformer. Results of inference cost are reported with 8k as input length. Figure 6 shows more results on different sequence lengths. Put simply, RetNet additionally makes inference much more efficient (smaller memory footprint) and lower cost, while keeping the desirable properties of transformers (training parallelism) by offering \(O(1)\) inference, in contrast with \(O(n)\) for transformers!

  • Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. With alluring benefits on basically all fronts – less memory, higher throughput, better scaling, and faster inference – these intriguing properties make RetNet a strong successor to Transformer for large language models.
  • The following table from the paper shows the model comparison from various perspectives. RetNet achieves training parallelization, constant inference cost, linear long-sequence memory complexity, and good performance.

The case for 4-bit precision: k-bit Inference Scaling Laws
  • Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. Put simply, this paper seeks to answer the question: “what is the optimal number of bits for quantizing transformer weights if you wish to maximize zero-shot accuracy given a particular budget of total model bits?” For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies, with the 4-bit model outperforming the 8-bit model.
  • The authors study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximize zero-shot performance.
  • They run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. They find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size – splitting the parameters into small independently quantized blocks – and the quantization data type being used (e.g., Int vs Float).
  • Overall, their findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
  • There are practical limitations to this method though. Low-bit models with 16-bit inputs might be less latency efficient if such a model is deployed to be used by many users (i.e. bigger batch sizes). Something to keep in mind.
  • The following table from the paper illustrates bit-level scaling laws for mean zero-shot performance across four datasets for 125M to 176B parameter OPT models. Zero-shot performance increases steadily for fixed model bits as they reduce the quantization precision from 16 to 4 bits. At 3-bits, this relationship reverses, making 4-bit precision optimal.
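  • As a minimal sketch of the one knob the paper finds consistently helpful – small independently quantized blocks – here is symmetric absmax 4-bit quantization in NumPy (an illustrative toy, not the paper’s codebase; it assumes the weight count is divisible by the block size and blocks are not all-zero):

```python
import numpy as np

def quantize_int4_blockwise(w, block_size=64):
    """Quantize to the symmetric int4 range [-7, 7], with one absmax scale per block."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4_blockwise(q, scale, shape):
    """Reconstruct approximate weights from int4 codes and per-block scales."""
    return (q.astype(np.float32) * scale) .reshape(shape)
```

  Smaller blocks mean more scale parameters (overhead) but tighter per-block ranges, which is exactly the trade-off the scaling-law experiments quantify.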

DeBERTa: Decoding-enhanced BERT with Disentangled Attention
  • Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.
  • This paper by He et al. from Microsoft in ICLR 2021 proposes a new model architecture Decoding-enhanced BERT with disentangled attention (DeBERTa) that improves the BERT and RoBERTa models using two novel techniques.
  • The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.
  • Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training.
  • In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization.
  • They show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
  • Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, they scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
  • The following infographic (source) presents the fact that interestingly, DeBERTa-1.5B (an encoder-only model) beats Llama 2 on BoolQ, which is a nice example that encoders still outperform large decoders on classification tasks. For fairness: the DeBERTa-1.5B model was likely finetuned on the training data whereas Llama 2 was used via few-shot prompting. In that case, it highlights once more that finetuning custom LLMs remains worthwhile.
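  • The disentangled attention mechanism can be sketched as follows – an illustrative NumPy toy (not the released implementation) computing the three score terms with a clipped relative-position index, scaled by \(\sqrt{3d}\) since three terms are summed:

```python
import numpy as np

def rel_index(T, k):
    """Relative distance i - j, clipped into [0, 2k) to index position embeddings."""
    i = np.arange(T)
    return np.clip(i[:, None] - i[None, :] + k, 0, 2 * k - 1)

def disentangled_scores(Qc, Kc, Qr, Kr, rel):
    """Content-to-content + content-to-position + position-to-content attention scores."""
    T, d = Qc.shape
    scores = Qc @ Kc.T                                # content -> content
    for i in range(T):
        for j in range(T):
            scores[i, j] += Qc[i] @ Kr[rel[i, j]]     # content -> relative position
            scores[i, j] += Kc[j] @ Qr[rel[j, i]]     # relative position -> content
    return scores / np.sqrt(3 * d)                    # scale for the 3 summed terms
```

  The key point is that `Qr`/`Kr` are indexed by relative position rather than being added into the token embeddings, keeping content and position information disentangled.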

UL2: Unifying Language Learning Paradigms
  • Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups.
  • This paper by Tay et al. from Google Brain begins by disentangling architectural archetypes from pre-training objectives – two concepts that are commonly conflated. Next, they present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. They then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. They furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. They conduct extensive ablative experiments to compare multiple pre-training objectives and find that their method pushes the Pareto frontier by outperforming T5- and GPT-like models across multiple diverse setups. By scaling the model up to 20B parameters, they achieve SOTA performance on 50 well-established supervised fine-tuning based NLP tasks. The model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters. Finally, they apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B.
  • They release Flax-based T5X checkpoints for the UL2 20B & Flan-UL2 20B.
  • The following table from the paper illustrates an overview of UL2 pretraining paradigm. UL2 proposes a new pretraining objective that works well on a diverse suite of downstream tasks.

  • The following table from the paper illustrates the mixture of denoisers for training UL2. Greyed out rectangles are masked tokens that are shifted to ‘targets’ for prediction.

Graph of Thoughts: Solving Elaborate Problems with Large Language Models
  • Similar to Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models, this paper by Besta et al. from ETH Zurich, Cledar, and Warsaw University of Technology introduces Graph of Thoughts (GoT), a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT).
  • The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information (“LLM thoughts”) are vertices, and edges correspond to dependencies between these vertices.
  • This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops.
  • They illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%.
  • They ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings LLM reasoning closer to human thinking and brain mechanisms such as recurrence, both of which form complex networks.
  • The following table from the paper shows a comparison of Graph of Thoughts (GoT) with other prompting strategies.

Accelerating Large Language Model Decoding with Speculative Sampling
  • This paper by Chen et al. from Google DeepMind presents speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call.
  • Their algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics.
  • They benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
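  • The modified rejection-sampling step can be sketched as follows – an illustrative NumPy toy assuming the per-position draft distribution \(q\) and target distribution \(p\) are already computed (the full algorithm additionally samples one bonus token from the target when every draft token is accepted):

```python
import numpy as np

def speculative_accept(draft_tokens, q_probs, p_probs, rng):
    """Accept draft token x with probability min(1, p(x)/q(x)); on the first rejection,
    resample from the residual distribution max(0, p - q) and stop. This scheme
    preserves the target model's distribution exactly."""
    out = []
    for t, x in enumerate(draft_tokens):
        p, q = p_probs[t], q_probs[t]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))
        else:
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(len(p), p=residual / residual.sum())))
            break
    return out
```

  The speedup comes from scoring all draft positions with a single target-model forward pass, while the accept/resample rule guarantees no change in output distribution.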
Pretraining Language Models with Human Preferences
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Korbak et al. from University of Sussex, New York University, FAR AI, Northeastern University and Anthropic in ICML 2023 explores objective functions that incorporate human preferences when pre-training language models.
  • Assuming access to a reward function that assigns scores to document segments, on top of maximum likelihood estimation, the authors explore the following PHF (pre-training with human preferences) objective functions: maximum likelihood estimation with filtering (Solaiman & Dennison, 2021, Wang et al., 2022); conditional training (Ficler & Goldberg, 2017, Fan et al., 2018, Keskar et al., 2019); unlikelihood (Welleck et al., 2020); reward-weighted regression (Peters & Schaal, 2007); advantage-weighted regression (Peng et al., 2019).
  • The authors studied the chosen objective functions from three perspectives: (i) avoiding toxic content, (ii) avoiding leaks of personally identifiable information, and (iii) generating code that aligns with user intent.
  • The objective functions in question were evaluated from the alignment and capability perspectives through misalignment scores and KL divergence from the GPT-3 model (Brown et al., 2020), respectively. It was observed that among the PHF objective functions investigated, conditional training achieved the best balance between alignment and utility.
  • The authors also evaluated robustness to adversarial prompts for the models pre-trained with the objective functions in question. It was observed from the misalignment scores that conditional training and filtering are most robust to adversarial prompts overall.
  • The following table from the paper illustrates the toxicity score (lower is better) of LMs pretrained with the standard objective (solid blue), using conditional training (solid orange) and LMs finetuned using conditional training for 1.6B (orange dashed) and 330M tokens (orange dotted). Pretraining with Human Feedback (PHF) reduces the amount of offensive content much more effectively than finetuning with human feedback.
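  • Conditional training – the best-performing PHF objective – essentially amounts to prepending a reward-dependent control token to each training segment and conditioning on the “good” token at inference. A minimal sketch (the token strings and threshold here are illustrative, not the paper’s exact setup):

```python
def conditional_corpus(segments, reward_fn, threshold=0.0,
                       good="<|good|>", bad="<|bad|>"):
    """Prepend a control token reflecting each segment's reward score.
    At generation time, the model is conditioned on the `good` token."""
    return [(good if reward_fn(seg) >= threshold else bad) + seg for seg in segments]
```

  Unlike filtering, this keeps the low-reward data in the corpus – the model still learns from it, but under a distinguishable conditioning signal.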

Large Language Models as Optimizers
  • Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications.
  • This paper by Yang et al. from Google DeepMind proposes Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step.
  • They first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy.
  • The following table from the paper illustrates an overview of the OPRO framework. Given the meta-prompt as the input, the LLM generates new solutions to the objective function, then the new solutions and their scores are added into the meta-prompt for the next optimization step. The meta-prompt contains the solution-score pairs obtained throughout the optimization process, as well as a natural language description of the task and (in prompt optimization) a few exemplars from the task. See the below figure from the paper for a sample meta-prompt for prompt optimization.

  • The following table from the paper illustrates an example of the meta-prompt for prompt optimization with instruction-tuned PaLM 2-L (PaLM 2-L-IT) on GSM8K, where the generated instruction will be prepended to the beginning of “A:” in the scorer LLM output (A_begin in Section 4.1). <INS> denotes the position where the generated instruction will be added. The blue text contains solution-score pairs; the purple text describes the optimization task and output format; the orange text are meta-instructions.

  • With a variety of LLMs, they demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.
  • In terms of limitations, OPRO is designed for neither outperforming the state-of-the-art gradient-based optimization algorithms for continuous mathematical optimization, nor surpassing the performance of specialized solvers for classical combinatorial optimization problems such as TSP. Instead, the goal is to demonstrate that LLMs are able to optimize different kinds of objective functions simply through prompting, and reach the global optimum for some small-scale problems. Their evaluation reveals several limitations of OPRO for mathematical optimization. Specifically, the length limit of the LLM context window makes it hard to fit large-scale optimization problem descriptions in the prompt, e.g., linear regression with high-dimensional data, and traveling salesman problems with a large set of nodes to visit. In addition, the optimization landscape of some objective functions are too bumpy for the LLM to propose a correct descending direction, causing the optimization to get stuck halfway.
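  • The OPRO loop can be sketched as follows – a toy Python version where `propose_fn` stands in for the LLM call and the meta-prompt is just the top solution–score pairs (names and formatting are illustrative assumptions, not the paper’s prompts):

```python
def opro_optimize(score_fn, propose_fn, seed_solutions, steps=20, keep=8):
    """OPRO-style loop: keep the best solution-score pairs in a meta-prompt, ask the
    'LLM' (propose_fn) for a new candidate, score it, and re-rank."""
    pool = [(score_fn(s), s) for s in seed_solutions]
    pool.sort(key=lambda t: t[0], reverse=True)
    for _ in range(steps):
        meta_prompt = "\n".join(f"text: {s}\nscore: {v}" for v, s in pool[:keep])
        cand = propose_fn(meta_prompt)
        pool.append((score_fn(cand), cand))
        pool.sort(key=lambda t: t[0], reverse=True)
        del pool[keep:]                      # keep only the best solutions
    return pool[0][1]
```

  Showing past solutions with their scores is what lets the optimizer-LLM infer a “descent direction” purely from the prompt.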
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  • The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators.
  • This paper by Liu et al. presents G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm, to assess the quality of NLG outputs.
  • The following table from the paper illustrates the overall framework of G-Eval. They first input Task Introduction and Evaluation Criteria to the LLM, and ask it to generate a CoT of detailed Evaluation Steps. Then they use the prompt along with the generated CoT to evaluate the NLG outputs in a form-filling paradigm. Finally, they use the probability-weighted summation of the output scores as the final score.

  • They experiment with two generation tasks, text summarization and dialogue generation. They show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
  • They also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts.
  • Code
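  • The probability-weighted scoring step is easy to sketch – assuming access to the log-probabilities of the candidate rating tokens (e.g., "1" through "5") from the backbone model:

```python
import math

def g_eval_score(rating_logprobs):
    """Probability-weighted sum over candidate rating tokens.
    `rating_logprobs` maps each rating string to its log-probability."""
    probs = {int(r): math.exp(lp) for r, lp in rating_logprobs.items()}
    z = sum(probs.values())                         # renormalize over the rating tokens
    return sum(r * p for r, p in probs.items()) / z
```

  Weighting by token probabilities yields a continuous score, which correlates better with human judgments than taking the single most likely rating.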
Chain-of-Verification Reduces Hallucination in Large Language Models
  • Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models.
  • This paper by Dhuliawala et al. from Meta AI and ETH Zurich studies the ability of language models to deliberate on the responses they give in order to correct their mistakes.
  • They develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response.
  • The following table from the paper illustrates the Chain-of-Verification (CoVe) method. Given a user query, a large language model generates a baseline response that may contain inaccuracies, e.g. factual hallucinations. The figure shows a query which failed for ChatGPT (see section 9 of the paper for more details). To improve this, CoVe first generates a plan of a set of verification questions to ask, and then executes that plan by answering them and hence checking for agreement. They find that individual verification questions are typically answered with higher accuracy than the original accuracy of the facts in the original longform generation. Finally, the revised response takes into account the verifications. The factored version of CoVe answers verification questions such that they cannot condition on the original response, avoiding repetition and improving performance.

  • Via experiments, they show that CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
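  • The four CoVe steps can be sketched as a pipeline – here `llm(prompt) -> str` is an assumed stand-in for the model call, and the prompt strings are illustrative, not the paper’s templates (this follows the factored variant, where verification answers never see the draft):

```python
def chain_of_verification(llm, query):
    """(i) draft, (ii) plan verification questions, (iii) answer them independently,
    (iv) generate the final verified response."""
    draft = llm(f"Answer the question: {query}")
    plan = llm(f"Write verification questions to fact-check this draft:\n{draft}")
    questions = [q.strip() for q in plan.splitlines() if q.strip()]
    answers = [llm(f"Answer independently: {q}") for q in questions]  # draft excluded
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(f"Revise the draft using the verifications.\n"
               f"Question: {query}\nDraft:\n{draft}\nVerifications:\n{checks}")
```

  Keeping the draft out of step (iii)’s context is the crux: it prevents the verifier from simply repeating the draft’s hallucinations.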
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
  • This paper by Chen et al. from CUHK and MIT presents LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) during fine-tuning, with limited computation cost.
  • Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048.
  • LongLoRA speeds up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention (\(S^2\)-attention) effectively enables context extension, leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference.
  • \(S^2\)-attention splits the context into groups and only attends within each group. Tokens are shifted between groups in different heads to enable information flow. This approximates full attention but is much more efficient.
  • On the other hand, they revisit the parameter-efficient fine-tuning regime for context expansion. Notably, they find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on Llama 2 models from 7B/13B to 70B.
  • LongLoRA extends Llama 2 7B from a 4k context to 100k, or Llama 2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models’ context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2.
  • In addition, to make LongLoRA practical, they collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.
  • In summary, the key ideas are:
    1. Shift Short Attention (S2-Attn): During fine-tuning, standard full self-attention is very costly for long contexts. S2-Attn approximates the full attention using short sparse attention within groups of tokens. It splits the sequence into groups, computes attention in each group, and shifts the groups in half the heads to allow information flow. This is inspired by Swin Transformers. S2-Attn enables efficient training while allowing full attention at inference.
    2. Improved LoRA: Original LoRA only adapts attention weights. For long contexts, the gap to full fine-tuning is large. LongLoRA shows embedding and normalization layers are key. Though small, making them trainable bridges the gap.
    3. Compatibility with optimizations like FlashAttention-2: As S2-Attn resembles pre-training attention, optimizations like FlashAttention-2 still work at both train and inference. But many efficient attention mechanisms have large gaps to pre-training attention, making fine-tuning infeasible.
    4. Evaluation: LongLoRA extends the context of Llama 2 7B to 100k tokens, 13B to 64k tokens, and 70B to 32k tokens on one 8x A100 machine. It achieves strong perplexity compared to full fine-tuning baselines, while being much more efficient. For example, for Llama 2 7B with 32k context, LongLoRA reduces training time from 52 GPU hours to 24 hours.
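  • The shift operation behind \(S^2\)-attention (item 1 above) can be sketched in NumPy – an illustrative toy with per-group softmax attention, causal masking omitted for brevity, and the sequence length assumed divisible by the group size:

```python
import numpy as np

def shift_short_attention(q, k, v, group_size):
    """Attend only within token groups; in half the heads, roll tokens by half a
    group so information flows across group boundaries. Shapes: (heads, T, d)."""
    H, T, d = q.shape
    out = np.empty_like(v)
    shift = group_size // 2
    for h in range(H):
        s = shift if h >= H // 2 else 0           # shift half of the heads
        qh, kh, vh = (np.roll(x[h], -s, axis=0) for x in (q, k, v))
        oh = np.empty_like(vh)
        for g in range(0, T, group_size):
            sl = slice(g, g + group_size)
            a = qh[sl] @ kh[sl].T / np.sqrt(d)    # attention within the group only
            a = np.exp(a - a.max(axis=-1, keepdims=True))
            a /= a.sum(axis=-1, keepdims=True)
            oh[sl] = a @ vh[sl]
        out[h] = np.roll(oh, s, axis=0)           # undo the shift
    return out
```

  Because each group attends only to itself, the cost is linear in sequence length per group count, while the shifted heads supply the cross-group information flow.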
  • The following table from the paper illustrates an overview of LongLoRA designs. LongLoRA introduces shift short attention during finetuning. The trained model can retain its original standard self-attention during inference. In addition to plain LoRA weights, LongLoRA additionally makes embedding and normalization layers trainable, which is essential to long context learning, but takes up only a small proportion of parameters.

  • The following table from the paper shows a performance and efficiency comparison between full fine-tuning, plain LoRA, and our LongLoRA. They fine-tune LLaMA2 7B on various context lengths, with FlashAttention-2 and DeepSpeed stage 2. Perplexity is evaluated on the Proof-pile test set. Plain LoRA baseline spends limited GPU memory cost, but its perplexity gets worse as the context length increases. LongLoRA achieves comparable performance to full fine-tuning while the computational cost is much less.

Mass-Editing Memory in a Transformer
  • Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations.
  • This paper by Meng et al. in ICLR 2023 develops MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of associations for GPT-J (6B) and GPT-NeoX (20B), exceeding prior work by orders of magnitude.
  • The following table from the paper illustrates the fact that MEMIT modifies transformer parameters on the critical path of MLP-mediated factual recall. We edit stored associations based on observed patterns of causal mediation: (a) first, the early-layer attention modules gather subject names into vector representations at the last subject token \(S\). (b) Then MLPs at layers \(l \in \mathcal{R}\) read these encodings and add memories to the residual stream. (c) Those hidden states are read by attention to produce the output. (d) MEMIT edits memories by storing vector associations in the critical MLPs.

  • The following table from the paper shows the MEMIT update process. They first (i) replace \(h_i^l\) with the vector \(z_i\) and optimize Eqn. 16 in the paper so that it conveys the new memory. Then, after all \(z_i\) are calculated, they (ii) iteratively insert a fraction of the residuals for all \(z_i\) over the range of critical MLP modules, executing each layer’s update by applying Eqn. 14 in the paper. Because changing one layer will affect activations of downstream modules, they recollect activations after each iteration.

MTEB: Massive Text Embedding Benchmark
  • Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation.
  • To solve this problem, Muennighoff et al. introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
  • Through the benchmarking of 33 models on MTEB, they establish the most comprehensive benchmark of text embeddings to date. The following figure from the paper shows an overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade.

  • They find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.
Language Modeling Is Compression
  • It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors.
  • This paper by Delétang et al. from DeepMind, Meta AI, and Inria, advocates for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models.
  • They empirically investigate the lossless compression capabilities of foundation models. To that end, they review how to compress with predictive models via arithmetic coding and call attention to the connection between current language modeling research and compression.
  • The following figure from the paper shows arithmetic encoding of the sequence ‘AIXI’ with a probabilistic (language) model \(P\) (both in blue) resulting in the binary code ‘0101001’ (in green). Arithmetic coding compresses data by assigning unique intervals to symbols based on the probabilities assigned by \(P\). It progressively refines these intervals to output compressed bits, which represent the original message. To decode, arithmetic coding initializes an interval based on the received compressed bits. It iteratively matches intervals with symbols using the probabilities given by \(P\) to reconstruct the original message.

  • They show that foundation models, trained primarily on text, are general-purpose compressors due to their in-context learning abilities. In other words, large language models are powerful general-purpose predictors and the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. Specifically, they provide a novel view on scaling laws, showing that the dataset size provides a hard limit on model size in terms of compression performance and that scaling is not a silver bullet. They also demonstrate that tokenization, which can be viewed as a pre-compression, does not, in general, improve compression performance, but allows models to increase the information content in their context and is thus generally employed to improve prediction performance. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
  • They leverage the compression-prediction equivalence to employ compressors as generative models and visually illustrate the performance of the underlying compressor.
  • Finally, they show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
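  • The interval-refinement idea of arithmetic coding (illustrated in the ‘AIXI’ figure above) can be sketched with a static symbol model – a toy using floating-point intervals and returning the interval midpoint in place of a bitstream (real coders use integer arithmetic and emit bits incrementally):

```python
def arithmetic_encode(msg, probs):
    """Shrink [0, 1) around each symbol's sub-interval; return a point in the
    final interval as the 'code'. `probs` is a dict of symbol -> probability."""
    lo, hi = 0.0, 1.0
    for sym in msg:
        span, c = hi - lo, 0.0
        for s, p in probs.items():
            if s == sym:
                lo, hi = lo + span * c, lo + span * (c + p)
                break
            c += p
    return (lo + hi) / 2

def arithmetic_decode(x, probs, n):
    """Replay the same interval refinement, matching which sub-interval contains x."""
    out, lo, hi = [], 0.0, 1.0
    for _ in range(n):
        span, c = hi - lo, 0.0
        for s, p in probs.items():
            if lo + span * c <= x < lo + span * (c + p):
                out.append(s)
                lo, hi = lo + span * c, lo + span * (c + p)
                break
            c += p
    return "".join(out)
```

  Swapping the static `probs` for a language model’s next-token distribution at each step is precisely the prediction-compression equivalence the paper builds on.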
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
  • Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules.
  • This paper by Manakul et al. from Cambridge in EMNLP 2023 proposes “SelfCheckGPT”, a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database.
  • SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses (i.e., token sampling methods such as top-p/top-k sampling or beam search, adjusting the softmax temperature, etc.) are likely to diverge and contradict one another.
  • The following figure from the paper illustrates SelfCheckGPT with Prompt. Each LLM-generated sentence is compared against stochastically generated responses with no external database. A comparison method can be, for example, through LLM prompting as shown above.

  • They investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages.
  • They demonstrate that SelfCheckGPT can: (i) detect non-factual and factual sentences; and (ii) rank passages in terms of factuality.
  • They compare SelfCheckGPT to several baselines and show that their approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.
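The core consistency idea can be sketched with a crude lexical-overlap scorer; the paper's actual variants use BERTScore, question answering, n-gram models, NLI, or LLM prompting, so the helper below is a hypothetical stand-in:

```python
def consistency_score(sentence: str, sampled_responses: list) -> float:
    """Toy stand-in for SelfCheckGPT's consistency check: average fraction
    of the sentence's content words that also appear in each stochastically
    sampled response. A low score suggests the claim is unsupported by the
    samples, i.e., a likely hallucination."""
    content = {w.lower().strip(".,;:") for w in sentence.split() if len(w) > 3}
    if not content or not sampled_responses:
        return 0.0
    overlaps = []
    for response in sampled_responses:
        resp_words = {w.lower().strip(".,;:") for w in response.split()}
        overlaps.append(len(content & resp_words) / len(content))
    return sum(overlaps) / len(overlaps)
```

A sentence consistent with the samples scores near 1.0, while a contradicted or unsupported sentence scores much lower, which is exactly the signal the paper thresholds on.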
Zephyr: Direct Distillation of LM Alignment
  • This paper by Tunstall et al. from Huggingface introduces a technique termed distilled direct preference optimization (dDPO), designed to align a small language model (LM) to user intent via distillation, eliminating the need for human feedback. Furthermore, the study presents a 7B parameter language model named Zephyr, which is specifically tailored to align with user intent. Their approach has three main steps:
    1. Distilled Supervised Fine-Tuning (dSFT): They first fine-tune the base 7B Mistral model using the UltraChat dataset, which contains 1.4M dialogues generated by having a large proprietary teacher model like GPT-3.5 Turbo converse with itself. This provides a strong initialization for the student model.
    2. AI Feedback (AIF) Collection: An ensemble of diverse open chat models (e.g., Claude, Falcon) is used to generate responses to prompts from the UltraFeedback dataset. These responses are then scored by a powerful teacher model like GPT-4. The top-scoring response is taken as the “chosen” response and a random lower-scoring response as the “rejected” response. This provides training pairs of good vs. bad responses.
    3. Distilled Direct Preference Optimization (dDPO): The dSFT model is further optimized by training it to rank the “chosen” responses higher than “rejected” responses from the AIF collection step. This is done by directly optimizing a preference likelihood objective on the static AIF data without needing to sample from the model during training.
  • They apply this approach to train Zephyr-7B, starting from Mistral-7B. First dSFT using UltraChat (1.4M examples from GPT-3.5), then AIF from UltraFeedback (64K prompts ranked by GPT-4), then dDPO.
  • Results:
    • Zephyr-7B sets a new SOTA for alignment and conversational ability compared to other 7B models on MT-Bench (7.34 score) and AlpacaEval (90.6% win rate), surpassing prior best dSFT and PPO distillation methods.
    • It matches (and in some cases, even outperforms) the performance of 70B RLHF models like LLaMA2 on MT-Bench.
    • Ablations show dSFT is necessary before dDPO, and that dDPO continues to improve downstream performance even after it begins to overfit the preference data.
  • The key technical innovation is direct distillation of preferences without human involvement, through dSFT then dDPO, achieving strong alignment for small 7B models.
  • Key advantages are that it requires no human labeling or feedback, scales easily to larger models, and can be trained in just a few hours on commercially available hardware. Limitations are potential biases inherited from the teacher models and lack of safety considerations. Overall, it demonstrates the surprising efficacy of distillation and preference learning for aligning smaller open models.
  • The image below (source) gives a graphical sense of Zephyr’s performance on tasks as compared with other prevalent LLMs.

  1. Start with the strongest pretrained model you can find: Mistral 7B is by far the strongest 7B pretrained model.
  2. Scale human-preference annotations: Several studies have shown that on many tasks GPT-4 is on par with average human annotators, while making scalable annotation as easy as an API call. The Hugging Face H4 team started from the largest and most diverse public GPT-4 preference annotation dataset: UltraFeedback.
  3. Drop Reinforcement Learning in favor of DPO (Direct Preference Optimization): While using RL with LLMs is definitely much easier than the struggles of getting deep RL to work from scratch, DPO removes RL from preference training entirely and directly optimizes the policy on the preference data, a much more stable training procedure in the H4 team’s experiments.
  4. Don’t be scared of overfitting on the preference dataset: This is maybe the most counter-intuitive result of the work. While the train/test loss of DPO training shows signs of overfitting on the feedback dataset after just one epoch, training further still shows significant improvements on downstream tasks, even up to 3 epochs, without signs of performance regression.
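The dDPO step optimizes the standard DPO objective on the static AI-feedback pairs. A minimal sketch of the per-pair loss, assuming the inputs are summed token log-probabilities of each response under the policy and under the frozen dSFT reference model:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective for one preference pair:
    -log sigmoid(beta * [(policy - reference) margin on the chosen response
                         minus the same margin on the rejected response]).
    No sampling from the model is needed during training: the loss is
    computed directly on the static chosen/rejected pairs."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference on both responses the margin is zero and the loss is \(\log 2\); widening the chosen-vs-rejected margin drives the loss down, which is the whole training signal.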
Hugging Face’s Alignment Handbook
  • The Alignment Handbook contains robust recipes to align language models with human and AI preferences. It also contains code to train your very own Zephyr models:
    • Full fine-tuning with Microsoft’s DeepSpeed ZeRO-3 on A100s
    • LoRA or QLoRA fine-tuning on consumer GPUs

  • Dataset from Hugging Face called No Robots: 10k instructions and demonstrations to train instruct models, modeled on the SFT dataset from OpenAI’s InstructGPT paper. 100% organic and written entirely by skilled human annotators.
Evaluating Large Language Models: A Comprehensive Survey
  • This paper by Guo et al. from Tianjin University offers a comprehensive survey providing an in-depth analysis of evaluating large language models (LLMs).
  • The paper categorizes LLM evaluation into three key domains: knowledge and capability evaluation, alignment evaluation, and safety evaluation, addressing the need for rigorous assessment across various tasks and applications.
  • The following figure from the paper illustrates the proposed taxonomy of major categories and sub-categories of LLM evaluation.

  • In-depth exploration of knowledge and capability evaluation includes question answering, knowledge completion, reasoning, and tool learning, highlighting LLMs’ growing sophistication in handling diverse information processing tasks.
  • Alignment evaluation focuses on ethics, bias, toxicity, and truthfulness, critical for ensuring LLM outputs align with societal values and user expectations.
  • Safety evaluation examines robustness and risks associated with LLM deployment, emphasizing the importance of secure and reliable model performance in real-world applications.
  • The survey also covers specialized evaluations in fields like biology, medicine, education, legislation, computer science, and finance, demonstrating the broad applicability and impact of LLMs.
  • Future directions suggest enhanced evaluation methods, including dynamic, agent-oriented, and risk-focused assessments, to guide responsible LLM development and maximize societal benefits.
Tamil-LLaMA: A New Tamil Language Model Based on LLaMA 2
  • This paper by Abhinand Balachandran introduces Tamil-LLaMA, an enhancement of the open-source LLaMA model, tailored for Tamil language processing.
  • Tokenization is a crucial aspect of enhancing the model’s proficiency in Tamil. A key step is the integration of an additional 16,000 Tamil tokens into the LLaMA model’s vocabulary, which allows a more accurate and nuanced representation of the language and improves the model’s ability to understand and generate Tamil text. The expanded tokenization specifically addresses the unique linguistic features of Tamil, making the model more effective on tasks involving the language.
  • The approach uses the Low-Rank Adaptation (LoRA) methodology for efficient model training, focusing on a comprehensive Tamil corpus. This ensures computational feasibility while enhancing the model’s robustness in text generation.
  • Tamil-LLaMA utilizes datasets like CulturaX for pre-training and a Tamil-translated version of the Alpaca dataset, along with a subset of the OpenOrca dataset, for instruction fine-tuning.
  • Key contributions include the expansion of LLaMA’s vocabulary with 16,000 Tamil tokens, training on a comprehensive Tamil dataset, and presenting Tamil-translated versions of Alpaca and OpenOrca datasets for instruction fine-tuning.
  • Tamil LLaMA outperforms its predecessors and other open-source models in tasks specific to the Tamil language, demonstrating significant advancements in performance. The paper presents results from instruction tasks, showing Tamil-LLaMA’s superior performance in areas like reasoning, translation, code generation, and open question answering. It surpasses GPT-3.5-turbo in many tasks, according to evaluations using GPT-4.
  • Performance comparison on the IndicSentiment-7B dataset (left) and the IndicGLUE Text Classification (right).

  • The paper emphasizes the importance of language diversity in LLMs and contributes to advancing language models for Indian languages, with public access to models, datasets, and code to foster further research.
  • The table below shows a list of available models:

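The vocabulary-extension step can be sketched as below; with Hugging Face transformers the real equivalent is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, but the toy version makes the bookkeeping explicit (the 0.02 init scale and token strings are illustrative assumptions):

```python
import random

def extend_vocab(vocab: dict, embeddings: list, new_tokens: list,
                 dim: int, seed: int = 0):
    """Toy version of Tamil-LLaMA's vocabulary extension: append new
    (e.g., Tamil) tokens to the tokenizer vocab and grow the embedding
    matrix with freshly initialized rows, leaving existing rows intact."""
    rng = random.Random(seed)
    for tok in new_tokens:
        if tok not in vocab:                      # skip tokens already present
            vocab[tok] = len(vocab)               # next free id
            embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings
```

The new rows are then trained (here, via LoRA on the Tamil corpus) so the model learns useful representations for the added tokens.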
Think before you speak: Training Language Models With Pause Tokens
  • This paper by Goyal from Carnegie Mellon University and Google Research introduces a novel training method for language models using pause tokens.
  • The concept involves appending learnable pause tokens to the input during both pretraining and downstream finetuning, allowing the model additional computation time before generating responses.
  • The following figure from the paper illustrates standard vs. pause-inference (and finetuning). We consider a downstream task where, given a prefix, the decoder-only model (bidirectionally) attends to all of the prefix to generate its target answer. The rounded squares denote one Transformer operation (a self-attention and MLP) in a 2-layer Transformer. Any “Ignore Output” denotes that during inference, the corresponding output token is not extracted and thus, not fed back autoregressively; during finetuning, this output is not backpropagated through. The connecting lines denote some (not all) of the “computational pathways” within the model. Specifically, we visualize only those pathways that begin at a specific token in the prefix (here arbitrarily chosen to be “4 is”) and end at an output token (here arbitrarily chosen to be “25+”). All differences between the two settings are highlighted in color. (a) In standard inference (finetuning), the model’s output is extracted immediately upon seeing the last prefix token. (b) In pause-inference (and pause-finetuning), this is initiated only after appending a manually specified number of <pause> tokens. This introduces new computational pathways (the colored lines) between the prefix token and the output token of interest.

  • The following figure from the paper illustrates standard vs. pause-pretraining. We consider pretraining based on causal language modeling, where each token is predicted given all preceding tokens in the sequence, using unidirectional self-attention. Here, we visualize the computational pathways beginning from the token “is” on the input side of the decoder-only model, to a subsequent token “soccer” on the output side. Please see the above figure for a guide on how to follow this visualization. (a) In standard pretraining, we compute the model’s loss at each output token, and backpropagate through it. (b) In pause-pretraining, we insert multiple copies of <pause> tokens at uniformly random locations in the input. However, we do not apply a loss on the model to predict these tokens, as indicated by each corresponding Ignore Output flags. This introduces new computational pathways connecting the input token and the output token of interest.

  • This method demonstrates significant improvements in various tasks, notably an 18% increase in Exact Match score on the SQuAD question-answering task and 8% on CommonSenseQA.
  • The paper reveals that the gains are most pronounced when pause tokens are used during both pretraining and finetuning, with lesser improvements observed when used only during finetuning.
  • The approach alters the traditional immediate next-token prediction in language models, introducing a new paradigm – delayed next-token prediction – that offers enhanced performance on complex language tasks.
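Mechanically, the method only changes how inputs and the loss mask are built. A sketch of the two pieces, with token ids and the `<pause>` id as placeholder values:

```python
PAUSE = -1  # stand-in id for the learnable <pause> token

def append_pauses(prefix_ids: list, num_pauses: int) -> list:
    """Pause-inference: append <pause> tokens after the prefix so the
    model gets extra forward-pass computation before its answer is
    extracted (delayed next-token prediction)."""
    return prefix_ids + [PAUSE] * num_pauses

def loss_mask(input_ids: list) -> list:
    """Pause-(pre)training detail: outputs at <pause> positions carry an
    'Ignore Output' flag, i.e., no loss is applied at those positions."""
    return [0 if tok == PAUSE else 1 for tok in input_ids]
```

During pause-pretraining the paper inserts pauses at random positions rather than only at the end, but the masking logic is the same.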
YaRN: Efficient Context Window Extension of Large Language Models
  • This paper by Peng et al. from Nous Research, EleutherAI, and the University of Geneva, proposes Yet Another RoPE extensioN method (YaRN) to efficiently extend the context window of transformer-based language models using Rotary Position Embeddings (RoPE).
  • The authors address the limitation of transformer-based language models, specifically their inability to generalize beyond the sequence length they were trained on. YaRN demonstrates a compute-efficient way to extend the context window of such models, requiring significantly fewer tokens and training steps compared to previous methods.
  • YaRN enables LLaMA models to effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow. This method surpasses previous state-of-the-art approaches in context window extension.
  • The paper details various technical aspects of YaRN, including its capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been reproduced online, supporting context lengths up to 128k.
  • YaRN introduces an innovative technique known as “Dynamic NTK” (Neural Tangent Kernel) interpolation, which dynamically rescales the rotary position embeddings used by the attention mechanism as the sequence grows. This dynamic scaling lets the model handle contexts far longer than its original pre-training would allow, without extensive retraining and with significantly reduced computational resources.
  • By adapting the position encoding to extended contexts on the fly, Dynamic NTK interpolation ensures the model can effectively utilize and extrapolate to much longer inputs while optimizing the use of available resources, and it is a key reason YaRN handles extended context windows with improved efficiency and performance.
  • Additionally, YaRN incorporates a temperature parameter that affects the perplexity across different data samples and token positions within the extended context window. Adjusting this temperature parameter modifies the attention mechanism, enhancing the model’s ability to handle extended context lengths efficiently.
  • Extensive experiments demonstrate YaRN’s efficacy. For instance, it achieves context window extension of language models with RoPE as the position embedding, using only about 0.1% of the original pre-training corpus, a significant reduction in computational resources.
  • Evaluations focus on several aspects, such as perplexity scores of fine-tuned models with extended context windows, the passkey retrieval task, and performance on standardized LLM benchmarks. YaRN models show strong performance across all contexts, effectively extending the context window of LLaMA 2 models to 128k. The following figure from the paper shows the sliding window perplexity (S = 256) of ten 128k Proof-pile documents truncated to evaluation context window size.

  • The paper concludes that YaRN improves upon all existing RoPE interpolation methods and acts as a highly efficient drop-in replacement. It preserves the original abilities of fine-tuned models while attending to very large context sizes and allows for efficient extrapolation and transfer learning under compute-constrained scenarios.
  • The research illustrates YaRN as a significant advancement in extending the context window of large language models, offering a more compute-efficient approach with broad implications for model training and performance.
  • Code.
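One ingredient of YaRN, the “NTK-aware” base rescaling, can be sketched directly from the RoPE frequency formula, with the dynamic variant simply recomputing the scale from the current sequence length. This is a partial sketch: YaRN proper additionally interpolates “by parts” across frequency bands and applies the attention-temperature correction described above.

```python
def rope_inv_freqs(dim: int, base: float = 10000.0) -> list:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_inv_freqs(dim: int, scale: float, base: float = 10000.0) -> list:
    """'NTK-aware' interpolation: rescale the RoPE base so low frequencies
    stretch to cover a scale-times-longer context while the highest
    frequency (i = 0) is left untouched."""
    new_base = base * scale ** (dim / (dim - 2))
    return [new_base ** (-2 * i / dim) for i in range(dim // 2)]

def dynamic_scale(current_len: int, train_len: int) -> float:
    """'Dynamic' variant: no rescaling until the sequence actually exceeds
    the original training length."""
    return max(1.0, current_len / train_len)
```

Note how the lowest frequency shrinks (its wavelength stretches) under scaling while the highest frequency is unchanged, which is the core intuition behind NTK-aware interpolation.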
StarCoder: May the Source Be with You!
  • The BigCode community, an open-scientific collaboration, introduces StarCoder and StarCoderBase: Large Language Models (LLMs) for code, each with 15.5 billion parameters, 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention.
  • StarCoderBase was trained on 1 trillion tokens from The Stack, a large collection of permissively licensed GitHub repositories, covering over 80 programming languages, GitHub issues, Git commits, and Jupyter notebooks. StarCoder was fine-tuned on an additional 35 billion Python tokens.
  • These models exhibit novel architectural features like an 8K context length, Fill-in-the-Middle (FIM) capabilities, and Multi-Query-Attention (MQA). An extensive evaluation of these models was conducted, showcasing their superiority over other open Code LLMs in handling multiple programming languages and even matching or surpassing the OpenAI code-cushman-001 model.
  • StarCoder, when fine-tuned on Python, significantly outperforms other Python-tuned LLMs and, with its 8K token context, can function as a virtual technical assistant without requiring instruction-tuning or Reinforcement Learning from Human Feedback (RLHF).
  • Significant steps were taken towards ensuring a safe open model release. StarCoder is released under the OpenRAIL-M license, promoting transparency and commercial viability. The release includes an integrated attribution tool in the VSCode demo for detecting and locating model generations potentially copied from the training set. Additionally, a robust Personally Identifiable Information (PII) detection model, StarEncoder, was developed to enhance privacy protection, utilizing a dataset containing 12,000 files with 22,950 annotated entities.
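The Fill-in-the-Middle capability comes from a simple training-time transform: split each document at two random points and move the middle to the end, delimited by sentinel tokens, so the model learns infilling from ordinary left-to-right training. A sketch (the sentinel strings follow StarCoder's tokenizer, but treat the exact names as an assumption here):

```python
import random

def make_fim_example(code: str, rng: random.Random) -> str:
    """FIM training transform: pick two random split points, then emit
    prefix and suffix before the middle. The model is trained to generate
    the middle conditioned on both surrounding spans."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```

At inference time the same format is used with the middle left empty, and the model's continuation after `<fim_middle>` is the infilled code.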
Let’s Verify Step by Step
  • This paper by Lightman et al. from OpenAI presents a detailed investigation into the effectiveness of process supervision compared to outcome supervision in training language models for complex multi-step reasoning.
  • The authors explore the concepts of outcome and process supervision. Outcome-supervised reward models (ORMs) focus on the final result of a model’s reasoning chain, while process-supervised reward models (PRMs) receive feedback at each step in the reasoning chain.
  • To collect process supervision data, they present human data-labelers with step-by-step solutions to MATH problems sampled by the large-scale generator. Their task is to assign each step in the solution a label of positive, negative, or neutral, as shown in the below figure. A positive label indicates that the step is correct and reasonable. A negative label indicates that the step is either incorrect or unreasonable. A neutral label indicates ambiguity. In practice, a step may be labelled neutral if it is subtly misleading, or if it is a poor suggestion that is technically still valid. Neutral labels allow them to defer the decision about how to handle ambiguity: at test time, neutral labels can be treated as either positive or negative. The following figure from the paper shows a screenshot of the interface used to collect feedback for each step in a solution.

  • The following figure from the paper shows two solutions to the same problem, graded by the PRM. The solution on the left is correct while the solution on the right is incorrect. A green background indicates a high PRM score, and a red background indicates a low score. The PRM correctly identifies the mistake in the incorrect solution.

  • For their experiments, they used large-scale models fine-tuned from GPT-4 and smaller models for detailed comparisons. These models were trained on the MATH dataset, which includes complex mathematical problems.
  • The paper introduces a new dataset, PRM800K, comprising 800,000 step-level human feedback labels, which was instrumental in training their PRM models.
  • The key findings show that process supervision significantly outperforms outcome supervision in training models to solve complex problems. Specifically, their PRM model solved 78.2% of problems from a representative subset of the MATH test set.
  • The researchers also demonstrate that active learning significantly improves the efficiency of process supervision, leading to better data utilization.
  • They conducted out-of-distribution generalization tests using recent STEM tests like AP Physics and Calculus exams, where the PRM continued to outperform other methods.
  • The paper discusses the implications of their findings for AI alignment, highlighting the advantages of process supervision in producing more interpretable and aligned models.
  • They acknowledge potential limitations related to test set contamination but argue that the relative comparisons made in their work are robust against such issues.
  • This research contributes to the field by showing the effectiveness of process supervision and active learning in improving the reasoning capabilities of language models, especially in complex domains like mathematics.
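The scoring difference between the two supervision styles is easy to state in code: the PRM scores a solution by combining its per-step correctness probabilities (the paper aggregates step probabilities, e.g., as a product), while an ORM emits a single probability for the final answer. A sketch, with best-of-N reranking on top:

```python
import math

def prm_score(step_probs: list) -> float:
    """Process-supervised score: probability that every step is correct,
    taken as the product of per-step P(correct)."""
    return math.prod(step_probs)

def orm_score(final_prob: float) -> float:
    """Outcome-supervised score: a single probability for the final answer."""
    return final_prob

def rank_by_prm(solutions: list) -> list:
    """Best-of-N selection: keep the sampled solution whose reasoning
    steps the PRM trusts most."""
    return max(solutions, key=prm_score)
```

A single low-confidence step tanks the PRM score, which is why process supervision localizes errors that outcome supervision can miss.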
Scalable Extraction of Training Data from (Production) Language Models
  • This paper by Nasr et al. from Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, and ETH Zurich, investigates “extractable memorization” in language models. It focuses on the ability of an adversary to efficiently extract training data by querying a machine learning model, including open-source models like Pythia and GPT-Neo, semi-open models like LLaMA and Falcon, and closed models like ChatGPT.
  • The paper highlights a new attack method, termed “divergence attack,” which induces ChatGPT to deviate from its typical chatbot-style responses and emit training data at a much higher rate.
  • The paper unifies two major approaches: large-scale studies of total memorized training data and practical attacks for data extraction. It introduces a scalable methodology to detect memorization across trillions of tokens in terabyte-sized datasets. Notably, larger and more capable models were found to be more vulnerable to data extraction attacks.
  • The study first addresses open-source models with publicly available parameters and training sets. It follows a conservative definition of memorization, focusing on verbatim matches of training data as a basis for “extractable memorization.” The study adapts previous data extraction methods and uses a suffix array to efficiently verify if generated outputs are part of the training set.
  • The research applies the attack methodology to nine open-source models. The study revealed a strong correlation between model size and both the rate of emitting memorized output and the total number of unique memorized sequences, with rates of memorization ranging from 0.1% to 1% and the number of unique memorized sequences varying from several hundred thousand to several million.
  • The study finds that the amount of memorization extracted grows nearly linearly with the number of generations, presenting a challenge in estimating total memorization. The rate of extractable memorization was found to decrease with more queries, particularly in smaller models like Pythia 1.4B, compared to larger models like GPT-Neo 6B.
  • A comparison between discoverable and extractable memorization reveals that only a portion of discoverably memorized data is extractable, indicating potential improvements in current extraction techniques. The research suggests that measuring discoverable memorization provides a reasonably tight characterization of data that can be extracted by an adversary.
  • The study then extends its methodology to semi-closed models, where model parameters are available but training datasets and algorithms are not public. For these models, the researchers established a new “ground truth” using an auxiliary dataset built from a large corpus of Internet text, allowing them to verify and quantify extractable memorization.
  • The study confronts the unique challenges posed by aligned models like ChatGPT. A new attack strategy, the “divergence attack,” was developed to cause ChatGPT to diverge from its alignment training and revert to its original language modeling objective. This approach successfully extracted over 10,000 unique verbatim-memorized training examples from ChatGPT using only $200 worth of queries.
  • The following figure from the paper shows the process of extracting pre-training data from ChatGPT. They discover a prompting strategy that causes LLMs to diverge and emit verbatim pre-training examples. The figure shows an example of ChatGPT revealing a person’s email signature, which includes their personal contact information.

  • The following figure from the paper shows the test for memorization in large language models. Models emit more memorized training data as they get larger. The aligned ChatGPT (gpt-3.5-turbo) appears 50\(\times\) more private than any prior model, but they develop an attack that shows it is not. Using their attack, ChatGPT emits training data 150\(\times\) more frequently than with prior attacks, and 3\(\times\) more frequently than the base model.

  • The extracted data from ChatGPT covered a wide range of text sources, including personally identifiable information, NSFW content, literature, URLs, UUIDs, code snippets, research papers, boilerplate text, and merged memorized outputs.
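The verbatim-match check behind "extractable memorization" can be sketched with a k-token window index. The paper uses a suffix array over terabyte-scale corpora for efficiency; a hash set of windows is a small-scale stand-in with the same semantics:

```python
def build_ngram_index(corpus_tokens: list, k: int = 50) -> set:
    """Stand-in for the paper's suffix array: index every k-token window
    of the training corpus for O(1) verbatim-membership checks."""
    return {tuple(corpus_tokens[i:i + k])
            for i in range(len(corpus_tokens) - k + 1)}

def is_memorized(generation_tokens: list, index: set, k: int = 50) -> bool:
    """Flag a generation as extractable memorization if any k-token
    window appears verbatim in the training corpus."""
    return any(tuple(generation_tokens[i:i + k]) in index
               for i in range(len(generation_tokens) - k + 1))
```

The choice of k trades off false positives (short common phrases) against missed near-verbatim copies; the paper's conservative definition requires exact matches.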
Gemini: A Family of Highly Capable Multimodal Models
  • Google’s Gemini series represents a milestone in AI development, featuring three models: Ultra, Pro, and Nano, each tailored for specific tasks ranging from complex problem-solving to on-device operations. Gemini Ultra, the flagship model, excels in demanding tasks and sets new benchmarks in AI performance. Gemini Pro is optimized for a wide range of tasks, while Nano is designed for efficiency in on-device applications. This suite of models, part of Google DeepMind’s vision, marks a significant scientific and engineering endeavor for the company.
  • Gemini models are built with a transformative architecture that allows for a “deep fusion” of modalities, surpassing the capabilities of typical modular AI designs. This integration enables seamless concept transfer across various domains, such as vision and language. The models, trained on TPUs, support a 32k context length and are capable of handling diverse inputs and outputs, including text, vision, and audio. The visual encoder, inspired by Flamingo, and the comprehensive training data, comprising web documents, books, code, and multimedia, contribute to the models’ versatility.
  • The figure below from the paper illustrates that Gemini supports interleaved sequences of text, image, audio, and video as inputs (illustrated by tokens of different colors in the input sequence). It can output responses with interleaved image and text.

  • The training infrastructure for Gemini utilizes Google’s latest TPU v4 and v5e accelerators, ensuring efficient scaling and reliable performance at an unprecedented scale. This advanced setup is integral to handling hardware failures and silent data corruption, ensuring high-quality training outcomes.
  • The training dataset is multimodal and multilingual, with quality and safety filters to enhance model performance. The dataset mix is adjusted during training to emphasize domain-relevant data, contributing to the models’ high performance.
  • Gemini Ultra showcases extraordinary capabilities across various benchmarks, surpassing GPT-4 in areas like coding and reasoning. Its performance in benchmarks like HumanEval and Natural2Code, as well as its superior reasoning capabilities in complex subjects like math and physics, demonstrates its state-of-the-art capabilities. For instance, the figure below from the paper shows solving a geometrical reasoning task. Gemini shows good understanding of the task and is able to provide meaningful reasoning steps despite slightly unclear instructions.

  • Furthermore, in another instance, the figure below from the paper shows Gemini verifying a student’s solution to a physics problem. The model is able to correctly recognize all of the handwritten content and verify the reasoning. On top of understanding the text in the image, it needs to understand the problem setup and correctly follow instructions to generate LaTeX.

  • Gemini outperforms OpenAI’s GPT-4 in 30 out of 32 benchmarks. Furthermore, it’s worth noting that Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding). The following table from Google’s blog shows Gemini surpassing state-of-the-art performance on a range of benchmarks including text and coding.

  • For image understanding, Gemini Ultra sets new standards by outperforming existing models in zero-shot evaluations for OCR-related tasks. Its native multimodality and complex reasoning abilities enable it to excel in interpreting and reasoning with visual information. The following table from Google’s blog shows Gemini surpassing state-of-the-art performance on a range of multimodal benchmarks.

  • Gemini’s training involves Reinforcement Learning from Human Feedback (RLHF), enhancing its performance and capabilities. This advanced training, combined with its innovative architecture and diverse dataset, contributes to its exceptional performance across various tasks.
  • Despite its remarkable capabilities, specific details about Gemini’s architecture, training data, and the size of the Ultra and Pro models remain undisclosed. However, the models represent a significant leap in AI development, driven by the promise of AI to benefit humanity in diverse ways.
  • Safety and responsibility are central to Gemini’s development, with comprehensive safety evaluations for bias and toxicity. Google is collaborating with external experts and partners to stress-test the models and ensure they adhere to robust safety policies, aligning with Google’s AI Principles.
  • Gemini’s capabilities and its development approach reflect Google’s commitment to advancing AI responsibly and ethically, emphasizing safety and collaboration with the industry and broader ecosystem to define best practices and safety benchmarks.
  • Report; Blog.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
  • The Purple Llama initiative by Meta, aimed at promoting responsible and safe development in generative AI, encompasses a variety of tools and evaluations, including the notable Llama-Guard and the Llama 7B model for content moderation.
  • Llama Guard, a component of Meta’s Purple Llama Initiative, is a 7B parameter model based on Llama2, designed to classify content in LLM prompts and responses, enhancing trust and safety in AI applications.
  • It uses a safety risk taxonomy for content moderation, detecting policy violations and indicating the safety level of text, with detailed subcategory violations when necessary.
  • The model is instruction-tuned on a dataset comprising about 13,000 examples, including prompts and responses annotated for safety, with training inputs from the Anthropic dataset and in-house redteaming examples.
  • Llama Guard outperforms existing moderation tools in benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, and is adept at detecting harmful content across various categories.
  • Its functionality includes evaluating probabilities for classifying text as safe or unsafe, and it can generate outputs indicating safety status and policy violations.
  • The following figure from the blog shows example task instructions for the Llama Guard prompt and response classification tasks. A task consists of four main components. Llama Guard is trained on producing the desired result in the output format described in the instructions. It acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe/unsafe, and if unsafe based on a policy, it also lists the violating subcategories. Here is an example:

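  • The verdict format just described (a "safe"/"unsafe" first line, optionally followed by a line of violated subcategory codes) is plain generated text, so downstream code has to parse it. A minimal sketch, with the helper name and the exact code strings assumed for illustration:

```python
def parse_guard_output(text: str) -> dict:
    """Parse a Llama Guard-style verdict: the first line is 'safe' or
    'unsafe', and an 'unsafe' verdict is followed by a line listing
    the violated policy subcategories (e.g. 'O1,O3')."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return {"safe": True, "violations": []}
    violations = lines[1].split(",") if len(lines) > 1 else []
    return {"safe": False, "violations": [v.strip() for v in violations]}
```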
  • Customizable for different use cases, Llama Guard is adaptable for chatbots and digital assistants, offering flexibility without compromising safety.
  • Part of the broader Purple Llama ecosystem, which includes industry collaborations, Llama Guard’s model weights are released for public use and licensed permissively for both research and commercial applications.
  • In summary, Meta’s Purple Llama initiative represents a major advancement in ensuring safe and responsible development in generative AI. By providing a suite of tools, including the 7B-parameter Llama Guard model for content moderation, the initiative addresses the need for comprehensive safety measures in AI applications, fostering an open environment for ongoing innovation in the field.
  • Blog; Model; Notebook; Benchmark.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
  • Training language models typically requires vast quantities of human-generated text, which can be scarce or of variable quality, especially for specialized domains like mathematics or programming. This scarcity limits the model’s ability to learn diverse patterns and hinders its performance. \(ReST_{EM}\) addresses this problem by reducing the reliance on human-curated datasets and instead exploring the potential of fine-tuning models using self-generated data validated through scalar feedback mechanisms.
  • This paper by Singh et al. from Google DeepMind, presented at NeurIPS 2023, explores a new frontier in Large Language Model (LLM) training: Reinforced Self-Training based on expectation-maximization (\(ReST_{EM}\)). This innovative approach aims to reduce reliance on human data while avoiding the pitfalls of a synthetic data death spiral, a trend becoming increasingly evident in LLM training.
  • \(ReST_{EM}\) is a potent alternative to traditional dataset curation, comprising two primary stages: generating multiple output samples (E-step) and fine-tuning the language model on these samples (M-step). This process is cyclically iterated, combining the generation of model-derived answers and their subsequent refinement. The feedback for filtering these outputs is sourced from tasks with binary feedback, such as math problems with clear right or wrong answers.
  • The paper’s focus is on two challenging domains: advanced mathematical problem-solving (MATH) and code generation (APPS). Utilizing PaLM 2 models of various scales, the study demonstrates that \(ReST_{EM}\) significantly outperforms models fine-tuned solely on human-generated data, offering up to 2x performance boosts. This indicates a major step toward more independent AI systems, seeking less human input for skill refinement.
  • \(ReST_{EM}\) employs an iterative self-training process leveraging expectation-maximization. It first generates outputs from the language model, then applies a filtering mechanism based on binary correctness feedback—essentially sorting the wheat from the chaff. Subsequently, the model is fine-tuned using these high-quality, self-generated samples. This cycle is repeated several times, thus iteratively enhancing the model’s accuracy and performance on tasks by self-generating and self-validating the training data.
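  • The generate/filter/fine-tune cycle can be made concrete with a toy stand-in, where a table of per-problem answer weights plays the role of the language model and upweighting the kept samples plays the role of fine-tuning; everything here is illustrative, not the paper's implementation:

```python
import random

def rest_em(problems, candidates, is_correct, iters=3, k=8, seed=0):
    """Toy ReST-EM loop. The 'policy' is a per-problem weight over
    candidate answers. E-step: sample k answers per problem and keep
    only those passing the binary correctness check. M-step: upweight
    the kept answers (a stand-in for fine-tuning on filtered samples)."""
    rng = random.Random(seed)
    w = {p: {c: 1.0 for c in candidates[p]} for p in problems}
    for _ in range(iters):
        kept = []
        for p in problems:                                  # E-step
            cands, weights = zip(*w[p].items())
            for c in rng.choices(cands, weights=weights, k=k):
                if is_correct(p, c):                        # binary feedback
                    kept.append((p, c))
        for p, c in kept:                                   # M-step
            w[p][c] += 1.0
    return w
```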
  • Notably, the experiments revealed diminishing returns beyond a certain number of ReST iterations, suggesting potential overfitting issues. Ablation studies further assessed the impact of dataset size, the number of model-generated solutions, and the number of iterations on the effectiveness of ReST.
  • The models fine-tuned using ReST showed enhanced performance on related but distinct benchmarks like GSM8K, Hungarian HS finals, and Big-Bench Hard tasks, without any noticeable degradation in broader capabilities. This finding underscores the method’s versatility and generalizability.
  • The following figure from the paper shows Pass@K results for PaLM-2-L pretrained model as well as model fine-tuned with \(ReST_{EM}\). For a fixed number of samples \(K\), fine-tuning with \(ReST_{EM}\) substantially improves Pass@K performance. They set temperature to 1.0 and use nucleus sampling with \(p = 0.95\).

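  • For reference, Pass@K in such evaluations is typically computed per problem with the standard unbiased combinatorial estimator, given \(n\) generated samples of which \(c\) are correct; a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per problem of which
    c are correct, estimate the probability that at least one of k
    drawn samples is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```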
  • While ReST offers significant advantages in performance, it necessitates a moderate-sized training set of problems or prompts and access to a manually-designed or learned reward function. It’s highly data-efficient but requires careful application to prevent overfitting.
  • This research opens new avenues for self-improvement in language models, suggesting the need for automating manual parts of the pipeline and exploring algorithmic improvements to further enhance performance. With \(ReST_{EM}\) showing promising results, especially in larger models, one can anticipate further exploration in applying self-training techniques to various other domains beyond math and coding tasks. The significant improvement over fine-tuning on human data implies that future models can be made more efficient, less reliant on extensive datasets, and potentially achieve better performance.
Human-Centered Loss Functions (HALOs)
  • This report by Ethayarajh et al. from Stanford University presents a novel approach to aligning large language models (LLMs) with human feedback, building upon Kahneman & Tversky’s prospect theory. The proposed Kahneman-Tversky Optimization (KTO) loss function diverges from existing methods by not requiring paired preference data, relying instead on the knowledge of whether an output is desirable or undesirable for a given input. This makes KTO significantly easier to deploy in real-world scenarios where such data is more abundant.
  • The report identifies that existing methods for aligning LLMs with human feedback can be seen as human-centered loss functions, which implicitly model some of the distortions in human perception as suggested by prospect theory. By adopting this perspective, the authors derive a HALO that maximizes the utility of LLM generations directly, rather than relying on maximizing the log-likelihood of preferences, as current methods do.
  • The KTO-aligned models were found to match or exceed the performance of direct preference optimization methods across scales from 1B to 30B. One of the key advantages of KTO is its feasibility in real-world applications, as it requires less specific types of data compared to other methods.
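  • As a rough illustration of the prospect-theory flavor, the sketch below shows a KTO-style per-example loss in which desirable outputs are rewarded for rising above a reference point and undesirable ones for falling below it; the reference point z_ref, the beta and lambda coefficients, and the exact functional form are assumptions of this sketch rather than details taken from the report:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(logratio: float, desirable: bool, z_ref: float = 0.0,
             beta: float = 0.1, lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO-style per-example loss sketch. `logratio` is
    log pi_theta(y|x) - log pi_ref(y|x); `desirable` flags whether the
    output was labeled good. Desirable outputs gain value above the
    reference point z_ref, undesirable ones below it, mirroring the
    gain/loss asymmetry of prospect theory."""
    if desirable:
        return lam_d - lam_d * sigmoid(beta * (logratio - z_ref))
    return lam_u - lam_u * sigmoid(beta * (z_ref - logratio))
```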
  • To validate the effectiveness of KTO and understand how alignment scales across model sizes, the authors introduced Archangel, a suite comprising 56 models. These models, ranging from 1B to 30B, were aligned using various methods, including KTO, on human-feedback datasets such as Anthropic HH, Stanford Human Preferences, and OpenAssistant.
  • The following figure from the report illustrates that LLM alignment involves supervised finetuning followed by optimizing a human-centered loss (HALO). However, the paired preferences that existing approaches need are hard to get. Kahneman-Tversky Optimization (KTO) uses a far more abundant kind of data, making it much easier to use in the real world.

  • The report’s experimental findings reveal surprising insights into the scaling and effectiveness of different alignment methods. It was observed that supervised finetuning (SFT) contributes significantly to the performance gains at every scale under 30B. The benefits of combining SFT with alignment methods become apparent at model sizes of around 7B and above. Interestingly, KTO alone was found to be significantly better than DPO (Direct Preference Optimization) alone at scales of 13B and 30B.
  • The practical implications of KTO are notable, especially in contexts where abundant data on customer interactions and outcomes is available, but counterfactual data is scarce. This aspect underscores KTO’s potential for broader application in real-world settings compared to preference-based methods like DPO.
  • Future work suggested by the authors includes exploring a human value function specifically for language, examining differences in model behavior at different scales, and investigating the potential of synthetic data in model alignment with KTO. The report highlights the importance of understanding how human-centered loss functions can influence the alignment of LLMs with human preferences and perceptions.
  • Code
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
  • Large Language Models (LLMs), including InstructGPT, BLOOM, and GPT-4, are deployed in interactive settings like chatbots and writing assistants. These models are vulnerable to prompt hacking (prompt injection and jailbreaking), where they are manipulated into ignoring their original instructions and executing potentially harmful commands.
  • This paper by Schulhoff et al. from University of Maryland, Mila, Towards AI, Stanford University, Technical University of Sofia, University of Milan, NYU, and University of Arizona, aims to understand the vulnerabilities of LLMs to prompt hacking and create a large-scale dataset of adversarial prompts through a global competition.
  • The competition, which yielded 600K+ adversarial prompts against three state-of-the-art LLMs, provides new insights into the types of attacks and their effectiveness.
  • Background and Motivation:
    • The paper highlights the limited research in prompt hacking and the need for a comprehensive understanding of LLM vulnerabilities.
    • It references various studies and efforts to test LLM robustness, including automated and human-driven adversarial approaches.
    • The HackAPrompt competition represents the largest collection of human-written adversarial prompts, aiming to fill the gap in systematic understanding and defense strategies against prompt hacking.
  • Key Intentions of Prompt Hacking:
    • Six major intents identified for prompt hacking: Prompt Leaking, Training Data Reconstruction, Malicious Action Generation, Harmful Information Generation, Token Wasting, and Denial of Service.
    • The competition focuses on Prompt Leaking, Harmful Information Generation (Target Phrase Generation), and Malicious Action Generation. Other intents are not directly studied but believed to be applicable in similar settings.
  • Competition Structure and Implementation:
    • The competition presented ten real-world-inspired prompt hacking challenges, each with specific task descriptions and prompt templates.
    • Participants used various models (GPT-3, ChatGPT, FlanT5-XXL) to generate responses and were encouraged to submit JSON files with their prompt-model pairings for each challenge.
    • The competition setup aimed to mirror real-world attack scenarios, with the goal of either outputting a specific phrase or a hidden key in the prompt template.
  • The following figure from the paper illustrates that applications of LLMs often define the task via a prompt template (top left), which is combined with user input (bottom left). The authors created a competition to see if user input can overrule the original task instructions and elicit a specific target output (right).

  • Analysis of Prompt Hacking Strategies:
    • The competition revealed various hacking strategies, including novel techniques like Context Overflow attacks.
    • Two datasets were used for analysis: Submissions Dataset and Playground Dataset, providing a comprehensive view of the hacking process and the effectiveness of different strategies.
    • Notable strategies included Two Token Attack, using Chinese characters to bypass letter separation, and Context Overflow attacks.
    • The analysis also explored the frequent words used in successful hacks, offering insights into potential defense strategies.
  • Taxonomical Ontology of Prompt Hacking Exploits:
    • The paper presents a comprehensive taxonomy of prompt hacking techniques, breaking down attacks into component parts and describing their relationships.
    • This ontology is the first data-driven classification of prompt hacking exploits, based on the competition submissions and literature review.
    • It includes various attack types such as Simple Instruction Attack, Context Ignoring Attack, and many others.
  • Conclusion and Ethical Considerations:
    • The competition aimed to advance research in LLM security and prompt hacking, yielding over 600K adversarial prompts.
    • It documented 29 prompt hacking techniques and explored their generalizability across different intents, LLMs, and modalities.
    • The paper discusses the ethical considerations of releasing such a dataset, emphasizing its utility for defensive purposes and responsible use.
    • Limitations include the focus on a few LLMs, potential non-reproducibility due to model updates, and the unexplored risks like training data poisoning.
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
  • This paper by Ovadia et al. from Microsoft presents an insightful comparison of knowledge injection methods in large language models (LLMs). The core question addressed is whether unsupervised fine-tuning (USFT) is more effective than retrieval-augmented generation (RAG) for improving LLM performance on knowledge-intensive tasks.
  • The researchers focus on LLMs’ ability to memorize, understand, and retrieve factual data, using a knowledge base scraped from Wikipedia and a dataset of current-events questions created with GPT-4. The study employs models like Llama2-7B, Mistral-7B, and Orca2-7B, evaluating them on tasks from the Massive Multitask Language Understanding (MMLU) benchmark and a current-events dataset.
  • Two methods of knowledge injection are explored: fine-tuning, which continues the model’s pre-training process using task-specific data, and retrieval-augmented generation (RAG), which uses external knowledge sources to enhance LLMs’ responses. The paper also delves into supervised, unsupervised, and reinforcement learning-based fine-tuning methods.
  • The key finding is that RAG outperforms unsupervised fine-tuning for knowledge injection. RAG, which draws on external knowledge sources, is notably more effective than USFT alone and even outperforms the combination of RAG and fine-tuning, particularly in scenarios where questions directly correspond to the auxiliary dataset. This suggests that USFT may not be as efficient in embedding new knowledge into the model’s parameters.
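  • As context for the comparison, the RAG side can be caricatured in a few lines, with bag-of-words cosine similarity standing in for a real retriever (the function names and prompt format are invented for illustration):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_prompt(question: str, corpus: list[str], k: int = 2) -> str:
    """Minimal RAG sketch: rank documents against the question by
    bag-of-words cosine similarity and prepend the top-k as context."""
    q = Counter(question.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```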
  • The figure below from the paper shows a visualization of the knowledge injection framework.

  • Note that USFT in this context is a direct continuation of pre-training (hence also called continued pre-training in the literature), predicting the next token on the dataset. Interestingly, fine-tuning with multiple paraphrases of the same fact significantly improves over the baseline, indicating the importance of repetition and varied presentation of information for effective knowledge assimilation.
  • The authors created a knowledge base by scraping Wikipedia articles relevant to various topics, which was used for both fine-tuning and RAG. Additionally, a dataset of multiple-choice questions about current events was generated using GPT-4, with paraphrases created to augment this dataset.
  • Limitations of the study include the exclusive focus on unsupervised fine-tuning, without exploring supervised fine-tuning or reinforcement learning from human feedback (RLHF). The study also notes a high variance in accuracy performance across experiments, making it challenging to ascertain the statistical significance of the results.
  • The paper also questions why baseline models don’t achieve a 25% accuracy rate for multiple-choice questions with four options, suggesting that the tasks may not represent truly “unseen” knowledge. Moreover, the research primarily assesses straightforward knowledge or fact tasks, without delving into reasoning capabilities.
  • In summary, while fine-tuning can be beneficial, RAG is identified as a superior method for knowledge injection in LLMs, especially for tasks involving new information. The results highlight the potential of using diverse fine-tuning techniques and auxiliary knowledge bases for further research in this domain.
Tuning Language Models by Proxy
  • This paper by Liu et al. from the University of Washington and the Allen Institute for AI introduces proxy-tuning, a promising method for adapting large language models (LLMs) without direct model modification. Proxy-tuning operates during the decoding phase, where it modifies the logits of the target LLM using a smaller, modifiable “proxy” LM. Specifically, the method computes the difference in the logits over the output vocabulary between a smaller base model (M2) and a finetuned version of it (M3), and then applies this difference to the target model’s (M1) logits. Put simply, the improved target model’s outputs are calculated as \(M1'(x) = M1(x) + [M3(x) - M2(x)]\).
  • The adjusted output aligns with the predictions of the small tuned model (the expert), contrasting its untuned version (the anti-expert), and ultimately influencing the large, fixed LM’s predictions while maintaining the benefits of large-scale pretraining.
  • The technique was explored in various contexts including instruction-tuning, domain adaptation (specifically for code), and task-specific finetuning (e.g., for question-answering and math problems). For instruction-tuning experiments, proxy-tuning was applied to models of the Llama2 family. The process involved improving a Llama2 70B Base model to the level of Llama2 70B Chat without RLHF, using a 7B-parameter expert for guidance. The results demonstrated that proxy-tuning significantly closes the performance gap between base models and their directly tuned counterparts, achieving performance close to Llama2 70B Chat.
  • The annotated figure below from the paper (source) illustrates: (Top) proxy-tuning “tunes” a large pretrained model without accessing its internal weights, by steering it using an “expert” (a small tuned model) and its corresponding “anti-expert” (the small model, untuned). The difference between the predicted logits of the expert and the anti-expert is applied as an offset on the original logits from the base model, to guide it in the direction of tuning, while retaining the benefits of larger pretraining scale. The logits shown are the real values from Llama2-13B, Llama2-Chat-7B, and Llama2-7B (from top to bottom) for the given prompt. (Bottom): proxy-tuning pushes the base model close to the directly tuned model without updating any weights of the base model.

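  • The logit arithmetic above is easy to state in code; a sketch in which `alpha` plays the role of the guidance-level hyperparameter the paper introduces (alpha = 1 recovers \(M1'(x) = M1(x) + [M3(x) - M2(x)]\)):

```python
import math

def proxy_tune_logits(base, expert, anti_expert, alpha=1.0):
    """Decoding-time proxy tuning: shift the frozen base model's logits
    (M1) by the difference between a small tuned expert (M3) and its
    untuned counterpart (M2). alpha scales the strength of guidance."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]

def softmax(logits):
    """Turn adjusted logits into a next-token distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```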
  • In domain adaptation, the method was used to adapt pre-trained models to coding tasks using CodeLlama-7B-Python as the expert. This led to substantial improvements in coding benchmarks, though proxy-tuning a larger model did not always surpass the performance of the tuned 7B expert in this context.
  • The method’s effectiveness was also tested in task finetuning for various tasks. Applying the proxy-tuning technique to large models with a small task-specific expert resulted in significant performance improvements across tasks.
  • The paper further examines the influence of proxy-tuning at the token level, observing a notable contribution in promoting reasoning and stylistic tokens. A hyperparameter introduced in the method allows control over the guidance level, facilitating a balance between different attributes of generations.
  • Proxy-tuning provides an efficient way to customize large, potentially proprietary LMs through decoding-time guidance, without needing direct model modifications. The one caveat is that the smaller models must be trained on the same vocabulary as the larger model; within that constraint, specialized versions of LLMs can be created using this approach.
Group Preference Optimization: Few-shot Alignment of Large Language Models
  • This paper by Zhao et al. from UCLA proposes Group Preference Optimization (GPO), a novel framework for aligning large language models (LLMs) with the opinions and preferences of desired interest group(s) in a few-shot manner. The method aims to address the challenge of steering LLMs to align with various groups’ preferences, which often requires substantial group-specific data and computational resources. The key idea in GPO is to view the alignment of an LLM policy as a few-shot adaptation problem within the embedded space of an LLM.
  • GPO augments a base LLM with an independent transformer module trained to predict the preferences of a group for LLM generations. This module is parameterized via an independent transformer and is trained via meta-learning on several groups, allowing for few-shot adaptation to new groups during testing. The authors employ an in-context autoregressive transformer, offering efficient adaptation with limited group-specific data. Put simply, the preference module in GPO is trained to explicitly perform in-context supervised learning to predict preferences (targets) given joint embeddings (inputs) of prompts and corresponding LLM responses. These embeddings allow efficient processing of in-context examples, with each example being a potentially long sequence of prompt and generated response. The module facilitates rapid adaptation to new, unseen groups with minimal examples via in-context learning.
  • GPO is designed to perform group alignment by learning a few-shot preference model that augments the base LLM. Once learned, the preference module can be used to update the LLM via any standard preference optimization or reweighting algorithm (e.g., PPO, DPO, Best-of-N). Specifically, GPO is parameterized via a transformer and trained to perform in-context learning on the training preference datasets. Given a training group \(g \in G_{\text {train }}\), they randomly split its preference dataset \(\mathcal{D}_g\) into a set of \(m\) context points and \(n-m\) target points, where \(n=\|\mathcal{D}_g\|\) is the size of the preference dataset for group \(g\). Thereafter, GPO is trained to predict the target preferences \(y_{m+1: n}^g\) given the context points \(\left(x_{1: m}^g, y_{1: m}^g\right)\) and target inputs \(x_{m+1: n}^g\). Mathematically, this objective can be expressed as:

    \[L(\theta)=\mathbb{E}_{g, m}\left[\log p_\theta\left(y_{m+1: n}^g \mid x_{1: n}^g, y_{1: m}^g\right)\right]\]
    • where the training group \(g \sim G_{\text {train }}\) and context size \(m\) are sampled uniformly. \(\theta\) represents the parameters of the GPO preference model.
  • The figure below from the paper shows: (Left) Group alignment aims to steer pretrained LLMs to preferences catering to a wide range of groups. For each group \(g\), they represent its preference dataset as \(\mathcal{D}_g=\) \(\left\{\left(x_1^g, y_1^g\right), \ldots,\left(x_n^g, y_n^g\right)\right\}\). Here, \(y_i^g\) signifies the preference of group \(g\) for a pair of given prompt \(q_i^g\) and response \(r_i^g\), while \(x_i^g\) is its LLM representation obtained with \(\pi_{\mathrm{emb}}\left(q_i^g, r_i^g\right)\). (Right) Once trained, GPO provides a few-shot framework for aligning any base LLM to a test group given a small amount of in-context preference data.

  • GPO’s architecture is designed with a permutation-invariant inductive bias, discarding the positional encodings found in standard transformers. However, this discards the pairwise relations between inputs and outputs. To preserve them, GPO concatenates each pair of inputs and outputs into a single token, informing the transformer of their pairwise relation. The target inputs are padded with a dummy token (e.g., 0), and a masking strategy is employed whereby context pairs can self-attend but padded targets can only attend to context points.
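  • The masking strategy described above can be sketched as a boolean attention mask over the \(n\) concatenated pair-tokens, of which the first \(m\) are context (a simplified illustration, not the authors' code):

```python
def gpo_attention_mask(n: int, m: int) -> list[list[bool]]:
    """Attention mask for GPO's transformer over n concatenated
    (input, preference) pair-tokens, the first m being context pairs:
    every token may attend to the context points, while padded target
    tokens may not attend to other targets (or themselves)."""
    return [[j < m for j in range(n)] for _ in range(n)]
```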
  • Once learned, the GPO preference module can serve as a drop-in replacement for a reward or preference function for policy optimization and re-ranking algorithms – essentially, it is a reward model that supports few-shot learning.
  • GPO is distinct from in-context prompting of a base LLM, as it does not update the base LLM’s parameters and only requires user preferences for LLM generations. The few-shot model learned by GPO augments the base LLM, offering more flexibility than traditional prompting methods.
  • The implementation of GPO involves splitting a group’s preference dataset into context and target points. The model is trained to predict target preferences given the context points and target inputs. The figure below from the paper illustrates the GPO architecture for a sequence of \(n\) points, with \(m\) context points and \(n-m\) target points. The context \(\left(x_{1: m}, y_{1: m}\right)\) serves as few-shot conditioning for GPO. GPO processes the full sequence using a transformer and predicts the preference scores \(\hat{y}_{m+1: n}\).

  • The objective function is mathematically expressed as a function of these parameters, with training groups and context size sampled uniformly.
  • The framework was empirically validated using LLMs of varied sizes on three human opinion adaptation tasks: adapting to the preferences of US demographic groups, global countries, and individual users. Results showed that GPO not only aligns models more accurately to these preferences but also requires fewer group-specific preferences and less computational resources, outperforming existing strategies like in-context steering and fine-tuning methods.
  • Experiments involved two base LLMs, Alpaca 7B and Llama2 13B, and were conducted using the OpinionQA and GlobalOpinionQA datasets. GPO demonstrated significant improvements over various baselines, achieving a 7.1% increase in alignment score over the In-context Finetune method for the OpinionQA dataset and an 8.4% improvement for the GlobalOpinionQA dataset.
  • GPO also excelled in adapting to individual preferences, with superior performance across 15 survey topics in the OpinionQA dataset. This ability is particularly noteworthy given the diverse and often contrasting opinions within individual and demographic groups.
  • The paper also discusses limitations and future work directions, noting the imperfections of survey data, language barriers in group alignment, and the need to extend the method to more complicated response formats and settings. Additionally, the authors highlight potential ethical concerns, such as misuse of aligned models and amplification of biased or harmful outputs, suggesting future research should address these issues.
  • Code
Large Language Models Are Neurosymbolic Reasoners
  • This paper by Fang et al. from the University of Liverpool, Eindhoven University of Technology, University of Technology Sydney, and University College London in AAAI 2024 investigates the application of Large Language Models (LLMs) as symbolic reasoners in the context of text-based games. These games, serving as benchmarks for artificial intelligence, often present challenges in areas such as mathematics, logic puzzles, and navigation, requiring a combination of NLP and symbolic reasoning skills.
  • The study addresses the issue of present-day LLMs predominantly leveraging patterns in data without explicit symbolic manipulation capabilities. To enhance LLMs for tasks necessitating strong symbolic reasoning within text-based game environments, the authors propose an LLM agent integrated with a symbolic module. This hybrid approach allows the LLM to process observations and valid actions from the games, enhancing its capability in symbolic reasoning tasks.
  • The symbolic module works in tandem with the LLM, processing observations and valid actions from the game environments. It enhances the LLM’s ability to understand complex symbolic tasks such as arithmetic calculations, logical deductions, and navigation in the games. This integration illustrates a move towards a more hybrid AI approach, where statistical learning and symbolic reasoning are orchestrated to handle complex tasks. The table below from the paper shows examples of text-based games with symbolic tasks and how their corresponding symbolic modules are utilized. INPUT refers to the current action that is sent to the symbolic modules. RESPONSE denotes the responses generated by the symbolic modules at the present time.

  • Symbolic modules thus play a crucial role in maximizing the reasoning capabilities of LLMs. For example, as shown in the figure below, consider a scenario where a mathematical problem is presented and a calculator is available: the LLM’s reasoning can invoke the calculator to complete the task in a zero-shot manner, since employing an external tool is itself treated as an action. The scenarios include four distinct symbolic modules: the Calculation Module, Sorting Module, Knowledge Base Module, and Navigation Module. For instance, in a mathematical task, the LLM agent may select a computational action such as “multiply 8 by 7” (mul 8 7). This action triggers the symbolic module to calculate the product, and the resulting observation, “Multiplying 8 and 7 results in 56,” is then returned.
  • The figure below from the paper shows an overview of how an LLM agent plays text-based games with external symbolic modules. The procedure is as follows: the LLM agent is first given a role-initialization prompt, and its first observation comes from the game environment. As depicted in the diagram, actions selected by the LLM’s reasoning activate the symbolic module, which then returns its output, including module-related observations. The LLM agent’s next action is informed by the symbolic module’s outcome, and this loop repeats until the end of the game.

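  • The calculator dispatch described earlier (an action like `mul 8 7` yielding the observation “Multiplying 8 and 7 results in 56”) can be sketched as a toy Calculation Module; the parsing and phrasing are illustrative:

```python
import operator

def calculation_module(action: str) -> str:
    """Toy Calculation Module: parse an action string such as
    'mul 8 7' and return a natural-language observation for the
    LLM agent to condition its next action on."""
    ops = {"add": (operator.add, "Adding"),
           "sub": (operator.sub, "Subtracting"),
           "mul": (operator.mul, "Multiplying"),
           "div": (operator.truediv, "Dividing")}
    name, a, b = action.split()
    fn, verb = ops[name]
    result = fn(float(a), float(b))
    result = int(result) if result == int(result) else result
    return f"{verb} {a} and {b} results in {result}"
```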
  • The LLM agent, initialized with its role and equipped with symbolic computational capacity, utilizes both natural language processing abilities and symbolic reasoning to select and perform in-game actions. This integration illustrates a move towards a more hybrid AI approach, where statistical learning and symbolic reasoning are orchestrated to handle complex tasks.
  • Experimental results show the LLM agent outperforming baselines in tasks requiring symbolic reasoning, achieving an impressive average performance of 88% across various tasks. This demonstrates the effectiveness of integrating symbolic reasoning with LLMs for complex decision-making in text-based games.
  • The implementation details include tailored prompts for interaction with the game and symbolic modules, focusing on specific tasks like arithmetic, map reading, sorting, and common sense applications. The agent’s in-context generalization capabilities highlight the feasibility of using LLMs as neurosymbolic reasoners without extensive training data.
  • The paper indicates a promising pathway for the utilization of LLMs in symbolic reasoning applications. Future steps involve refining and scaling the methodology for more complex environments, exploring the agent’s limitations and potential biases within these symbolic tasks, and potentially extending such hybrid models beyond gaming to other real-world symbolic tasks such as automated theorem proving, software program synthesis, or advanced knowledge representation.
LM-Infinite: Simple On-The-Fly Length Generalization for Large Language Models
  • This paper by Han et al. from University of Illinois Urbana-Champaign and Meta, presents LM-Infinite, a method for enabling large language models (LLMs) to generate longer texts without retraining or additional parameters.
  • LM-Infinite addresses the issue of LLMs’ length generalization failures, particularly on unseen sequence lengths. The authors identify three out-of-distribution (OOD) factors contributing to this problem: (1) unseen distances, (2) unseen number of tokens, and (3) implicitly encoded absolute position. The figure below from the paper shows a diagnosis of three OOD factors in LLMs.

  • The proposed solution, LM-Infinite, introduces a \(\Lambda\)-shaped attention mask and a distance limit during attention. This method is computationally efficient, requiring only \(O(n)\) time and space, and is compatible with various LLMs using relative-position encoding methods.
  • In terms of implementation, LM-Infinite uses a sliding window approach with a \(\Lambda\)-shaped mask for attention calculation. It allows for longer sequence handling by dynamically adjusting the window size based on token distances. The model also incorporates a novel truncation scheme to handle tokens outside the attention window, which are summarized and reused to maintain context relevance without significant computational overhead. The figure below from the paper shows: (a) LM-Infinite is a plug-and-play solution for various LLMs, consisting of a \(\Lambda\)-shaped mask and a distance constraint during attention. (b) A conceptual model for understanding how relative position encoding works.

  • An overview of LM-Infinite is illustrated in the figure above. This simple solution consists of two components: a \(\Lambda\)-shaped attention mask and a distance limit. As visualized in the figure, the \(\Lambda\)-shaped attention mask has two branches: a global branch on the left and a local branch on the right. The global branch allows each token to attend to the starting \(n_{\text{global}}\) tokens if they appear before the current token. The local branch allows each token to attend to preceding tokens within \(n_{\text{local}}\) distance. Any other tokens outside these two branches are ignored during attention. They heuristically set \(n_{\text{local}} = L_{\text{pretrain}}\), i.e., equal to the pre-training length limit; this choice keeps attention within the “comfort zone” of LLMs. The selection of \(n_{\text{global}}\) has less effect on model performance, and they find that the range \([10, 100]\) generally works well; note that \(n_{\text{global}} = 0\) leads to immediate quality degradation. This design is based on OOD factors 2 and 3 above: it controls the number of tokens attended to while ensuring the inclusion of the starting tokens. Theoretically, LM-Infinite can access information from a context as long as \(n_{\text{layer}} L_{\text{pretrain}}\), because each deeper layer allows the attention to span \(L_{\text{pretrain}}\) farther than the layer above.
  • The distance limit bounds the “effective distance” \(d\) within \(L_{\text{pretrain}}\), and only affects tokens in the global branch. Specifically, recall that in relative positional encoding the attention logit is originally calculated as \(w(\mathbf{q}, \mathbf{k}, d)\), where \(d\) is the distance between the two tokens. LM-Infinite modifies it to \(w(\mathbf{q}, \mathbf{k}, \min(d, L_{\text{pretrain}}))\). This design is motivated by OOD factor 1 and ensures that LLMs are never exposed to distances unseen during pre-training.
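  • The two components above can be sketched in a few lines of NumPy: a boolean \(\Lambda\)-shaped mask (global branch of the first \(n_{\text{global}}\) tokens plus a local branch of width \(n_{\text{local}}\)) and the distance clamp \(\min(d, L_{\text{pretrain}})\). This is a minimal illustration of the masking logic, not the authors' implementation:

```python
import numpy as np

def lambda_mask(seq_len, n_global, n_local):
    """Boolean attention mask: True = attend. Causal, with a global
    branch (the first n_global tokens) and a local branch (distance < n_local)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    global_branch = j < n_global
    local_branch = (i - j) < n_local
    return causal & (global_branch | local_branch)

def clamped_distance(seq_len, L_pretrain):
    """Effective distance min(d, L_pretrain) fed into the attention
    logit w(q, k, d); only entries kept by the mask are meaningful."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.minimum(i - j, L_pretrain)
```

  • With \(n_{\text{global}} = 2\) and \(n_{\text{local}} = 4\), token 9 can attend to tokens 0 and 1 (global branch) and tokens 6 to 9 (local branch), but not to token 4, which falls outside both branches.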
  • LM-Infinite demonstrates consistent text generation fluency and quality for lengths up to 128k tokens on the ArXiv and OpenWebText2 datasets, achieving 2.72x decoding speedup without parameter updates.
  • Evaluation includes experiments on ArXiv and OpenWebText2 corpora with models like LLaMA, Llama-2, MPT-7B, and GPT-J. LM-Infinite shows comparable or superior performance to LLMs fine-tuned on long sequences in terms of perplexity and BLEU/ROUGE metrics.
  • The paper suggests that LM-Infinite can extend task-solving ability to longer sequences than training samples and offers potential for future work in understanding information in the masked-out attention region.
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
  • This paper by Jin et al. addresses the limitation of Large Language Models (LLMs) in managing long context sequences, particularly due to their training on fixed-length sequences. This results in unpredictable behavior and performance degradation when LLMs encounter long input sequences during inference. The authors propose that LLMs inherently possess the ability to handle long contexts without requiring fine-tuning.
  • Their core contributions are as follows:
    1. Self-Extend Method: The paper introduces ‘Self-Extend’, a method aimed at leveraging the intrinsic long context capabilities of LLMs. Self-Extend uses a FLOOR operation to map unseen large relative positions (encountered during inference) to known positions from the training phase. This approach is designed to address the positional out-of-distribution issue, allowing LLMs to process longer texts coherently without additional fine-tuning.
    2. Bi-Level Attention Information: Self-Extend constructs bi-level attention - group level and neighbor level. Group level attention is for tokens with long distances, employing the FLOOR operation on positions. Neighbor level attention targets nearby tokens without any modifications. This design aims to retain the relative ordering of information in extended texts.
  • The figure below from the paper shows the attention score matrix (the matrix before the softmax operation) of the proposed Self-Extend when a sequence of length 10 is fed to an LLM with a pretraining context window (\(L\)) of length 7. Each number is the relative distance between the corresponding query and key tokens. Self-Extend uses two kinds of attention: for neighbor tokens within the neighbor window (\(w_n\), here 4), it uses the normal self-attention of Transformers; for tokens outside the window, it uses the values from the grouped attention, with group size (\(G\)) set to 2. After the two parts are merged, the softmax operation is applied to the attention value matrix, as in normal attention, to obtain the attention weight matrix.

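  • The merged relative-position matrix above can be sketched as follows. This is an illustrative reconstruction, assuming the grouped distances are shifted by \(w_n - \lfloor w_n / G \rfloor\) so that they continue contiguously past the neighbor window; the paper's exact merging may differ in details:

```python
import numpy as np

def self_extend_positions(seq_len, G, w_n):
    """Merged relative positions: normal distances inside the neighbor
    window, FLOOR-grouped distances (shifted to stay contiguous) outside
    it. Only causal (lower-triangular) entries are filled; others are -1."""
    rel = -np.ones((seq_len, seq_len), dtype=int)
    for i in range(seq_len):
        for j in range(i + 1):
            d = i - j
            if d < w_n:
                rel[i, j] = d                                  # neighbor attention
            else:
                # grouped attention: FLOOR positions, shifted past w_n - 1
                rel[i, j] = i // G - j // G + (w_n - w_n // G)
    return rel
```

  • For the figure's setting (length 10, \(G = 2\), \(w_n = 4\)), the largest merged distance is 6, which stays inside the pretraining window of \(L = 7\), so no out-of-distribution positions are ever seen.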
  • The effectiveness of Self-Extend was evaluated on popular LLMs like Llama-2, Mistral, and SOLAR across a variety of tasks, including language modeling, synthetic long context tasks, and real-world long context tasks. The results showed significant improvements in long context understanding, often surpassing fine-tuning-based methods.
  • Performance was analyzed along the following dimensions:
    1. Language Modeling: Self-Extend was evaluated for language modeling on the PG19 dataset, which comprises long books. It effectively maintained low perplexity outside the pretraining context window for models like Llama-2-chat and Mistral.
    2. Synthetic Long Context Tasks: The method was assessed using the passkey retrieval task, testing the LLMs’ ability to recognize information throughout the entire length of a long text sequence.
    3. Real Long Context Tasks: The evaluation included benchmarks such as Longbench and L-Eval. Results showed notable performance improvements across different datasets. Self-Extend was found to perform comparably or better than several fine-tuned models, with some exceptions attributed to the specific characteristics of certain datasets.
  • The paper concludes that Self-Extend, which is effective during inference without needing fine-tuning, can achieve performance on par with or superior to learning-based methods. This represents a significant advancement in enhancing the natural and efficient handling of longer contexts by LLMs.
Large Language Models are Null-Shot Learners
  • This paper by Taveekitworachai et al. from Ritsumeikan University introduces the concept of null-shot (\(\varnothing\)-shot) prompting, a novel approach aimed at addressing the issue of hallucination in Large Language Models (LLMs). Hallucination refers to the phenomenon where LLMs generate factually incorrect or ungrounded information, particularly in zero-shot learning scenarios where models need to perform tasks without specific example-based fine-tuning. The study focuses on embracing hallucination to improve LLM performance on tasks like reading comprehension and arithmetic reasoning, overcoming some limitations inherent in zero-shot prompting.
  • Null-shot prompting works by directing the LLM to reference a non-existent “Examples” section, a counterintuitive method that paradoxically encourages the model to draw on its internal knowledge, leading to more accurate or relevant responses. This strategy banks on the LLM’s capacity for imaginative synthesis, effectively tricking it into producing more useful information for tasks it has not been fine-tuned for.
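  • A null-shot prompt might be assembled as below. The phrasing of the reference to the non-existent “Examples” section is illustrative, not the paper's exact template:

```python
def null_shot_prompt(instruction, question):
    """Build a null-shot prompt: the 'Examples' section is referenced
    but never actually provided (illustrative phrasing)."""
    return (
        f"{instruction}\n"
        "Look at examples in the Examples section and utilize examples "
        "and information from that section to perform the following task.\n"
        f"Task: {question}"
    )
```

  • The key point is that, unlike few-shot prompting, no examples are ever included; the model is merely told they exist.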
  • The figure below from the paper shows examples outputted by PaLM 2 for Chat for WinoGrande.

  • The paper encompasses experiments across six different LLMs over eight datasets in areas like arithmetic reasoning, commonsense reasoning, reading comprehension, and closed-book question answering. The results indicate notable improvements in task performance, illustrating the potential of \(\varnothing\)-shot prompting in harnessing hallucinations for enhanced output.
  • A key aspect of the study is its exploration of \(\varnothing\)-shot prompting as a method for hallucination detection in LLMs. The performance enhancement achieved via \(\varnothing\)-shot prompting could act as an indicator of the level of hallucination in a model, potentially serving as a new diagnostic tool.
  • Several ablation studies, including scaling analysis and modifications inspired by zero-shot chain-of-thought prompting, provide insights into the effectiveness of null-shot prompting. The positioning of the \(\varnothing\)-shot phrase, whether before or after the task instruction, significantly influences performance.
  • The authors speculate that LLMs internally conjure examples based on their training, influencing task performance. This suggests an inherent “mental model” within LLMs that can be utilized for task execution, bypassing the need for explicit examples.
  • Looking ahead, the potential applications of \(\varnothing\)-shot prompting are vast. It could lead to a subtle shift in LLM training and task performance evaluation, especially for more complex reasoning tasks. The technique could also integrate with existing methodologies, blurring the lines between perceived weaknesses and strengths of LLMs.
  • The paper concludes by discussing the implications of \(\varnothing\)-shot prompting in various fields, acknowledging limitations such as potential dataset poisoning risks and the uncertainty surrounding the nature of examples envisioned by LLMs during output generation.
  • This comprehensive study underscores the innovative concept of leveraging inherent hallucination tendencies in LLMs, showcasing a unique approach to enhancing their performance and providing new insights into their operational characteristics.
Knowledge Fusion of Large Language Models
  • This paper by Wan et al. from Sun Yat-sen University and Tencent AI Lab, published at ICLR 2024, introduces the concept of knowledge fusion for Large Language Models (LLMs). It presents a method to merge the capabilities of different pre-trained LLMs into a more potent model, addressing the high costs and redundancy in LLM development.
  • The novel approach, FuseLLM, leverages generative distributions from source LLMs to transfer their collective knowledge and strengths to a target LLM. It involves aligning tokenizations from different LLMs and fusing their probability distributions using lightweight continual training. This reduces divergence between the target LLM’s probabilistic distributions and those of the source LLMs.
  • The figure below from the paper illustrates conventional model fusion techniques (ensemble and weight merging) and our knowledge fusion approach for LLMs (FuseLLM). Different animal icons represent different LLMs, with various species denoting LLMs possessing differing architectures. FuseLLM externalizes the knowledge from multiple LLMs and transfers their capabilities to a target LLM.

  • The paper tests this concept with three distinct LLMs: Llama-2, MPT, and OpenLLaMA. Evaluations across 42 tasks in reasoning, commonsense, and code generation showed that FuseLLM outperforms each source LLM and the baseline in most tasks. For example, in Big-Bench Hard (BBH), FuseLLM improved exact match accuracy by 2.5% compared to Llama-2 CLM, showing a 3.9× reduction in token requirement.
  • The paper also compares FuseLLM with traditional model ensemble and weight merging techniques. In experiments simulating scenarios with multiple LLMs trained on distinct corpora, FuseLLM consistently achieved the lowest average perplexity, demonstrating its effectiveness in integrating diverse models.
  • The study underscores the potential of FuseLLM for harnessing collective knowledge more effectively than traditional methods, particularly given the diverse structures and substantial sizes of LLMs.
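  • A minimal NumPy sketch of the fusion objective follows: pick the source distribution with the lowest cross-entropy against the gold tokens (a MinCE-style selection), then train the target to minimize a weighted sum of the language-modeling loss and its divergence from the fused distribution. The weighting and the per-sequence selection here are simplifications of the paper's recipe:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(p, gold):
    """Mean negative log-likelihood of gold token ids under probs p."""
    return -np.log(p[np.arange(len(gold)), gold] + 1e-12).mean()

def fuse_min_ce(source_probs, gold):
    """Select the source distribution with the lowest CE on the gold
    tokens (a MinCE-style fusion; simplified to whole sequences)."""
    ces = [cross_entropy(p, gold) for p in source_probs]
    return source_probs[int(np.argmin(ces))]

def fusellm_loss(target_logits, source_probs, gold, lam=0.9):
    """lam * LM loss + (1 - lam) * KL(fused || target); the weighting
    is illustrative."""
    q = softmax(target_logits)
    p = fuse_min_ce(source_probs, gold)
    lm = cross_entropy(q, gold)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean()
    return lam * lm + (1 - lam) * kl
```

  • Minimizing the KL term pulls the target's distributions toward the (best) source distributions, which is how the continual-training stage transfers the source models' collective knowledge.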
  • Code
MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models
  • This paper by Yang et al. from various universities explores the use of Large Language Models (LLMs) for interpretable mental health analysis on social media. The authors introduce MentaLLaMA, an open-source LLM series capable of generating high-quality explanations for mental health analysis.
  • The authors develop the Interpretable Mental Health Instruction (IMHI) dataset, featuring 105K samples from 10 social media sources covering 8 mental health tasks. The dataset is unique for its multi-task and multi-source nature, focusing on generating detailed explanations for each mental health condition.
  • The figure below from the paper illustrates some examples of MentaLLaMA’s capabilities in diverse mental health analysis tasks.

  • To construct the dataset, they used expert-written few-shot prompts to generate explanations via ChatGPT. The explanations’ quality was assured through rigorous automatic and human evaluations, ensuring correctness, consistency, and professional standards.
  • MentaLLaMA models, based on the LLaMA2 foundation models, are fine-tuned on the IMHI dataset. Three versions of MentaLLaMA were developed: MentaLLaMA-7B, MentaLLaMA-chat-7B, and MentaLLaMA-chat-13B, each demonstrating strong capabilities in mental health classification and explanation generation.
  • The models’ performance was evaluated on the IMHI benchmark, demonstrating that MentaLLaMA-chat-13B approaches or surpasses state-of-the-art discriminative methods in correctness. The model was also found to generate explanations of ChatGPT-level quality, benefiting from instruction tuning, reinforcement learning from human feedback, and increasing model sizes.
  • MentaLLaMA models show strong generalizability to unseen tasks, underlining their adaptability in diverse mental health analysis scenarios. The research provides an essential step towards interpretable, automatic mental health analysis on social media, leveraging the advanced capabilities of large language models.
  • Code
ChatQA: Building GPT-4 Level Conversational QA Models
  • This paper by Liu et al. from NVIDIA introduces ChatQA, a family of conversational question-answering (QA) models that achieve GPT-4 level accuracies. The main contribution is a two-stage instruction tuning method which significantly improves zero-shot conversational QA results from large language models (LLMs).
  • The first stage involves supervised fine-tuning (SFT) on a blend of instruction-following and dialogue datasets. The second stage, context-enhanced instruction tuning, integrates contextualized QA datasets into the instruction tuning blend. This method outperforms regular instruction tuning or RLHF-based recipes (e.g. Llama-2-Chat).
  • For retrieval-augmented generation (RAG) in conversational QA, the authors fine-tune state-of-the-art single-turn query retrievers on a human-annotated multi-turn QA dataset. This approach is as effective as using state-of-the-art LLM-based query rewriting models (e.g., GPT-3.5-turbo) but more cost-effective.
  • The figure below from the paper illustrates the two-stage instruction tuning framework for ChatQA.

  • The authors built a family of ChatQA models based on Llama2-7B, Llama2-13B, Llama2-70B, and an in-house 8B pretrained GPT model. They conducted a comprehensive study on 10 conversational QA datasets, including datasets requiring retrieval and datasets with tables. ChatQA-70B outperformed GPT-4 in terms of average score on these datasets without using any synthetic data from OpenAI GPT models.
  • They also investigated the “unanswerable” scenario, demonstrating that adding a small number of “unanswerable” samples in instruction tuning can steer the model to generate “cannot answer” responses, thereby reducing hallucination. ChatQA-70B outperformed GPT-3.5-turbo in this regard, though it had a slight gap compared to GPT-4.
  • The paper highlights the effectiveness of the proposed two-stage instruction tuning method, emphasizing its superiority in integrating user-provided or retrieved context for zero-shot conversational QA tasks. The method’s ability to enhance LLMs’ capability of integrating context in conversational QA is a significant advancement.
Parameter-efficient Tuning for Large Language Model without Calculating Its Gradients
  • This paper by Jin et al. from the Institute of Automation, Chinese Academy of Sciences; the School of Artificial Intelligence, University of Chinese Academy of Sciences; Wuhan AI Research; and the Shanghai Artificial Intelligence Laboratory introduces a novel method for tuning large language models (LLMs) without calculating their gradients. The approach, aimed at addressing the high computational and memory demands of conventional tuning methods, leverages similarities between the parameter-efficient modules learned by large and small language models.
  • The proposed method transfers parameter-efficient modules, initially derived from small language models (SLMs), to larger ones. To tackle dimension mismatch and facilitate effective interaction between models, a Bridge model is introduced. This model ensures dimensional consistency and dynamic interaction between the parameter-efficient module and the LLM.
  • The figure below from the paper illustrates: (a) In the training stage, they use parameter-efficient tuning methods to learn task-specific characteristics in a smaller model and fine-tune the Bridge model with the acquired parameter-efficient modules. (b) In the inference stage, they directly plug the parameter-efficient modules into the large language model for efficient predictions.

  • The effectiveness of this approach is demonstrated using the T5 and GPT-2 series of language models on the SuperGLUE benchmark. It achieves comparable performance to traditional fine-tuning and parameter-efficient tuning methods like Adapter, Prefix tuning, and LoRA, but without the need for gradient-based optimization. Notably, it also achieves a significant memory reduction, up to 5.7× compared to traditional parameter-efficient tuning methods.
  • The process includes employing parameter-efficient tuning methods on a SLM to learn task-specific characteristics, followed by fine-tuning a Bridge model along with the parameter-efficient module. This dual-step method ensures the acquired module matches the LLM’s dimensions and enriches it with the larger model’s knowledge. The tuned module is then seamlessly integrated into the LLM for inference.
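  • The plug-in mechanics can be sketched as below: a bottleneck adapter trained at the small model's hidden size, and a Bridge that projects the large model's hidden states down to that size and back so the adapter can be plugged in. All class names, shapes, and the use of random weights are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Bottleneck adapter learned on the small LM (d_small features)."""
    def __init__(self, d_small, r=8):
        self.down = rng.normal(0, 0.02, (d_small, r))
        self.up = rng.normal(0, 0.02, (r, d_small))
    def __call__(self, h):                       # h: (seq, d_small)
        return h + np.maximum(h @ self.down, 0) @ self.up

class Bridge:
    """Linear maps resolving the dimension mismatch so the SLM-trained
    adapter can be plugged into the larger model (d_large features)."""
    def __init__(self, d_large, d_small):
        self.proj_in = rng.normal(0, 0.02, (d_large, d_small))
        self.proj_out = rng.normal(0, 0.02, (d_small, d_large))
    def __call__(self, adapter, h_large):        # h_large: (seq, d_large)
        return h_large + adapter(h_large @ self.proj_in) @ self.proj_out
```

  • At inference, only the (frozen) adapter and Bridge run alongside the LLM; no gradients ever flow through the large model, which is the source of the memory savings.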
  • The research highlights the potential of combining the capabilities of small and large models, indicating that substantial task-specific similarities exist across models of different sizes. This method opens avenues for leveraging expansive LLMs more efficiently and economically, presenting a promising direction for future large-scale model tuning.
Mathematical Discoveries from Program Search with Large Language Models
  • This paper by Romera-Paredes et al. from Google DeepMind in Nature, introduces FunSearch, a novel approach combining a pre-trained Large Language Model (LLM) with a systematic evaluator to make scientific discoveries. It addresses the issue of LLMs producing plausible but incorrect statements, a hindrance in scientific discovery. FunSearch, standing for “searching in the function space,” evolves initial low-scoring programs into high-scoring ones, surpassing existing LLM-based methods. Its effectiveness is demonstrated in solving problems in extremal combinatorics, notably the cap set problem, and online bin packing, yielding new heuristics surpassing widely-used baselines. FunSearch not only finds solutions but generates programs describing the solution process, enhancing interpretability and application in real-world scenarios.
  • FunSearch operates using an evolutionary method. It evolves a crucial part of an algorithm, the ‘priority function,’ within a given skeleton containing boilerplate code and prior knowledge. This approach focuses LLM resources on evolving critical logic parts, enhancing performance. A user provides a problem description in code, including an evaluation procedure and a seed program. The system iteratively selects and improves programs, feeding them to an LLM, which generates new solutions. The highest-scoring programs are continuously added back, creating a self-improving loop. The use of Codey, an LLM based on the PaLM2 model family, is pivotal, chosen for its balance between sample quality and inference speed. FunSearch’s effectiveness isn’t highly sensitive to the choice of LLM as long as it’s trained on a sufficient code corpus. This demonstrates the applicability of pre-trained models without fine-tuning for specific problems. This method introduces multiple key components, such as utilizing common knowledge and focusing on critical ideas, ensuring idea diversity, and parallel processing to improve efficiency and avoid stagnation.
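  • The evolutionary loop described above can be sketched as follows. Here `llm_propose` is a stand-in for the LLM call (in FunSearch, a code-generating model such as Codey), programs are reduced to a single integer parameter, and `evaluate` is a toy objective; everything is a hedged illustration of the select-propose-evaluate-store cycle, not the actual system:

```python
import random

def evaluate(program):
    """Problem-specific evaluator: higher is better (toy objective)."""
    return -abs(program - 42)

def llm_propose(best_programs, rng):
    """Stand-in for the LLM: perturb one of the best programs so far."""
    base = rng.choice(best_programs)
    return base + rng.randint(-5, 5)

def funsearch(seed_program, iterations=200, pool_size=5, seed=0):
    rng = random.Random(seed)
    database = [(evaluate(seed_program), seed_program)]   # programs database
    for _ in range(iterations):
        # show the LLM a selection of the best programs found so far
        best = [p for _, p in sorted(database, reverse=True)[:pool_size]]
        candidate = llm_propose(best, rng)
        # execute/evaluate the proposal and add it back to the database
        database.append((evaluate(candidate), candidate))
    return max(database)  # (score, program) with the highest score

score, program = funsearch(seed_program=0)
```

  • Because high-scoring programs are fed back as context for the next proposals, the loop is self-improving: each cycle conditions the generator on the best solutions discovered so far.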
  • The figure below from the paper illustrates the FunSearch process. The LLM is shown a selection of the best programs it has generated so far (retrieved from the programs database), and asked to generate an even better one. The programs proposed by the LLM are automatically executed, and evaluated. The best programs are added to the database, for selection in subsequent cycles. The user can at any point retrieve the highest-scoring programs discovered so far.

  • The cap set problem in extremal combinatorics, an open challenge in mathematics, was tackled using FunSearch. FunSearch identified new cap sets in 8 dimensions, larger than any previously known in that dimension, representing a significant leap in the field. This demonstrated its scalability and ability to generate novel constructions from scratch. Additionally, it found new algorithms for the online bin packing problem that outperformed standard heuristics like first fit and best fit. FunSearch’s heuristic, starting from best fit, was evolved and tested against benchmarks, showing superior performance and generalization across different problem sizes.
  • The figure below from the paper illustrates: (Left) Inspecting code generated by FunSearch yielded further actionable insights (highlights added by us). (Right) The raw “admissible” set constructed using the (much shorter) program on the left.

  • FunSearch generated concise, interpretable programs with low Kolmogorov complexity that scale to larger problem instances than traditional approaches, while offering insights for further scientific exploration. It is best suited for problems with an efficient evaluator, rich scoring feedback, and a component amenable to evolution. The approach reveals new insights, such as discovering symmetries in problem solutions, and is envisioned to be extended to a broader range of problems, benefiting from the rapid development of LLMs. FunSearch’s potential extends beyond theoretical problems to practical applications in fields like communication theory and industrial systems.
  • Demo; Nature
Gaussian Error Linear Units (GELUs)
  • This paper by Hendrycks and Gimpel from UC Berkeley and Toyota Technological Institute introduces the Gaussian Error Linear Unit (GELU), a novel neural network activation function that improves performance across a range of computer vision, natural language processing, and speech tasks. The GELU function, denoted as \(x\Phi(x)\), combines input values with their standard Gaussian cumulative distribution, offering a probabilistic approach to activation by weighting inputs by their magnitude rather than their sign, which is a departure from ReLUs.
  • The GELU activation is presented as an expectation of a modification to Adaptive Dropout, proposing a more statistical view of neuron output. It was empirically tested against ReLU and ELU activations, displaying superior or matching performance in several domains including MNIST classification and autoencoding, Twitter part-of-speech tagging, TIMIT frame classification, and CIFAR-10/100 classification.
  • The paper outlines the GELU’s mathematical foundation, emphasizing its probabilistic interpretation and input-dependent stochastic regularizing effect. A significant aspect of GELU is its formulation to introduce no additional hyperparameters, using a standardized setting (\(\mu = 0, \sigma = 1\)) for all experiments, simplifying its application.
  • Implementation details include employing Adam optimizer and suggesting the use of approximations like \(0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])\) for computational efficiency without sacrificing accuracy. The authors also discuss practical tips for deploying GELUs, such as momentum optimization and choosing effective approximations of the Gaussian CDF.
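  • The exact GELU and the paper's tanh approximation can be written directly from the formulas above; over a typical activation range the two agree to within roughly \(10^{-3}\):

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF
    (Phi(x) = 0.5 * (1 + erf(x / sqrt(2))))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The paper's approximation:
    0.5 * x * (1 + tanh[sqrt(2/pi) * (x + 0.044715 * x^3)])."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

  • For example, \(\text{GELU}(1) = \Phi(1) \approx 0.8413\), i.e., the input is scaled by the probability that a standard Gaussian is below it, rather than being hard-gated by sign as in ReLU.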
  • The figure below from the paper shows the log loss on TIMIT Frame Classification. Learning curves show training set convergence, and the lighter curves show the validation set convergence.

  • The empirical evaluation covered diverse tasks: for MNIST classification, GELUs achieved lower median training log loss compared to ELUs and ReLUs, both with and without dropout. In MNIST autoencoding, GELUs significantly outperformed other nonlinearities across different learning rates. For Twitter POS tagging and TIMIT frame classification, GELUs demonstrated comparable or superior generalization from limited data. CIFAR-10/100 classification results showed GELUs leading to lower error rates in both shallow and deep network architectures.
  • The paper concludes that GELUs consistently outperform ELUs and ReLUs across various tasks and datasets, making them a strong alternative for neural network activation functions. Additionally, it highlights the probabilistic interpretation and input-responsive regularization as key advantages of GELUs over traditional activation functions.
  • Code
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
  • This paper by Luo et al. from MSR, AI4Science, and Peking University introduces BioGPT, a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. Unlike previous biomedical pre-trained models such as BioBERT and PubMedBERT, which focus on language understanding tasks, BioGPT is designed to also excel at generation tasks, addressing a key limitation of those models.
  • BioGPT, leveraging a Transformer language model backbone, is pre-trained from scratch on 15 million PubMed abstracts. It diverges from typical approaches by adopting a domain-specific training corpus and vocabulary, using Byte Pair Encoding (BPE) for efficient text representation.
  • The paper presents BioGPT’s evaluation across six biomedical NLP tasks, including end-to-end relation extraction on BC5CDR, KD-DTI, DDI datasets, question answering on PubMedQA, document classification on HoC, and text generation. BioGPT sets new performance benchmarks on most tasks, particularly excelling with F1 scores of 44.98% on BC5CDR, 38.42% on KD-DTI, 40.76% on DDI, and achieving 78.2% accuracy on PubMedQA.
  • Significant attention is given to the methodological innovation of BioGPT, including the design of target sequence formats and prompts for fine-tuning on downstream tasks. The model’s adaptability is showcased through different strategies for converting task-specific labels into sequences that align with its pre-training on natural language, demonstrating superior performance compared to structured prompts used in previous works.
  • The study also experiments with scaling BioGPT to a larger model size, BioGPT-Large, based on the GPT-2 XL architecture, further exploring its potential in enhancing biomedical text mining and generation capabilities.
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
  • This paper by Wu et al. from Microsoft Research, Pennsylvania State University, University of Washington, and Xidian University, introduces AutoGen, an open-source framework designed to facilitate the development of multi-agent large language model (LLM) applications. The framework allows the creation of customizable, conversable agents that can operate in various modes combining LLMs, human inputs, and tools.
  • AutoGen agents can be easily programmed using both natural language and computer code to define flexible conversation patterns for different applications. The framework supports hierarchical chat, joint chat, and other conversation patterns, enabling agents to converse and cooperate to solve tasks. The agents can hold multiple-turn conversations with other agents or solicit human inputs, enhancing their ability to solve complex tasks.
  • The figure below from the paper illustrates the fact that AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left) AutoGen agents are conversable, customizable, and can be based on LLMs, tools, humans, or even a combination of them. (Top-middle) Agents can converse to solve tasks. (Right) They can form a chat, potentially with humans in the loop. (Bottom-middle) The framework supports flexible conversation patterns.

  • Key technical details include the design of conversable agents and conversation programming. Conversable agents can send and receive messages, maintain internal context, and be configured with various capabilities such as LLMs, human inputs, and tools. These agents can also be extended to include more custom behaviors. Conversation programming involves defining agent roles and capabilities and programming their interactions using a combination of natural and programming languages. This approach simplifies complex workflows into intuitive multi-agent conversations.
  • Implementation details:
    1. Conversable Agents: AutoGen provides a generic design for agents, enabling them to leverage LLMs, human inputs, tools, or a combination. The agents can autonomously hold conversations and solicit human inputs at certain stages. Developers can easily create specialized agents with different roles by configuring built-in capabilities and extending agent backends.
    2. Conversation Programming: AutoGen adopts a conversation programming paradigm to streamline LLM application workflows. This involves defining conversable agents and programming their interactions via conversation-centric computation and control. The framework supports various conversation patterns, including static and dynamic flows, allowing for flexible agent interactions.
    3. Unified Interfaces and Auto-Reply Mechanisms: Agents in AutoGen have unified interfaces for sending, receiving, and generating replies. An auto-reply mechanism enables conversation-driven control, where agents automatically generate and send replies based on received messages unless a termination condition is met. Custom reply functions can also be registered to define specific behavior patterns.
    4. Control Flow: AutoGen allows control over conversations using both natural language and programming languages. Natural language prompts guide LLM-backed agents, while Python code specifies conditions for human input, tool execution, and termination. This flexibility supports diverse multi-agent conversation patterns, including dynamic group chats managed by the GroupChatManager class.
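  • The conversable-agent interface and auto-reply loop described above can be distilled into a minimal self-contained sketch. This illustrates the paradigm (unified send/receive/generate_reply, internal context, termination condition), not AutoGen's actual API; the reply functions stand in for an LLM backend or human input:

```python
class ConversableAgent:
    """Unified send/receive/generate_reply interface with an auto-reply
    loop (a sketch of the paradigm, not AutoGen's API)."""
    def __init__(self, name, reply_fn, max_turns=6):
        self.name, self.reply_fn, self.max_turns = name, reply_fn, max_turns
        self.history = []                         # internal context

    def send(self, message, recipient):
        recipient.receive(message, sender=self)

    def receive(self, message, sender):
        self.history.append((sender.name, message))
        if len(self.history) >= self.max_turns or message == "TERMINATE":
            return                                # termination condition met
        # auto-reply: generate and send a reply back to the sender
        self.send(self.generate_reply(message), recipient=sender)

    def generate_reply(self, message):
        return self.reply_fn(message, self.history)

# Two-agent system: an "assistant" and a "user proxy" that stops after
# receiving an answer (both reply functions are illustrative stand-ins).
assistant = ConversableAgent("assistant", lambda m, h: f"answer to: {m}")
user = ConversableAgent("user", lambda m, h: "TERMINATE")
user.send("plot stock prices", assistant)
```

  • The conversation ends when the user proxy's reply hits the termination condition; in the real framework, the reply functions would be backed by LLM calls, tool execution, or human input, and custom reply functions can be registered for specific behaviors.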
  • The figure below from the paper illustrates how to use AutoGen to program a multi-agent conversation. The top sub-figure illustrates the built-in agents provided by AutoGen, which have unified conversation interfaces and can be customized. The middle sub-figure shows an example of using AutoGen to develop a two-agent system with a custom reply function. The bottom sub-figure illustrates the resulting automated agent chat from the two-agent system during program execution.

  • The paper details the framework’s architecture, where agents are defined with specific roles and capabilities, interacting through structured conversations to process tasks efficiently. This approach improves task performance, reduces development effort, and enhances application flexibility. The significant technical aspects include using a unified interface for agent interaction, conversation-centric computation for defining agent behaviors, and conversation-driven control flows that manage the sequence of interactions among agents.
  • Applications demonstrate AutoGen’s capabilities in various domains, such as:
    • Math Problem Solving: AutoGen builds systems for autonomous and human-in-the-loop math problem solving, outperforming other approaches on the MATH dataset.
    • Retrieval-Augmented Code Generation and Question Answering: The framework enhances retrieval-augmented generation systems, improving performance on question-answering tasks through interactive retrieval mechanisms.
    • Decision Making in Text World Environments: AutoGen implements effective interactive decision-making applications using benchmarks like ALFWorld.
    • Multi-Agent Coding: The framework simplifies coding tasks by dividing responsibilities among agents, improving code safety and efficiency.
    • Dynamic Group Chat: AutoGen supports dynamic group chats, enabling collaborative problem-solving without predefined communication orders.
    • Conversational Chess: The framework creates engaging chess games with natural language interfaces, ensuring valid moves through a board agent.
  • The empirical results indicate that AutoGen significantly outperforms existing single-agent and some multi-agent systems in complex task environments by effectively integrating and managing multiple agents’ capabilities. The paper includes a figure illustrating the use of AutoGen to program a multi-agent conversation, showing built-in agents, a two-agent system with a custom reply function, and the resulting automated agent chat.
  • The authors highlight the potential for AutoGen to improve LLM applications by reducing development effort, enhancing performance, and enabling innovative uses of LLMs. Future work will explore optimal multi-agent workflows, agent capabilities, scaling, safety, and human involvement in multi-agent conversations. The open-source library invites contributions from the broader community to further develop and refine AutoGen.
Towards Expert-Level Medical Question Answering with Large Language Models
  • This paper by Singhal et al. from Google Research and DeepMind introduces Med-PaLM 2, a large language model (LLM) significantly advancing the field of medical question answering. The model builds on the previous Med-PaLM, incorporating improvements from the base model PaLM 2, specialized medical domain fine-tuning, and novel prompting strategies, including ensemble refinement.
  • Med-PaLM 2 notably scored up to 86.5% on the MedQA dataset, surpassing the previous model by over 19%, and demonstrated competitive performance on MedMCQA, PubMedQA, and MMLU clinical topics, often reaching or exceeding state-of-the-art results.
  • A novel component of Med-PaLM 2’s development is the Ensemble Refinement (ER) prompting strategy, which involves generating multiple reasoning paths from the model, then refining these into a single, more accurate response. This method leveraged chain-of-thought and self-consistency approaches to enhance reasoning capabilities.
  • ER involves a two-stage process: first, given a (few-shot) chain-of-thought prompt and a question, the model stochastically produces multiple possible generations via temperature sampling; each generation consists of an explanation and an answer to a multiple-choice question. Then, the model is conditioned on the original prompt, the question, and the concatenated generations from the first stage, and is prompted to produce a refined explanation and answer. This can be interpreted as a generalization of self-consistency: instead of taking a simple vote, the LLM aggregates over the first-stage answers, allowing it to take into account the strengths and weaknesses of the explanations it generated. To further improve performance, the second stage is performed multiple times, and a plurality vote over the resulting answers determines the final answer. The figure below from the paper illustrates ER with Med-PaLM 2. In this approach, an LLM is conditioned on multiple possible reasoning paths that it generates, enabling it to refine and improve its answer.

  • The model’s efficacy was extensively tested through various benchmarks and human evaluations, comparing its performance to that of practicing physicians across multiple axes, such as factual accuracy, medical knowledge recall, and reasoning. In tests involving 1066 consumer medical questions, Med-PaLM 2’s responses were preferred over those from human physicians in the majority of cases, especially in terms of reflecting medical consensus and reducing the likelihood of harm.
  • Despite its successes, the paper notes the need for ongoing validation in real-world settings, stressing that while Med-PaLM 2 represents a significant advance in medical LLMs, further research is essential to optimize its practical application and ensure safety in clinical environments.
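  • The two-stage Ensemble Refinement procedure described above can be sketched with a stand-in model; the sampling probabilities, prompt strings, and `toy_model` below are illustrative, not Med-PaLM 2's actual setup:

```python
# Toy sketch of Ensemble Refinement (ER): sample several (explanation,
# answer) generations, refine conditioned on all of them several times,
# then take a plurality vote over the refined answers.
import random
from collections import Counter

def ensemble_refine(model, question, n_samples=5, n_refine=3, seed=0):
    rng = random.Random(seed)
    # Stage 1: stochastically sample multiple generations via temperature sampling.
    generations = [model(question, temperature=0.7, rng=rng)
                   for _ in range(n_samples)]
    # Stage 2: condition on the question plus the concatenated generations,
    # several times, then plurality-vote over the refined answers.
    context = question + "\n" + "\n".join(f"{expl} -> {ans}"
                                          for expl, ans in generations)
    refined = [model(context, temperature=0.7, rng=rng)[1]
               for _ in range(n_refine)]
    return Counter(refined).most_common(1)[0][0]

# Stand-in model: answers "B" with probability 0.8, else "A".
def toy_model(prompt, temperature, rng):
    answer = "B" if rng.random() < 0.8 else "A"
    return (f"reasoning sampled at T={temperature}", answer)

final = ensemble_refine(toy_model, "Which vitamin helps regulate blood clotting?")
```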
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
  • This paper by Chen et al. from the University of Wisconsin-Madison and Google, published in EMNLP 2023, introduces a novel framework named ASPIRE (Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs) aimed at enhancing the reliability of Large Language Models (LLMs) in high-stakes decision-making by improving their selective prediction performance. Selective prediction allows LLMs to abstain from answering when unsure, critical for applications requiring high reliability.
  • ASPIRE utilizes parameter-efficient tuning for task-specific adaptation of LLMs, significantly improving their self-evaluation ability to judge the correctness of their answers. This is achieved without requiring the generation of multiple outputs for uncertainty estimation, thus reducing computational costs and latency.
  • The methodology involves two key steps: First, task-specific tuning to adapt LLMs to particular tasks, followed by self-evaluation learning where the LLM learns to distinguish between correct and incorrect answers using additional adaptable parameters. This process relies on the generation of answers with varied likelihoods for comprehensive learning.
  • The following figure from the paper shows a safety-critical question from the TriviaQA dataset: “Which vitamin helps regulate blood clotting?” The OPT-2.7B model incorrectly answers “Vitamin C” when the correct answer is “Vitamin K”. Without selective prediction, LLMs directly output the wrong answer, which in this case could lead users to take the wrong medicine, causing potential harm. With selective prediction, LLMs output a low selection score along with the wrong answer and can further output “I don’t know!” to warn users not to trust it or to verify it using other sources.

  • The following figure from the paper shows that in the proposed framework ASPIRE, they first perform task specific tuning to train adaptable parameters \(\theta_p\) while freezing the LLM. Then, they use the LLM with the learned \(\theta_p\) to generate different answers for each training question to create a dataset for self-evaluation learning. Finally, they train the adaptable parameters \(\theta_s\) to learn self-evaluation using the created dataset while freezing the LLM and the learned \(\theta_p\).

  • Extensive experiments demonstrate ASPIRE’s superior performance over state-of-the-art selective prediction methods across multiple question-answering datasets, notably improving the AUACC (Area Under the Accuracy-Coverage Curve) and AUROC (Area Under the Receiver Operating Characteristic curve) metrics on benchmarks like CoQA, TriviaQA, and SQuAD with various LLMs including OPT and GPT-2 models.
  • The paper also explores the impacts of different decoding algorithms for answer generation within the ASPIRE framework, revealing that sampling diverse high-likelihood answers is crucial for achieving optimal selective prediction performance.
  • Implementation details reveal the use of soft prompt tuning for adaptable parameter learning, indicating the practical applicability and efficiency of ASPIRE in enhancing LLMs for selective prediction, particularly in settings where computational resources are limited or when high selective prediction performance is desired with minimal inference costs.
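  • At inference time, the selective-prediction behavior above reduces to thresholding the learned selection score; a minimal sketch, with illustrative scores and threshold (not ASPIRE's learned parameters):

```python
# Minimal sketch of selective prediction with a self-evaluation score,
# in the spirit of ASPIRE; scores and threshold here are illustrative.

def selective_predict(answer, selection_score, threshold=0.5):
    """Surface the answer only when the self-evaluation score clears the
    threshold; otherwise abstain so the user can verify elsewhere."""
    if selection_score >= threshold:
        return answer
    return "I don't know!"

# A high-scoring prediction is returned; a low-scoring one is withheld.
confident = selective_predict("Vitamin K", 0.91)
unsure = selective_predict("Vitamin C", 0.12)
```

  Sweeping the threshold trades coverage (how often the model answers) against accuracy on answered questions, which is exactly what the AUACC metric summarizes.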
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
  • This paper by Zhou et al. from Google Research introduces a novel prompting method called least-to-most prompting, aimed at enhancing problem-solving capabilities in large language models like GPT-3. This method, which draws from educational psychology, involves decomposing complex problems into simpler, sequential subproblems, leveraging answers from prior subproblems to facilitate the solving of subsequent ones.
  • The implementation of least-to-most prompting does not require model training or finetuning. It is executed entirely through few-shot prompting, demonstrated effectively in tasks like symbolic manipulation, compositional generalization, and mathematical reasoning.
  • The following figure from the paper shows least-to-most prompting solving a math word problem in two stages: (1) query the language model to decompose the problem into subproblems; (2) query the language model to sequentially solve the subproblems. The answer to the second subproblem is built on the answer to the first subproblem. The demonstration examples for each stage’s prompt are omitted in this illustration.

  • In empirical evaluations, least-to-most prompting significantly outperformed traditional chain-of-thought prompting, especially in handling tasks that required generalization from simple to complex problem-solving scenarios. For example, on the compositional generalization benchmark SCAN, least-to-most prompting achieved 99% accuracy with just 14 examples, compared to 16% accuracy with traditional methods.
  • Notably, the method has shown remarkable efficiency in length generalization tasks, where it maintained high performance as the complexity of test cases increased, unlike other prompting methods that exhibited steep performance declines.
  • The paper also discusses the integration of this method with other prompting techniques like self-consistency, highlighting its flexibility and effectiveness in diverse problem-solving contexts without additional computational costs or complex model modifications.
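  • The two-stage decompose-then-solve loop described above can be sketched with a scripted stand-in LLM; the prompt strings and `toy_llm` below are illustrative, not the paper's actual few-shot prompts:

```python
# Toy sketch of least-to-most prompting: a stand-in `llm` decomposes the
# problem, then each subproblem is answered with prior answers appended
# to the context, so later subproblems build on earlier ones.

def least_to_most(llm, problem):
    # Stage 1: ask the model to decompose the problem into subproblems.
    subproblems = [s.strip() for s in llm(f"Decompose: {problem}").split(";")]
    # Stage 2: solve subproblems sequentially, feeding earlier answers back in.
    context = problem
    answer = None
    for sub in subproblems:
        answer = llm(f"{context}\nQ: {sub}")
        context += f"\nQ: {sub} A: {answer}"
    return answer  # the last subproblem's answer solves the original problem

# Scripted stand-in LLM for one word problem.
def toy_llm(prompt):
    if prompt.startswith("Decompose:"):
        return ("How long does it take to cut one piece?; "
                "How many pieces fit in 4 minutes?")
    if "one piece" in prompt.splitlines()[-1]:
        return "2 minutes"
    return "2 pieces"

result = least_to_most(toy_llm, "It takes 4 minutes total; each cut takes 2 minutes.")
```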
BitNet: Scaling 1-bit Transformers for Large Language Models
  • In the paper “BitNet: Scaling 1-bit Transformers for Large Language Models” by Wang et al., the authors introduce BitNet, a novel 1-bit transformer architecture designed to address the challenges of deploying increasingly large language models (LLMs). Traditionally, LLMs are loaded into memory with each parameter stored as a float16 type, significantly increasing memory usage—for instance, a 7B model takes up 14GB of space (assuming half-precision/16-bit floats). BitNet addresses this issue by training LLMs with reduced precision from the outset, as opposed to conventional methods where quantization occurs post-training.
  • BitNet employs a new component called BitLinear—a drop-in replacement for the nn.Linear layer—to train 1-bit weights from scratch, effectively reducing the precision of each parameter and thus the overall memory footprint. This method significantly cuts down on memory usage and energy consumption while maintaining competitive performance with traditional FP16 Transformers and outperforming state-of-the-art 8-bit quantization methods.
  • A particularly interesting aspect of BitNet is its reduced computational requirements, since large LLMs involve millions of matrix multiplications. Multiplying a weight matrix by an input vector requires both multiplications and additions; if the weights are binarized to ±1, the multiplications disappear and the matrix-vector product reduces to additions and subtractions.
  • The implementation of BitNet involves replacing linear projections in the transformer architecture with BitLinear layers. BitLinear functions like a linear layer but utilizes 1-bit weights, enhancing computational efficiency. Crucially, since BitNet performs dequantization as execution moves from one layer to another, it ensures that other components remain high-precision to preserve model quality.
  • This straightforward implementation strategy, coupled with novel quantization techniques for both weights (using the signum function for binarization) and activations (using an “absmax” approach for scaling), allows BitNet to efficiently manage the scaling of large models. The architecture adheres to scaling laws similar to those of full-precision transformers, as demonstrated through comparative analyses with FP16 transformers, showing that BitNet achieves similar or better performance while consuming significantly less energy.
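  • The weight binarization and activation scaling described above can be sketched in numpy; the constants and layout below are simplified relative to the paper and are illustrative only:

```python
# Rough numpy sketch of BitLinear-style quantization: signum binarization
# of mean-centered weights with a scaling factor, plus absmax quantization
# of activations, per the description above.
import numpy as np

def bitlinear(x, W, bits=8):
    # Binarize weights to +/-1 around their mean; keep a scalar scale
    # beta so output magnitudes roughly match the full-precision layer.
    W_centered = W - W.mean()
    W_bin = np.where(W_centered >= 0, 1.0, -1.0)
    beta = np.abs(W_centered).mean()
    # Absmax-quantize activations to b-bit integers.
    Q = 2 ** (bits - 1) - 1
    gamma = np.abs(x).max()
    x_q = np.clip(np.round(x * Q / gamma), -Q, Q)
    # With +/-1 weights, this matmul needs only additions and subtractions.
    y = x_q @ W_bin.T
    # Dequantize the output back to the activation scale.
    return y * (beta * gamma / Q)

x = np.array([1.0, -2.0, 0.5])
W = np.random.default_rng(0).normal(size=(4, 3))  # 4 outputs, 3 inputs
y = bitlinear(x, W)
```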
  • The following figure from the paper shows: (a) The computation flow of BitLinear. (b) The architecture of BitNet, consisting of the stacks of attentions and FFNs, where matrix multiplication is implemented as BitLinear.

  • Overall, BitNet advances the efficiency and scalability of large language models, offering a promising solution to the cost and environmental concerns associated with current large-scale neural networks, without sacrificing the robust performance essential for practical applications.


Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty
  • This paper by Zhou et al. from Stanford University, USC, CMU, and Allen Institute for AI investigates the impact of language models’ (LMs) reluctance to express uncertainty in their responses. This reluctance leads to overconfidence, which can have significant implications for human-LM interactions.
  • The study examines how LMs, including GPT, LLaMA-2, and Claude, communicate uncertainties and how users respond to these LM-articulated uncertainties. It was found that LMs rarely express uncertainties, even when incorrect, and tend to overuse expressions of certainty, leading to high error rates.
  • The authors conducted human experiments to assess the impact of LM overconfidence. Results showed that users heavily rely on LM-generated responses, whether marked by certainty or not. Additionally, even minor miscalibrations in how a model uses epistemic markers can lead to long-term harms in human performance.
  • An investigation into the origins of model overconfidence identified reinforcement learning with human feedback (RLHF) as a key contributing factor. The study revealed a bias against expressions of uncertainty in the preference-annotated datasets used in RLHF alignment.
  • The paper highlights the risks posed by overconfidence in LMs and proposes design recommendations to mitigate these risks. It suggests designing LMs to verbalize uncertainty more effectively and counteract human biases in interpreting uncertainties.
  • This work exposes new safety concerns in human-LM collaborations and underscores the need for more nuanced communication strategies in LMs, particularly regarding expressing uncertainty and confidence.
Matryoshka Representation Learning
  • The paper by Kusupati et al. from UW introduces Matryoshka Representation Learning (MRL), a novel approach for adaptive and efficient representation learning. This technique, adopted in OpenAI’s latest embedding update, text-embedding-3-large, is characterized by its ability to encode information at multiple granularities within a single high-dimensional vector. Drawing an analogy from the Russian Matryoshka dolls, MRL encapsulates details at various levels within a single embedding structure, allowing for adaptability to the computational and statistical needs of different tasks.
  • The essence of MRL lies in its ability to create coarse-to-fine representations, where earlier dimensions in the embedding vector store more crucial information, and subsequent dimensions add finer details. You can understand how this works by analogy with classifying an image at multiple resolutions: the lower resolutions give high-level information and the higher resolutions add finer details. Human perception of the natural world likewise has a naturally coarse-to-fine granularity, as shown in the animation below.

  • MRL achieves this by modifying the loss function in the model, where the total loss is the sum of losses computed over nested prefixes of the embedding vector: \(Loss_{Total} = L(\text{up to 8d}) + L(\text{up to 16d}) + L(\text{up to 32d}) + \ldots + L(\text{up to 2048d})\). As a result, MRL incentivizes the model to capture essential information in each subsection of the vector. Notably, this technique allows for the use of any subset of the embedding dimensions, offering flexibility beyond fixed dimension slices like 8, 16, 32, etc.
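  • The nested objective can be sketched as follows; the squared-error "loss" and prefix sizes here are placeholders for illustration, not the paper's actual training setup:

```python
# Illustrative sketch of the nested MRL objective: the same loss is applied
# to each prefix of the embedding and summed, so earlier dimensions are
# pushed to carry the most important information.
import numpy as np

def matryoshka_loss(embedding, target, loss_fn, dims=(8, 16, 32, 64)):
    total = 0.0
    for d in dims:
        # Each term only sees the first d dimensions.
        total += loss_fn(embedding[:d], target[:d])
    return total

mse = lambda a, b: float(np.mean((a - b) ** 2))
rng = np.random.default_rng(0)
z, t = rng.normal(size=64), rng.normal(size=64)
loss = matryoshka_loss(z, t, mse)
```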
  • The figure below from the paper shows that MRL is adaptable to any representation learning setup and begets a Matryoshka Representation \(z\) by optimizing the original loss \(L(.)\) at \(O(\log(d))\) chosen representation sizes. Matryoshka Representation can be utilized effectively for adaptive deployment across environments and downstream tasks.

  • MRL’s adaptability extends to a wide range of modalities, including vision, vision+language, and language models (such as ViT, ResNet, ALIGN, and BERT). The method has shown remarkable results in various applications, such as adaptive classification and retrieval, robustness evaluations, few-shot and long-tail learning, and analyses of model disagreement. In practical terms, MRL facilitates up to 14x smaller embedding sizes for tasks like ImageNet-1K classification without compromising accuracy, up to 14x real-world speed-ups for large-scale retrieval, and up to 2% accuracy improvements in long-tail few-shot classification.
  • One of the striking outcomes of using MRL is demonstrated in OpenAI’s text-embedding-3-large model, which, when trimmed to 256 dimensions, outperforms the full-sized text-embedding-ada-002 with 1536 dimensions on the MTEB benchmark. This indicates a significant reduction in size (to about 1/6th) while maintaining or even enhancing performance.
  • Importantly, MRL integrates seamlessly with existing representation learning pipelines, requiring minimal modifications and imposing no additional costs during inference and deployment. Its flexibility and efficiency make it a promising technique for handling web-scale datasets and tasks. OpenAI has made the pretrained models and code for MRL publicly available, underlining the method’s potential as a game-changer in the field of representation learning.
  • Code; OpenAI Blog
Self-Refine: Iterative Refinement with Self-Feedback
  • Like humans, large language models (LLMs) do not always generate the best output on their first try.
  • This paper by Madaan et al. from CMU, Allen AI, UW, NVIDIA, UC San Diego, and Google Research introduces a novel approach for enhancing outputs from large language models (LLMs) like GPT-3.5 and GPT-4 through self-generated iterative feedback and refinement, without the need for additional training data or supervised learning – similar to how humans refine their written text.
  • Put simply, the main idea is to generate an initial output using an LLM; the same LLM then provides feedback on its output and uses that feedback to refine itself, iteratively. This process repeats until a predefined condition is met. Self-Refine does not require any supervised training data, additional training, or reinforcement learning; instead, it uses a single LLM as the generator, refiner, and feedback provider.
  • The figure below from the paper shows that, given an input (step 0), Self-Refine starts by generating an output and passing it back to the same model M to get feedback (step 1). The feedback is passed back to M, which refines the previously generated output (step 2). Steps 1 and 2 iterate until a stopping condition is met. Self-Refine is instantiated with a language model such as GPT-3.5 and does not involve human assistance.

  • The approach is evaluated across seven diverse tasks, including dialogue response and code optimization, demonstrating significant improvements over conventional one-step generation methods. This method leverages few-shot prompting for guiding the LLM to generate feedback and incorporate it for output refinement.
  • The results show that Self-Refine significantly enhances output quality in terms of human preference and task-specific metrics, indicating its potential to improve LLM-generated content across a range of applications.
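  • The generate-feedback-refine loop can be sketched with a single stand-in model; the prompt strings, stopping phrase, and `toy_model` below are illustrative, not the paper's actual few-shot prompts:

```python
# Sketch of the Self-Refine loop described above: one stand-in model
# generates, critiques, and refines its own output until the feedback
# approves or an iteration cap is hit.

def self_refine(model, task, max_iters=4):
    output = model(f"Generate: {task}")
    for _ in range(max_iters):
        feedback = model(f"Feedback on: {output}")
        if "looks good" in feedback:  # stopping condition
            break
        output = model(f"Refine: {output}\nUsing feedback: {feedback}")
    return output

# Toy model that improves a draft once, then approves it.
def toy_model(prompt):
    if prompt.startswith("Generate:"):
        return "draft v1"
    if prompt.startswith("Feedback"):
        return "looks good" if "v2" in prompt else "tighten the wording"
    return "draft v2"

final = self_refine(toy_model, "write a summary")
```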
  • Code
The Claude 3 Model Family: Opus, Sonnet, Haiku
  • Anthropic has introduced Claude 3, comprising Claude 3 Opus (most capable), Sonnet (balance of skills and speed), and Haiku (fastest and most cost-effective). These models incorporate vision capabilities for processing and analyzing image data, establishing new industry benchmarks in reasoning, math, coding, multilingual understanding, and vision quality.
  • Claude 3 Opus, lau