Papers List

  • A curated set of papers I’ve reviewed for my latest scoop in AI/ML.

Seminal Papers / Need-to-know

Computer Vision


Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
  • This paper by Gutmann and Hyvarinen in AISTATS 2010 introduced the concept of negative sampling that forms the basis of contrastive learning.
  • They propose a new estimation principle for parameterized statistical models, noise-contrastive estimation, which performs nonlinear logistic regression to discriminate between the observed data and artificially generated noise, using the model log-density function in the regression nonlinearity. They show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance.
  • In particular, the method is shown to directly work for unnormalized models, i.e., models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter.
  • For a tractable ICA model, they compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling.
  • Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency.
  • They apply the method to the modeling of natural images and show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
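
The logistic-regression view described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 1D unnormalized Gaussian model, the `N(0, 2^2)` noise distribution, and all function names are assumptions made here. Note that the normalizer `log_norm` is passed in like any other parameter.

```python
import math
import random

def log_model(x, precision, log_norm):
    # Unnormalized Gaussian log-density plus a learnable log-normalizer (assumed toy model).
    return -0.5 * precision * x * x + log_norm

def log_noise(x, scale):
    # Log-density of the N(0, scale^2) noise distribution (assumed choice of noise).
    return -0.5 * (x / scale) ** 2 - math.log(scale * math.sqrt(2.0 * math.pi))

def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def nce_loss(data, noise, precision, log_norm, noise_scale=2.0):
    # Logistic regression whose logit is log p_model(x) - log p_noise(x):
    # data points should be classified as "data", noise points as "noise".
    loss = 0.0
    for x in data:
        logit = log_model(x, precision, log_norm) - log_noise(x, noise_scale)
        loss -= log_sigmoid(logit)
    for y in noise:
        logit = log_model(y, precision, log_norm) - log_noise(y, noise_scale)
        loss -= log_sigmoid(-logit)
    return loss / (len(data) + len(noise))
```

Minimizing `nce_loss` over both `precision` and `log_norm` (for example with gradient descent) would estimate the model and its normalization constant together.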


ImageNet Classification with Deep Convolutional Neural Networks
  • The original AlexNet paper by Krizhevsky et al. from NeurIPS 2012 that started it all. This trail-blazer was the first to apply deep supervised learning to the area of image classification.
  • They trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes.
  • On the test data, they achieved top-1 and top-5 error rates of 39.7% and 18.9% which was considerably better than the previous state-of-the-art results.
  • The neural network, which has 60 million parameters and 500,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax.
  • To make training faster, they used non-saturating neurons (ReLUs) and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally connected layers, they employed a then-new regularization method, dropout, that proved to be very effective.
  • The following figure from the paper shows an illustration of the architecture of their CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers.

3D Convolutional Neural Networks for Human Action Recognition
  • This paper by Ji et al. from ASU and NEC Labs in IEEE PAMI 2012 introduced 3D CNNs.
  • Their problem statement is the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled.
  • Convolutional neural networks (CNNs) are a type of deep model that can act directly on raw inputs, thus automating the process of feature construction. However, at the time such models were limited to handling 2D inputs. This paper develops a novel 3D CNN model for action recognition.
  • This model extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels.
  • They apply the developed model to recognize human actions in a real-world environment, and it achieves superior performance without relying on handcrafted features.


Visualizing and Understanding Convolutional Networks
  • This legendary paper by Zeiler and Fergus from the Courant Institute, NYU in 2013 starts from the observation that there was no clear understanding of why CNNs perform so well on image classification, or how they might be improved, and seeks to address both issues.
  • They introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
  • They also perform an ablation study to discover the performance contribution from different model layers. This enables them to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
  • They show their ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Learning Factored Representations in a Deep Mixture of Experts
  • Mixtures of Experts combine the outputs of several “expert” networks, each of which specializes in a different part of the input space. This is achieved by training a “gating” network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time.
  • This paper by Eigen et al. from Google and NYU Courant in 2013 extends the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size.
  • On a randomly translated version of the MNIST dataset, they find that the Deep Mixture of Experts automatically learns to develop location-dependent (“where”) experts at the first layer, and class-specific (“what”) experts at the second layer. In addition, they see that the different combinations are in use when the model is applied to a dataset of speech monophones. These demonstrate effective use of all expert combinations.
  • The figure below from the paper shows (a) Mixture of Experts; (b) Deep Mixture of Experts with two layers.
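
The gating idea above can be sketched as follows. This is a toy illustration under stated assumptions: the experts and gates are plain Python callables invented here, and the paper's actual networks are learned layers. Stacking two mixtures gives the two-layer "deep" variant, where the second gate sees the first layer's output.

```python
import math

def softmax(scores):
    # Turn raw gate scores into a probability distribution over experts.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate):
    # Output is the gate-weighted sum of the expert outputs.
    weights = softmax(gate(x))
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]

def deep_mixture_of_experts(x, layers):
    # Each layer is an (experts, gate) pair; layers are applied in sequence.
    for experts, gate in layers:
        x = mixture_of_experts(x, experts, gate)
    return x

# Hypothetical experts and gate, for illustration only.
def double(x):
    return [2.0 * v for v in x]

def negate(x):
    return [-v for v in x]

def uniform_gate(x):
    return [0.0, 0.0]  # equal scores, so both experts get weight 0.5

layer = ([double, negate], uniform_gate)
output = deep_mixture_of_experts([1.0, 2.0], [layer, layer])
```

With two experts per layer and two layers, each input is effectively routed through one of four expert combinations, which is the multiplicative growth the summary refers to.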


Generative Adversarial Networks
  • This paper by Goodfellow et al. from NeurIPS 2014 proposes a new framework called Generative Adversarial Networks (GANs) that estimates generative models via an adversarial process that corresponds to a zero-sum minimax two-player game. In this process, two models are simultaneously trained: a generative model \(G\) that captures the data distribution, and a discriminative model \(D\) that estimates the probability that a sample came from the training data rather than \(G\). The training procedure for \(G\) is to maximize the probability of \(D\) making a mistake. In the space of arbitrary functions \(G\) and \(D\), a unique solution exists, with \(G\) recovering the training data distribution and \(D\) equal to \(\frac{1}{2}\) everywhere. In the case where \(G\) and \(D\) are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
  • There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
  • Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
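
The two-player game described above is usually written as a single value function. The block below is a sketch of that standard formulation; the symbols \(p_{\text{data}}\), \(p_z\), \(p_g\) and the noise variable \(z\) are notation introduced here rather than taken from the summary.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]

% Optimal discriminator for a fixed generator with sample distribution p_g:
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
```

When \(p_g = p_{\text{data}}\), the optimal discriminator evaluates to \(\frac{1}{2}\) for every \(x\), which matches the unique solution noted above.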


Very Deep Convolutional Networks for Large-Scale Image Recognition
  • This paper by Simonyan and Zisserman from DeepMind and Oxford in ICLR 2015 proposed the VGG architecture. They showed that a significant performance improvement can be achieved by pushing the depth to 16-19 weight layers, i.e., VGG-16 and VGG-19.
  • The main principle is that a stack of three \(3 \times 3\) convolution layers is better than a single \(7 \times 7\) layer. Firstly, the stack uses three non-linear activations (instead of one), which makes the decision function more discriminative. Secondly, the \(3 \times 3\) design decreases the number of parameters: with \(C\) input and output channels, the stack needs \(3 \times (3^2)C^2 = 27C^2\) weights, whereas a single \(7 \times 7\) conv layer requires \(1 \times (7^2)C^2 = 49C^2\) weights (81% more).
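
The weight counts above are easy to check numerically. The snippet below is a small sketch that assumes \(C\) input and output channels for every layer, ignores biases, and uses an arbitrary `C = 64`; the function names are made up for this example.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    # Weights in a stack of square conv layers with `channels` in and out (biases ignored).
    return num_layers * kernel_size ** 2 * channels ** 2

def receptive_field(kernel_size, num_layers):
    # Receptive field of a stack of stride-1 convolutions without dilation.
    return 1 + num_layers * (kernel_size - 1)

C = 64  # arbitrary channel count, chosen only for illustration
stack_3x3 = conv_weights(3, C, num_layers=3)   # 27 * C^2
single_7x7 = conv_weights(7, C)                # 49 * C^2
extra = single_7x7 / stack_3x3 - 1             # fraction of extra weights in the 7x7 layer
```

Three stacked \(3 \times 3\) layers see the same \(7 \times 7\) input window as the single large layer (`receptive_field(3, 3)` equals `receptive_field(7, 1)`), which is why the comparison is a fair one.
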
Going Deeper with Convolutions
  • This paper by Szegedy et al. from Google in CVPR 2015 introduced the Inception (also known as GoogLeNet or InceptionNet) architecture which achieved state of the art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.
  • Ideas from the paper:
    • Increasing the depth (number of layers) is not the only way to make a model bigger. What about increasing both the depth and width of the network while keeping computations at a manageable level? This time the inspiration comes from the human visual system, wherein information is processed at multiple scales and then aggregated locally. How do you achieve this without a memory explosion? The answer is with \(1 \times 1\) convolutions! Their main purpose is channel dimensionality reduction: they reduce the number of channels fed into the next layer, so \(1 \times 1\) convolutions are used to compute reductions before the computationally expensive convolutions (\(3 \times 3\) and \(5 \times 5\)). Inception uses convolutions of different kernel sizes (\(5 \times 5\), \(3 \times 3\), \(1 \times 1\)) to capture details at multiple scales.
    • To enable concatenation of features convolved with different kernels, they pad each branch so that its output has the same spatial size as the input. For stride-1 convolutions without dilation, the padding \(p\) for a kernel of size \(k\) is chosen so that \(out = in\) (i.e., input and output have the same spatial dimensions): since \(out = in + 2p - k + 1\), this gives \(p = (k-1)/2\).
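
Both ideas above, size-preserving padding and \(1 \times 1\) reductions, can be checked with simple arithmetic. The sketch below is illustrative only: the feature-map size and channel counts are hypothetical values picked for the example, not figures quoted from the paper.

```python
def conv_output_size(in_size, kernel, padding, stride=1):
    # Spatial output size of a convolution without dilation.
    return (in_size + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    # Padding that keeps out == in for a stride-1 convolution with an odd kernel.
    if kernel % 2 == 0:
        raise ValueError("an even kernel cannot be padded symmetrically to keep the size")
    return (kernel - 1) // 2

def conv_macs(height, width, kernel, in_channels, out_channels):
    # Multiply-accumulates of a size-preserving, stride-1 convolution.
    return height * width * kernel ** 2 * in_channels * out_channels

# Hypothetical 5x5 branch: 28x28 feature map, 192 input channels, 32 output channels.
direct = conv_macs(28, 28, 5, 192, 32)
# Same branch with a 1x1 reduction to 16 channels before the 5x5 convolution.
reduced = conv_macs(28, 28, 1, 192, 16) + conv_macs(28, 28, 5, 16, 32)
```

Because every branch keeps the input's height and width, the branch outputs can be concatenated along the channel axis, and the reduced branch costs a fraction of the direct one.
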
FaceNet: A Unified Embedding for Face Recognition and Clustering
  • This paper by Schroff et al. from Google in 2015 proposes FaceNet, a system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
  • Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, they use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of their approach is much greater representational efficiency: they achieve state-of-the-art face recognition performance using only 128 bytes per face.
  • Previous face recognition approaches based on deep networks use a classification layer trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network. In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Their triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin.
  • Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, they present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, they also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
  • The triplet loss minimizes the L2 distance between faces of the same identity and enforces a margin between that distance and the distance to faces of different identities, i.e., it imposes a relative distance constraint. Specifically, the triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Thus, the network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances. Once this embedding has been produced, downstream tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
  • On the widely used Labeled Faces in the Wild (LFW) dataset, their system achieves a new record accuracy of 99.63%; on YouTube Faces DB it achieves 95.12%. This cuts the error rate in comparison to the best published result by 30% on both datasets.
  • They explore two different deep convolutional network architectures that have been recently used to great success in the computer vision community. The first architecture is based on the Zeiler & Fergus model, which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses, which reduces the number of parameters by up to 20 times and has the potential to reduce the number of FLOPS required for comparable performance.
  • They also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and can be compared directly.
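
The triplet loss and the distance-threshold verification described above can be sketched on plain lists of floats. This is a minimal illustration: the margin and threshold values below are placeholders chosen for the example, and the embeddings are tiny 2-D vectors rather than real 128-D face embeddings.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two embeddings.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Zero once the negative is farther from the anchor than the positive by at least `margin`.
    return max(squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin, 0.0)

def same_identity(embedding_a, embedding_b, threshold=1.0):
    # Verification by thresholding the embedding distance (threshold is a hypothetical value).
    return squared_l2(embedding_a, embedding_b) < threshold

anchor, positive, negative = [0.0, 0.0], [0.0, 0.1], [1.0, 0.0]
easy = triplet_loss(anchor, positive, negative)   # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped: constraint violated
```

Triplets like `easy` contribute no gradient, which is why the choice of triplets (the mining strategy discussed above) matters so much in practice.
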
Distilling the Knowledge in a Neural Network
  • This paper by Hinton et al. from Google in NeurIPS 2014 starts from the observation that a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
  • Caruana et al. have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and the authors develop this approach further using a different compression technique. They achieve some surprising results on MNIST and show that they can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. They also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. This shows that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.
  • The results show that on MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, they have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
  • For really big neural networks, it can be infeasible even to train a full ensemble, but they have shown that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.
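
One common way to express distillation is to train the small model on the teacher's softened output distribution. The sketch below assumes the temperature-scaled softmax formulation; the temperature value, logits, and function names are illustrative choices, and the summary above does not spell out these details.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperatures produce softer (more uniform) distributions.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the softened teacher and student distributions.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [6.0, 2.0, -1.0]        # hypothetical teacher (or ensemble) outputs
matching = distillation_loss(teacher_logits, teacher_logits)
mismatched = distillation_loss([-1.0, 2.0, 6.0], teacher_logits)
```

The soft targets carry information about which wrong classes the teacher considers plausible, which hard labels alone would discard.
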
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
  • A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable.
  • This paper by Sohl-Dickstein et al. from Surya Ganguli’s lab at Stanford in 2015 develops an approach that simultaneously achieves both flexibility and tractability. They introduce a novel algorithm for modeling probability distributions that enables exact sampling and evaluation of probabilities, and demonstrate its effectiveness on a variety of toy and real datasets, including challenging natural image datasets. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process.
  • They then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows them to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model.
  • For each of the tests they conduct, they use a similar basic algorithm, showing that their method can accurately model a wide variety of distributions. Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation are often extremely expensive. The core of their algorithm consists of estimating the reversal of a Markov diffusion chain which maps data to a noise distribution; as the number of steps is made large, the reversal distribution of each diffusion step becomes simple and easy to estimate.
  • The result is an algorithm that can learn a fit to any data distribution, but which remains tractable to train, exactly sample from, and evaluate, and under which it is straightforward to manipulate conditional and posterior distributions.
  • Code.
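
The forward half of the process described above, slowly destroying structure until only noise remains, can be sketched with a Gaussian diffusion chain. This is a toy under stated assumptions: the fixed per-step noise level `beta`, the step count, and the 1-D data are choices made here, and the learned reverse process is not shown.

```python
import math
import random

def forward_diffusion(x0, num_steps, beta, rng):
    # Run a Gaussian forward diffusion chain and return every intermediate state.
    x = list(x0)
    trajectory = [x]
    keep = math.sqrt(1.0 - beta)    # how much of the previous state survives a step
    noise_scale = math.sqrt(beta)   # standard deviation of the noise added per step
    for _ in range(num_steps):
        x = [keep * xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]
        trajectory.append(x)
    return trajectory

def mean_and_variance(values):
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
data = [5.0] * 500   # toy "structured" data: every point sits at 5.0
trajectory = forward_diffusion(data, num_steps=500, beta=0.02, rng=rng)
final_mean, final_variance = mean_and_variance(trajectory[-1])
```

After enough small steps the samples are close to a standard normal regardless of where they started; the generative model is then obtained by learning to reverse each step.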


Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
  • In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention.
  • This paper by Radford et al. in ICLR 2016 helps bridge the gap between the success of CNNs for supervised learning and unsupervised learning. They introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning.
  • Training on various image datasets, they show convincing evidence that their deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator.
  • Additionally, they use the learned features for novel tasks - demonstrating their applicability as general image representations.
Rethinking the Inception Architecture for Computer Vision
  • This paper by Szegedy et al. from Google in CVPR 2016 proposed InceptionV2, V3 by improving the Inception model based on the following principles:
    • Using the same principle as VGG, the authors factorized \(5 \times 5\) and \(7 \times 7\) (in InceptionV3) convolutions into two and three sequential \(3 \times 3\) convolutions respectively. This improves computational speed and uses far fewer parameters.
    • Used spatially separable convolutions. Simply, a \(3 \times 3\) kernel is decomposed into two smaller ones: a \(1 \times 3\) and a \(3 \times 1\) kernel, which are applied sequentially.
    • Widened the inception modules (more filters).
    • Distributed the computational budget in a balanced way between the depth and width of the network.
    • Added batch normalization.
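
The spatially separable decomposition mentioned above can be verified on a small example. The sketch below uses a hand-picked kernel that is exactly the outer product of a \(3 \times 1\) and a \(1 \times 3\) kernel, with no non-linearity in between; a general \(3 \times 3\) kernel is not exactly separable, so a network using this trick learns the two small kernels directly. All names and values are illustrative.

```python
def correlate2d(image, kernel):
    # "Valid" 2D cross-correlation on nested lists (no padding, stride 1).
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

row_kernel = [[1, 2, 1]]        # 1 x 3
col_kernel = [[1], [0], [-1]]   # 3 x 1
# The equivalent 3 x 3 kernel is the outer product of the two small kernels.
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

image = [[(3 * i + 7 * j) % 11 for j in range(6)] for i in range(5)]
sequential = correlate2d(correlate2d(image, row_kernel), col_kernel)
direct = correlate2d(image, full_kernel)
```

The sequential version uses 6 weights instead of 9 per input/output channel pair, which is where the savings come from.
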
Deep Residual Learning for Image Recognition
  • ResNet paper by He et al. from Microsoft Research in CVPR 2016. It is among the most cited papers across several AI fields.
  • The issue of vanishing gradients when training a deep neural network was addressed with two tricks:
    • Batch normalization, and
    • Short skip connections.
  • Instead of \(H(x) = F(x)\), the skip connection leads to \(H(x) = F(x) + x\), which implies that the model is learning the difference (i.e., residual), \(F(x) = H(x) - x\).
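
The relation \(H(x) = F(x) + x\) can be sketched directly. In this toy version the residual branch is any Python callable that preserves the vector length; a real block would use convolutions, batch normalization, and non-linearities, none of which are modelled here.

```python
def residual_block(x, residual_fn):
    # H(x) = F(x) + x: the branch only has to model the difference from the identity.
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input size")
    return [f + xi for f, xi in zip(fx, x)]

def zero_branch(x):
    # A branch that has learned nothing yet: F(x) = 0.
    return [0.0 for _ in x]

def small_branch(x):
    # Hypothetical learned residual: a small correction to the input.
    return [0.1 * v for v in x]

identity_output = residual_block([1.0, 2.0], zero_branch)
corrected_output = residual_block([1.0, 2.0], small_branch)
```

When the branch outputs zero the block reduces to the identity, so stacking more blocks cannot make the signal path worse than a shallower network, and gradients can flow through the skip connection unchanged.
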
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  • State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.
  • This paper by Ren et al. from University of Science and Technology of China and Microsoft Research in 2016 proposes a Region Proposal Network (RPN) for efficient and accurate region proposal generation. By sharing full-image convolutional features with the downstream detection network, the region proposal step becomes nearly cost-free.
  • An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
  • They further merge RPN and Fast R-CNN into a single network by sharing their convolutional features – using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look.
  • For the very deep VGG-16 model, their detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks.
  • Faster R-CNN enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
  • Code.

You Only Look Once: Unified, Real-Time Object Detection
  • Prior work on object detection repurposes classifiers to perform detection.
  • This paper by Redmon et al. from Ali Farhadi’s group at UW in 2016 presents YOLO, a new approach to object detection which frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
  • Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
  • YOLO is extremely fast and can thus be utilized for real-time object detection. The base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
  • Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.


Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
  • This paper by Szegedy et al. from Google in AAAI 2017 introduced the latest versions of the Inception model – InceptionV4 and Inception-ResNet.
Photo-Realistic Single Image Super-Resolution using a GAN
  • This paper by Ledig et al. from Twitter in CVPR 2017 applied GANs for single image super-resolution (SISR).
Understanding intermediate layers using linear classifier probes
  • Neural network models have a notorious reputation for being black boxes.
  • This paper by Alain and Bengio from Mila and the University of Montreal in ICLR 2017 proposes to monitor the features at every layer of a model and measure how suitable they are for classification.
  • They use linear classifiers, which they refer to as “probes”, trained entirely independently of the model itself. This helps them better understand the roles and dynamics of the intermediate layers. They demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems.
  • They apply this technique to the popular models Inception v3 and ResNet-50. Among other things, they observe experimentally that the linear separability of features increases monotonically along the depth of the model.
Image-to-Image Translation with Conditional Adversarial Networks
  • Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems.
  • This paper by Isola et al. from UC Berkeley in CVPR 2017 introduces pix2pix, a conditional adversarial network-based framework for image-to-image translation.
  • These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations.
  • They demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
  • The figure below from the paper shows the results of the method on several inputs. In each case, they use the same architecture and objective, and simply train on different data.

Improved Image Captioning via Policy Gradient optimization of SPIDEr
  • Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality.
  • Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for.
  • This paper by Liu et al. from Oxford and Google in ICCV 2017 shows how to use a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr (a combination they call SPIDEr): the SPICE score ensures their captions are semantically faithful to the image, while the CIDEr score ensures their captions are syntactically fluent.
  • The proposed PG method improves on the prior MIXER approach, by using Monte Carlo rollouts instead of mixing MLE training with PG. They show empirically that SPIDEr leads to easier optimization and improved results compared to MIXER.
  • Finally, they show that their PG method can optimize any of the metrics, including the proposed SPIDEr metric, which results in image captions that are strongly preferred by human raters compared to captions generated by the same model but trained to optimize MLE or the COCO metrics.


From Recognition to Cognition: Visual Commonsense Reasoning
  • Visual understanding goes well beyond object recognition. With one glance at an image, humans can effortlessly imagine the world beyond the pixels: for instance, they can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world.
  • This paper by Zellers et al. from UW in CVPR 2019 formalizes this task as Visual Commonsense Reasoning (VCR). Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
  • Next, they introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
  • To move towards cognition-level understanding, they present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and they provide analysis that suggests avenues for future work.
  • Website with models/datasets.
Focal Loss for Dense Object Detection
  • The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.
  • This paper by Lin et al. in 2017 investigates why this is the case and introduces focal loss. They discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.
  • Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
  • Their novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of their loss, they design and train a simple dense detector they call RetinaNet.
  • Their results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
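
The modulating term described above can be sketched for a single example. Here `p_t` denotes the predicted probability of the correct class and `gamma` the focusing parameter; the default `gamma=2.0` and the omission of any class-balancing weight are simplifying assumptions of this example.

```python
import math

def cross_entropy(p_t):
    # Standard cross entropy for the probability assigned to the correct class.
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    # The factor (1 - p_t)^gamma decays to zero as confidence in the correct class grows.
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # well-classified example
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # badly misclassified example
```

Setting `gamma=0.0` recovers plain cross entropy, so the focusing parameter interpolates between the standard loss and one dominated by hard examples.
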
Relational inductive biases, deep learning, and graph networks
  • Recent advances in AI, propelled by deep learning, have been transformative across many important domains. Despite this, a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.
  • This paper by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in the table below from the paper.

  • They argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and advocate for marrying complementary approaches which draw on ideas from human cognition, traditional computer science, standard engineering practice, and modern deep learning. Just as biology uses nature and nurture cooperatively, they reject the false choice between “hand-engineering” and “end-to-end” learning, and instead advocate for an approach which benefits from their complementary strengths.
  • They investigate how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them.
  • They explore flexible learning-based approaches which implement strong relational inductive biases to capitalize on explicitly structured representations and computations, and present a new building block for the AI toolkit – the graph neural networks (GNNs).
  • GNNs generalize and extend various approaches for neural networks that operate on graphs, and provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. GNNs are designed to promote building complex architectures using customizable graph-to-graph building blocks, and their relational inductive biases support relational reasoning, combinatorial generalization, and improved sample efficiency over other standard machine learning building blocks. This would help lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
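  • To make the graph-to-graph idea concrete, here is a rough sketch of one block pass in the order the paper describes: update edges, aggregate edges per receiving node, update nodes, then update the global attribute. Attributes are scalars and the update functions are plain sums; both are hypothetical stand-ins for the learned functions, chosen only to keep the example readable.

```python
def gn_block(nodes, edges, u):
    """One graph-to-graph pass with scalar attributes.

    nodes: list of node attributes v_i
    edges: list of (sender, receiver, attribute) triples
    u:     global attribute
    """
    # 1. Edge update: each edge sees its own attribute, both endpoints, and u.
    new_edges = [(s, r, e + nodes[s] + nodes[r] + u) for s, r, e in edges]

    # 2. Aggregate the updated edges arriving at each node (sum aggregation).
    incoming = [0.0] * len(nodes)
    for _, r, e in new_edges:
        incoming[r] += e

    # 3. Node update: each node sees its aggregated edges, itself, and u.
    new_nodes = [v + agg + u for v, agg in zip(nodes, incoming)]

    # 4. Global update: u sees all updated edges and nodes, aggregated by sum.
    new_u = u + sum(e for _, _, e in new_edges) + sum(new_nodes)
    return new_nodes, new_edges, new_u

out_nodes, out_edges, out_u = gn_block([1.0, 2.0], [(0, 1, 0.5)], 0.0)
```

  • Because the output has the same structure as the input, blocks like this can be chained or composed, which is the property the summary above refers to.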
Squeeze-and-Excitation Networks
  • The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy.
  • This paper, from the Chinese Academy of Sciences, University of Macau, and the Visual Geometry Group at the University of Oxford, focuses instead on the channel relationship and proposes a novel architectural unit, which they term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
  • They show that these blocks can be stacked together to form SENet architectures that generalize extremely effectively across different datasets.
  • They further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of their ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.
  • The following figure from the paper shows a squeeze-and-excitation block.

  • The following figure from the paper shows (first half) the schema of the original Inception module (left) and the SE-Inception module (right); (second half) the original Residual module (left) and the SE-ResNet module (right).
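  • A minimal sketch of the channel recalibration, assuming the usual squeeze (global average pooling), excitation (two fully connected layers with ReLU and sigmoid), and scale steps. Biases are omitted, and the weight matrices `w1` and `w2` are hypothetical placeholders for learned parameters.

```python
import math

def se_block(x, w1, w2):
    """x:  C channels, each an H x W list of lists.
    w1: (C / r) x C weights of the bottleneck layer.
    w2: C x (C / r) weights of the expansion layer."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every activation in a channel by that channel's gate.
    out = [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]
    return out, gates

features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = se_block(features, w1=[[0.0, 0.0]], w2=[[1.0], [1.0]])
```

  • With the all-zero bottleneck used here every gate is sigmoid(0) = 0.5; learned weights would instead produce input-dependent gates.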

When Does Label Smoothing Help?
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Müller et al. from Google Brain in NeurIPS 2019 studies label smoothing in terms of the effects on penultimate layer representations, model calibration as well as knowledge distillation (Hinton et al., 2015).
  • The figure below from the paper shows the visualization of the penultimate layer representations of the following models (trained with label smoothing, denoted by “w/ LS” in the figure; and without label smoothing, denoted by “w/o LS” in the figure) and datasets:
    • First row: AlexNet (Krizhevsky et al., 2012) on the CIFAR-10 (Krizhevsky, 2009) dataset, with the visualization of three semantically different classes.
    • Second row: ResNet-56 (He et al., 2016) on the CIFAR-100 (Krizhevsky, 2009) dataset, with the visualization of three semantically different classes.
    • Third row: Inception-v4 (Szegedy et al., 2017) on the ImageNet (Russakovsky et al., 2014) dataset, with the visualization of three semantically different classes.
    • Fourth row: Inception-v4 on the ImageNet dataset, with the visualization of two semantically similar classes and a semantically different one.
  • It can be observed that with label smoothing, the activations of the same class are clustered more tightly than without it, because training with label smoothing encourages the penultimate layer representations of the same class to be equally distant from other classes.
  • In order to study the effects of label smoothing on model calibration, the authors conducted experiments on image classification and machine translation tasks. It was observed that training with label smoothing could reduce the expected calibration error (Guo et al., 2017) compared to training without label smoothing.
  • In addition, the authors noticed that in knowledge distillation, while a teacher model trained with label smoothing could have better accuracy for the teacher model itself, it could produce student models with worse performance.
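  • For reference, a small sketch of what label smoothing does to the training targets, using the standard form \(y_k^{LS} = y_k (1 - \alpha) + \alpha / K\). The function names and the value \(\alpha = 0.1\) are illustrative assumptions.

```python
import math

def smooth_labels(true_class, num_classes, alpha=0.1):
    """Mix the one-hot target with a uniform distribution over K classes."""
    return [(1.0 - alpha) * (1.0 if k == true_class else 0.0) + alpha / num_classes
            for k in range(num_classes)]

def cross_entropy(logits, targets):
    """Cross entropy between soft targets and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for l, t in zip(logits, targets))

targets = smooth_labels(0, 4)
confident = cross_entropy([3.0, 0.0, 0.0, 0.0], targets)
overconfident = cross_entropy([20.0, 0.0, 0.0, 0.0], targets)
```

  • With smoothed targets, pushing the correct logit ever higher eventually increases the loss (`overconfident > confident`), which is one intuition for the calibration effect discussed above.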
Unsupervised Feature Learning via Non-Parametric Instance Discrimination
  • This paper by Wu et al. from UC Berkeley, Chinese University of Hong Kong, and Amazon Rekognition in CVPR 2018 as a spotlight paper introduces a novel method for unsupervised feature learning in neural networks, leveraging non-parametric instance discrimination.
  • The unique approach involves treating each image as a distinct class and employing noise-contrastive estimation (NCE) to address the computational challenges posed by the vast number of instance classes.
  • A non-parametric softmax classifier is proposed, which uses direct feature representation instead of a class weight vector, allowing for precise instance comparisons. This involves projecting image features into a 128-dimensional space and normalizing them. To efficiently store these representations, the concept of a memory bank is introduced.
  • To reduce the computational burden of the softmax function over numerous classes, the paper implements NCE, which approximates the full softmax distribution and cuts computational complexity from \(O(n)\) to \(O(1)\) per sample, without sacrificing performance. To stabilize the learning process, proximal regularization is applied. This is crucial as each instance class is visited only once per epoch, aiding in smoother learning dynamics and faster convergence.
  • The paper also explores an alternative approach involving storing representations from previous batches in a queue to be used as negative examples in the loss (Wu et al., 2018). This method allows for smaller batch sizes but introduces asymmetry between “queries” (generated from current batch elements) and “keys” (stored in the queue). Only “queries” undergo gradient backpropagation, treating “key” representations as fixed. However, this leads to performance drops when the network rapidly evolves during training. To address this, He et al. (2020) proposed MoCo, a technique using two networks: one for keys and one for queries, with the keys’ network updating more slowly. This offers a more stable learning dynamic, as the query network is updated using backpropagation and stochastic gradient descent.
  • The following figure from the paper shows the pipeline of their unsupervised feature learning approach. They use a backbone CNN to encode each image as a feature vector, which is projected to a 128-dimensional space and L2 normalized. The optimal feature embedding is learned via instance-level discrimination, which tries to maximally scatter the features of training samples over the 128-dimensional unit sphere.

  • The method exhibits state-of-the-art performance in unsupervised image classification on standard datasets like CIFAR-10 and ImageNet, notably achieving a top-1 accuracy of 46.5% on ImageNet.
  • The learned features demonstrate strong generalization in semi-supervised learning and object detection, showcasing effective transfer learning.
  • The scalability and efficiency of the approach are highlighted by the compact 128-dimensional representation, requiring only 600MB for a million images, enabling rapid nearest neighbor retrieval at runtime.
  • Code.
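  • A toy sketch of the non-parametric softmax described above: the probability that a feature belongs to instance \(i\) is a softmax over dot products with the stored, L2-normalized features. This computes the full softmax rather than the NCE approximation, uses 2-dimensional features instead of 128, and the temperature value is an assumption.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def instance_probs(feature, memory_bank, tau=0.07):
    """P(i | feature) over every instance stored in the memory bank."""
    logits = [sum(a * b for a, b in zip(feature, m)) / tau for m in memory_bank]
    mx = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Memory bank with one stored feature per training image (three images here).
bank = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0])]
probs = instance_probs(l2_normalize([2.0, 0.1]), bank)
```

  • The cost of the normalizer grows with the number of stored instances, which is the burden the NCE approximation is meant to remove.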


Objects as Points
  • This paper by Zhou et al. from UT Austin in 2019 proposes CenterNet, a center point-based object detection approach, which is end-to-end differentiable, simpler, faster, and more accurate than other competitive bounding box based detectors.
  • CenterNet is an anchorless object detection architecture. As such, it has the important advantage of removing the need for the classical NMS (Non-Maximum Suppression) post-processing step, which enables faster inference.
  • Where most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each, which is wasteful, inefficient, and requires additional post-processing, CenterNet models an object as a single point — the center point of its bounding box. CenterNet object detector builds on successful keypoint estimation networks and uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, depth and extent, and pose in a single forward pass. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection.
  • Compared with other state-of-the-art detectors on the COCO test-dev set with multi-scale evaluation, CenterNet with Hourglass-104 achieves an AP of 45.1%, outperforming all existing one-stage detectors. Sophisticated two-stage detectors are more accurate, but also slower.
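  • One way to picture detection without NMS is peak extraction on the predicted center heatmap: a cell counts as a detection if it is at least as large as its 8 neighbours. The sketch below assumes a single-class heatmap stored as nested lists; the function name and the score threshold are hypothetical.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (score, row, col) for every local maximum above the threshold."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v >= n for n in neighbours):
                peaks.append((v, r, c))
    return sorted(peaks, reverse=True)     # highest-scoring centers first

heat = [[0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.3, 0.2, 0.4],
        [0.0, 0.0, 0.4, 0.6]]
centers = heatmap_peaks(heat)
```

  • In the full model, the other properties (size, offset, and so on) would then be read from regression outputs at each peak location.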
RandAugment: Practical automated data augmentation with a reduced search space
  • Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models.
  • Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images.
  • An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models.
  • This paper by Cubuk et al. from Google Brain in 2019 demonstrates that previous methods of learned augmentation suffer from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size or the model size leads to sub-optimal performance. In previous work, scaling learned data augmentation to larger datasets and models has been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data; population based augmentation was not reported for large-scale problems.
  • They propose RandAugment, a simple parameterization for targeting augmentation to particular model and dataset sizes, which seeks to remove both of the aforementioned obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes.
  • RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet without a separate search for data augmentation policies.
  • The proposed method scales well to datasets such as ImageNet and COCO, adding minimal overhead (only two hyperparameters to tune) while delivering notable gains in predictive performance.
  • On the ImageNet dataset, they achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO.
  • Finally, due to its interpretable hyperparameters, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
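  • The two-hyperparameter parameterization can be sketched as follows: sample N transforms uniformly at random and apply each at one shared magnitude M. The transforms below are hypothetical toy operations on a list of intensities in [0, 1], not the image operations used in the paper.

```python
import random

def identity(x, s):
    return list(x)

def brighten(x, s):
    return [min(1.0, v + 0.5 * s) for v in x]

def darken(x, s):
    return [max(0.0, v - 0.5 * s) for v in x]

def shift(x, s):
    k = int(round(s * (len(x) - 1)))       # circular shift by up to len - 1
    return x[-k:] + x[:-k] if k else list(x)

TRANSFORMS = [identity, brighten, darken, shift]

def rand_augment(x, n, m, max_m=10, rng=random):
    """Apply n uniformly sampled transforms, all at the shared magnitude m."""
    strength = m / max_m
    for op in rng.choices(TRANSFORMS, k=n):
        x = op(x, strength)
    return x

augmented = rand_augment([0.0, 0.5, 1.0], n=2, m=9, rng=random.Random(0))
```

  • Because the whole policy is described by `n` and `m`, a small grid search on the target task can stand in for a separate policy search.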
Semantic Image Synthesis with Spatially-Adaptive Normalization
  • This paper by Park et al. from UC Berkeley, NVIDIA and MIT CSAIL proposes a spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers.
  • They show that this is suboptimal as the normalization layers tend to “wash away” semantic information.
  • To address the issue, they propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned affine transformation. The proposed normalization leads to the first semantic image synthesis model that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes.
  • Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts.
  • Finally, their model allows user control over both semantics and style, and they demonstrate its application to multi-modal and guided image synthesis.
  • In the paper and the demo video, they showed GauGAN, an interactive app that generates realistic landscape images from the layout users draw. The model was trained on landscape images scraped from
  • Code; project page; online interactive demo of GauGAN; GauGAN360.
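  • A simplified sketch of the spatially-adaptive modulation for a single channel: activations are normalized, then scaled and shifted by values that depend on the semantic label at each pixel. In the paper the per-pixel scale and shift come from small convolutional networks applied to the layout; the per-label lookup tables `gamma` and `beta` used here are a deliberate simplification and purely hypothetical.

```python
import math

def spade_norm(x, seg, gamma, beta, eps=1e-5):
    """x: H x W activations of one channel; seg: H x W integer label map.
    gamma, beta: dicts mapping a label to its scale and shift."""
    vals = [v for row in x for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    std = math.sqrt(var + eps)
    # Normalize, then modulate each pixel according to its own label.
    return [[gamma[l] * (v - mean) / std + beta[l] for v, l in zip(xr, sr)]
            for xr, sr in zip(x, seg)]

gamma = {0: 1.0, 1: 2.0}
beta = {0: 0.0, 1: 10.0}
varied = spade_norm([[1.0, 3.0]], [[0, 1]], gamma, beta)
flat = spade_norm([[5.0, 5.0]], [[0, 1]], gamma, beta)
```

  • The `flat` case shows the motivation: a uniform input normalizes to zero everywhere, yet the output still differs per label because the layout enters through the modulation.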
Generative Modeling by Estimating Gradients of the Data Distribution
  • This paper by Song and Ermon in NeurIPS 2019 introduces a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching.
  • Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, they perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, they propose an annealed Langevin dynamics where they use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold.
  • Their framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons.
  • Their models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, they demonstrate that their models learn effective representations via image inpainting experiments.
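  • A toy sketch of annealed Langevin dynamics in one dimension. The data distribution is assumed to be a Gaussian with mean 3 and unit variance, so the score of each noise-perturbed distribution is known in closed form; in the actual method a neural network trained with score matching would supply it. The step-size rule \(\alpha_i = \epsilon \, \sigma_i^2 / \sigma_L^2\) follows the paper's algorithm as I understand it, while `eps`, `steps`, and the noise schedule are toy choices.

```python
import math
import random

MU, S = 3.0, 1.0  # assumed toy data distribution N(MU, S^2)

def toy_score(x, sigma):
    """Exact score of N(MU, S^2) perturbed with Gaussian noise of std sigma."""
    return -(x - MU) / (S ** 2 + sigma ** 2)

def annealed_langevin(score, sigmas, eps=0.05, steps=100, rng=random):
    x = rng.uniform(-10.0, 10.0)           # arbitrary starting point
    for sigma in sigmas:                   # largest noise level first
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(steps):
            z = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * z
    return x

rng = random.Random(0)
sigmas = [2.0 * 0.1 ** (i / 9) for i in range(10)]   # geometric, 2.0 down to 0.2
samples = [annealed_langevin(toy_score, sigmas, rng=rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
```

  • Large noise levels let the chain move quickly toward the bulk of the data; smaller ones then refine the sample, which is why the schedule runs from coarse to fine.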


Denoising Diffusion Probabilistic Models
  • This paper by Ho et al. from Pieter Abbeel’s lab at UC Berkeley presents high quality image samples using diffusion probabilistic models (also called diffusion models), a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.
  • Their best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and their models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
  • On the unconditional CIFAR10 dataset, they obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, they obtain sample quality similar to ProgressiveGAN.
  • Code.
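  • A minimal sketch of the forward (noising) process, which has the closed form \(q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)\) with \(\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)\). The linear schedule from \(10^{-4}\) to \(0.02\) over 1000 steps is the setting usually attributed to the paper; the function names are hypothetical.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng=random):
    """Draw x_t ~ q(x_t | x_0) directly; also return the noise that was used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(abar[t])
    s = math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, noise)], noise

abar = alpha_bars(linear_betas())
x_t, noise = q_sample([0.5, -0.5, 1.0], t=500, abar=abar, rng=random.Random(0))
```

  • Training then amounts to asking a network to recover `noise` from `x_t` and `t`; by the final step almost no signal from `x0` remains.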
Designing Network Design Spaces
  • This paper by Radosavovic et al. from FAIR in CVPR 2020 presents a new network design paradigm. Their goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level.
  • Their methodology explores the structural aspect of network design and arrives at a low-dimensional design space consisting of simple, regular networks that they call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function.
  • They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes.
  • Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
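  • The quantized linear rule can be sketched in a few lines: start from a linear width \(u_j = w_0 + w_a \cdot j\) for block \(j\), then snap it to the nearest power of a multiplier \(w_m\) so that neighbouring blocks share a width and form stages. This follows the parametrization as I understand it from the paper; the example parameter values and the rounding to multiples of 8 are assumptions.

```python
import math

def regnet_widths(depth, w0, wa, wm, q=8):
    """Per-block widths from a quantized linear function."""
    widths = []
    for j in range(depth):
        u = w0 + wa * j                         # linear width for block j
        s = math.log(u / w0) / math.log(wm)     # solve w0 * wm**s = u for s
        w = w0 * wm ** round(s)                 # quantize the exponent
        widths.append(int(round(w / q) * q))    # snap to a multiple of q
    return widths

widths = regnet_widths(depth=6, w0=32, wa=32, wm=2)
```

  • Runs of equal widths in the result correspond to stages, so four numbers (`depth`, `w0`, `wa`, `wm`) describe the whole width and depth layout.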
Training data-efficient image transformers & distillation through attention
  • Compared to CNNs, vision transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
  • This paper by Touvron et al. from Facebook AI proposes DeiT, a competitive convolution-free transformer that does not require a very large amount of data to be trained, thanks to improved training and in particular a novel distillation procedure. DeiT is trained on ImageNet on a single computer in less than 3 days. Their reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
  • They introduce a teacher-student strategy specific to transformers. Using distillation can hamper the performance of neural networks. The student model pursues two different objectives that may be diverging: learning from a labeled dataset (strong supervision) and learning from the teacher. To alleviate this, they introduced a distillation token, which is a learned vector that flows through the network along with the transformed image data. The distillation token cues the model for its distillation output, which can differ from its class output. This new distillation method is specific to Transformers and further improves the image classification performance.
  • It relies on a distillation token ensuring that the student learns from the teacher through attention. They demonstrate the value of this token-based distillation, especially when using a ConvNet as a teacher. This leads them to report results competitive with CNNs both on ImageNet (where they obtain up to 85.2% top-1 accuracy) and when transferring to other tasks.
  • Facebook AI post.
  • Code.
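  • A sketch of the two-objective training signal with hard-label distillation: the class-token head is supervised by the true label, and the distillation-token head by the teacher's predicted label, with equal weights. The function names are hypothetical and the logits are plain lists rather than model outputs.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average of two cross entropies: true label and teacher's hard label."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    ce_true = -log_softmax(cls_logits)[label]             # class-token head
    ce_teacher = -log_softmax(dist_logits)[teacher_label] # distillation-token head
    return 0.5 * ce_true + 0.5 * ce_teacher

# The teacher disagrees with the ground truth here, so each head has its own target.
untrained = hard_distillation_loss([0.0, 0.0], [0.0, 0.0], [1.0, 5.0], label=0)
trained = hard_distillation_loss([10.0, 0.0], [0.0, 10.0], [0.0, 3.0], label=0)
```

  • Keeping separate heads is what lets the distillation output differ from the class output, as described above.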
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • This paper by Mildenhall et al. from UC Berkeley, Google and UCSD in ECCV 2020 introduces NeRF, a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
  • Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
  • They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize their representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
  • Project page with videos and code.
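  • The classic volume rendering step can be sketched as a discrete sum along one ray: each sample contributes its color weighted by its opacity \(1 - e^{-\sigma_i \delta_i}\) and by the transmittance accumulated before it. Densities and colors below are hand-picked toy values; in NeRF they would come from querying the network at each sample's 5D coordinate.

```python
import math

def render_ray(sigmas, colors, deltas):
    """Composite samples along a ray, front to back.

    sigmas: volume densities; colors: RGB tuples; deltas: sample spacings.
    """
    transmittance = 1.0                    # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        pixel = [p + w * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights

# Empty space, then an effectively opaque green surface, then a hidden blue sample.
pixel, weights = render_ray(
    sigmas=[0.0, 1e9, 5.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

  • Every operation here is differentiable with respect to the densities and colors, which is why image reconstruction error alone can drive the optimization.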
Bootstrap your own latent: A new approach to self-supervised Learning
  • This paper by Grill et al. from DeepMind and Imperial College in 2020 introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning.
  • BYOL learns its representation by predicting previous versions of its outputs, without using negative pairs. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, they train the online network to predict the target network representation of the same image under a different augmented view. At the same time, they update the target network with a slow-moving average of the online network.
  • While state-of-the-art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet, using 30% fewer parameters.
  • They show that BYOL performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks.
  • Nevertheless, BYOL remains dependent on existing sets of augmentations that are specific to vision applications. To generalize BYOL to other modalities, it is necessary to obtain similarly suitable augmentations for each of them. Designing such augmentations may require significant effort and expertise. Therefore, automating the search for these augmentations would be an important next step to generalize BYOL to other modalities.
  • BYOL’s architecture is as shown below. BYOL minimizes a similarity loss between \(q_{\theta}\left(z_{\theta}\right)\) and \(\operatorname{sg}\left(z_{\xi}^{\prime}\right)\), where \(\theta\) denotes the trained weights, \(\xi\) is an exponential moving average of \(\theta\), and \(\operatorname{sg}\) means stop-gradient. At the end of training, everything but \(f_{\theta}\) is discarded, and \(y_{\theta}\) is used as the image representation.
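  • A small sketch of the two ingredients described above: a similarity loss between the online prediction and the (constant) target projection, written as 2 minus twice their cosine similarity, and the slow-moving-average update of the target weights. Parameters are flat lists and the value of `tau` is illustrative.

```python
import math

def byol_loss(prediction, target):
    """2 - 2 * cosine similarity; the target is treated as a constant."""
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)

def update_target(target_params, online_params, tau=0.99):
    """Move each target weight a small step toward the online weight."""
    return [tau * xi + (1.0 - tau) * th
            for xi, th in zip(target_params, online_params)]

aligned = byol_loss([1.0, 0.0], [2.0, 0.0])      # same direction
opposite = byol_loss([1.0, 0.0], [-1.0, 0.0])    # opposite direction
new_target = update_target([0.0], [1.0], tau=0.9)
```

  • No negative pairs appear anywhere in the loss; only the asymmetry between the two networks keeps the representation from collapsing.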

A Simple Framework for Contrastive Learning of Visual Representations
  • This paper by Chen et al. from Google Research and Hinton’s lab in ICML 2020 presents SimCLR, a simple framework for contrastive learning of visual representations.
  • They simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, they systematically study the major components of their framework and show the effects of different design choices.
  • They show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
  • By combining these findings, SimCLR is able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. SimCLR differs from standard supervised learning on ImageNet only in the choice of data augmentation, the use of a nonlinear head at the end of the network, and the loss function. The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.
  • A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, SimCLR achieves 85.8% top-5 accuracy, outperforming AlexNet with 100x fewer labels.
  • The following diagram shows the SimCLR framework. Two separate data augmentation operators are sampled from the same family of augmentations (\(t \sim \mathcal{T}\) and \(t^{\prime} \sim \mathcal{T}\)) and applied to each data example to obtain two correlated views. A base encoder network \(f(\cdot)\) and a projection head \(g(\cdot)\) are trained to maximize agreement using a contrastive loss. After training is completed, they throw away the projection head \(g(\cdot)\) and use encoder \(f(\cdot)\) and representation \(\boldsymbol{h}\) for downstream tasks.
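  • A compact sketch of a contrastive loss of the kind SimCLR uses (a temperature-scaled cross entropy over cosine similarities within the batch). It assumes embeddings are arranged so that `z[2k]` and `z[2k + 1]` are the two views of example `k`; that layout, the function name, and the temperature value are assumptions for illustration.

```python
import math

def nt_xent(z, temperature=0.5):
    """Mean contrastive loss over 2N projected embeddings."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1                                   # index of the positive view
        sims = [cos(z[i], z[k]) / temperature for k in range(n) if k != i]
        pos = cos(z[i], z[j]) / temperature
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - pos              # -log softmax of the positive
    return total / n

matched = nt_xent([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

  • Every other embedding in the batch acts as a negative, which is one reason larger batches help this objective.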

Conditional Negative Sampling for Contrastive Learning of Visual Representations
  • Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two views of an image. NCE uses randomly sampled negative examples to normalize the objective.
  • This paper by Wu et al. from Stanford in 2020 shows that choosing difficult negatives, or those more similar to the current instance, can yield stronger representations. To do this, they introduce a family of mutual information estimators called Conditional Noise Contrastive Estimator (CNCE) that sample negatives conditionally – in a “ring” around each positive, by approximating the partition function using samples from a class of conditional distributions. They prove that these estimators lower-bound mutual information, with higher bias but lower variance than NCE.
  • Applying these estimators as objectives in contrastive representation learning, they show that CNCE’s representations outperform existing approaches consistently across a spectrum of contrastive objectives, data distributions, and transfer tasks.
  • Experimentally, CNCE applied on top of existing models (IR, CMC, and MoCo) improves accuracy by 2-5 percentage points in each case, measured by linear evaluation on four standard image datasets. Moreover, they find continued benefits when transferring features to a variety of new image distributions from the Meta-Dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and keypoint detection.
Momentum Contrast for Unsupervised Visual Representation Learning
  • This paper by He et al. from Facebook AI in CVPR 2020 presents Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, MoCo builds a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
  • MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
  • Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded query \(q\) to a dictionary of encoded keys using a contrastive loss, as shown in the diagram below. The dictionary keys \(\left\{k_{0}, k_{1}, k_{2}, \ldots\right\}\) are defined on-the-fly by a set of data samples. The dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling it from the mini-batch size. The keys are encoded by a slowly progressing encoder, driven by a momentum update with the query encoder. This method enables a large and consistent dictionary for learning visual representations.

  • The figure below from the paper shows the conceptual comparison of three contrastive loss mechanisms by illustrating one pair of query and key. The three mechanisms differ in how the keys are maintained and how the key encoder is updated. (a): The encoders for computing the query and key representations are updated end-to-end by back-propagation (the two encoders can be different). (b): The key representations are sampled from a memory bank. (c): MoCo encodes the new keys on-the-fly by a momentum-updated encoder, and maintains a queue (not illustrated in this figure) of keys.
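  • The dictionary-as-queue idea can be sketched with a bounded deque: each step enqueues the newest keys and silently drops the oldest, and the loss is a softmax cross entropy that scores the positive key against everything in the queue. The key encoder's momentum update is not shown; the queue size and the function names are hypothetical, and the temperature is an assumed value.

```python
import math
from collections import deque

def update_queue(queue, new_keys):
    """Enqueue the current mini-batch; deque(maxlen=K) dequeues the oldest keys."""
    queue.extend(new_keys)

def info_nce(q, k_pos, queue, tau=0.07):
    """Cross entropy with the positive key as the correct class among all keys."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in queue]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

queue = deque(maxlen=4)                     # K = 4 keys, independent of batch size
update_queue(queue, [[1.0, 0.0], [0.0, 1.0]])
update_queue(queue, [[0.0, -1.0], [-1.0, 0.0], [0.6, 0.8]])

good = info_nce([1.0, 0.0], [1.0, 0.0], queue)    # query matches its key
bad = info_nce([1.0, 0.0], [-1.0, 0.0], queue)    # query points away from its key
```

  • Because old keys linger in the queue for several steps, they stay useful as negatives only if the key encoder changes slowly, which is the role of the momentum update.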

Generative Pretraining from Pixels
  • The paper builds on the observation that, just as a large transformer model trained on language can generate coherent text, the exact same model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, they show that their best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
  • This paper by Chen et al. from OpenAI in ICML 2020 examines whether similar models can learn useful representations for images, inspired by progress in unsupervised representation learning for natural language.
  • They train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure.
  • Despite training on low-resolution ImageNet without labels, they find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, they achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models.
  • An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of their features.
  • OpenAI article.
Random Erasing Data Augmentation
  • This paper by Zhong et al. from Xiamen University, University of Technology Sydney, Australian National University, and CMU in AAAI 2020 introduces Random Erasing (“RandomErase”), a new data augmentation method for training convolutional neural networks (CNNs). In training, Random Erasing randomly selects a rectangle region in an image and erases its pixels with random values.
  • In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models.
  • Albeit simple, Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and yields consistent improvement over strong baselines in image classification, object detection and person re-identification.
  • The figure below from the paper shows examples of random erasing in image classification (a), person re-identification (re-ID) (b), and object detection (c), along with a comparison with different augmentation methods (d). In CNN training, they randomly choose a rectangle region in the image and erase its pixels with random values or the ImageNet mean pixel value. Images with various levels of occlusion are thus generated.
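The erasing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `random_erasing` and the default values for `p`, `area_range`, and `min_aspect` are assumptions chosen for the example.

```python
import numpy as np

def random_erasing(img, rng, p=0.5, area_range=(0.02, 0.4), min_aspect=0.3, max_tries=100):
    """Return a copy of img (H, W, C floats in [0, 1]) with one random rectangle erased.

    All default values are illustrative assumptions, not values taken from the text.
    """
    out = img.copy()
    if rng.random() >= p:                 # apply the augmentation with probability p
        return out
    h, w, c = img.shape
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * h * w             # target area of the rectangle
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)  # height / width ratio
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:                       # retry if the box does not fit
            top = rng.integers(0, h - eh + 1)
            left = rng.integers(0, w - ew + 1)
            out[top:top + eh, left:left + ew] = rng.random((eh, ew, c))  # random pixel values
            return out
    return out
```

Replacing `rng.random((eh, ew, c))` with a fixed per-channel mean would give the mean-value variant mentioned above.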


An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
  • This paper by Dosovitskiy et al. from Google Brain in ICLR 2021 shows that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • Inspired by the Transformer scaling successes in NLP, they experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, they split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. They train the model on image classification in supervised fashion (as shown in the figure below from the paper).
  • They define three ViT configurations (Base, Large, and Huge) and highlight two models, ViT-H/14 and ViT-L/16. The notation is ViT-C/N, where C indicates the model size and N the input patch size; for instance, ViT-L/16 means the “Large” variant with \(16 \times 16\) input patch size.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the proposed Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
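The patch-splitting and linear embedding step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `patchify`, `embed_patches`, and the projection matrix `w_embed` are hypothetical names, and the class token, position embeddings, and the Transformer itself are left out.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) matrix, row-major over the grid."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by the patch size"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(img, patch, w_embed):
    """Linearly project each flattened patch to the model width D; w_embed is (patch*patch*C, D)."""
    return patchify(img, patch) @ w_embed       # (num_patches, D) token sequence
```

For a \(224 \times 224\) RGB image and \(16 \times 16\) patches this yields 196 tokens, each built from 768 raw values.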

RepVGG: Making VGG-style ConvNets Great Again
  • This paper by Ding et al. from Tsinghua University, MEGVII Technology, HKUST, and Aberystwyth University in CVPR 2021 proposes Re-parameterization VGG (RepVGG), a simple but powerful convolutional neural network architecture. At inference time, the model is a plain stack of \(3 \times 3\) convolutions and ReLU, which is especially suitable for GPUs and specialized inference chips, while the training-time model has a multi-branch topology. This decoupling of the training-time and inference-time architectures is realized by a structural re-parameterization technique, hence the name RepVGG.
  • The figure below from the paper shows a sketch of RepVGG architecture. RepVGG has 5 stages and conducts down-sampling via stride-2 convolution at the beginning of a stage. Here, only the first 4 layers of a specific stage are shown. Inspired by ResNet, RepVGG also uses identity and \(1 \times 1\) branches, but only for training.

  • On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model.
  • On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
  • The figure below from the paper shows the Top-1 accuracy on ImageNet vs. actual speed. Left: lightweight and middleweight RepVGG and baselines trained in 120 epochs. Right: heavyweight models trained in 200 epochs. The speed is tested on the same 1080Ti with a batch size of 128, full precision (fp32), single crop, and measured in examples/second. The input resolution is 300 for EfficientNet-B3 and 224 for the others.
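The idea behind structural re-parameterization can be sketched as follows: because convolution is linear, a \(3 \times 3\) branch, a \(1 \times 1\) branch, and an identity branch can be folded into one \(3 \times 3\) kernel. This is a simplified NumPy sketch, not the authors' code: batch normalization and biases are deliberately omitted, and `conv2d_same` and `merge_branches` are hypothetical helper names.

```python
import numpy as np

def conv2d_same(x, k):
    """Stride-1, zero-padded cross-correlation. x: (C_in, H, W); k: (C_out, C_in, kh, kw), odd kh and kw."""
    c_out, c_in, kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(kh):
        for j in range(kw):
            # Contribution of kernel tap (i, j), summed over input channels.
            out += np.einsum("oc,chw->ohw", k[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out

def merge_branches(k3, k1, with_identity=True):
    """Fold a 1x1 branch and (optionally) an identity branch into a single 3x3 kernel."""
    merged = k3.copy()
    merged[:, :, 1, 1] += k1[:, :, 0, 0]        # a 1x1 kernel is a 3x3 kernel that is zero off-centre
    if with_identity:
        c_out, c_in = k3.shape[:2]
        assert c_out == c_in, "identity branch needs matching channel counts"
        merged[:, :, 1, 1] += np.eye(c_out)     # identity is a 1x1 conv with an identity matrix
    return merged
```

With these helpers, `conv2d_same(x, k3) + conv2d_same(x, k1) + x` and `conv2d_same(x, merge_branches(k3, k1))` agree up to floating-point error, which is why the multi-branch training model can be collapsed into a plain stack for inference.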

ArcFace: Additive Angular Margin Loss for Deep Face Recognition
  • Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability.
  • This paper by Deng et al. in TPAMI introduces an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power.
  • Since ArcFace is susceptible to massive label noise, they further propose sub-center ArcFace, in which each class contains \(K\) sub-centers and training samples only need to be close to any of the \(K\) positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces.
  • Based on this self-propelled isolation, they boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, they also explore the inverse problem, mapping feature vectors to face images.
  • Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
  • The figure below from the paper shows the comparisons of Triplet, Tuplet, ArcFace and sub-center ArcFace. Triplet and Tuplet conduct local sample-to-sample comparisons with Euclidean margins within the mini-batch. By contrast, ArcFace and sub-center ArcFace conduct global sample-to-class and sample-to-subclass comparisons with angular margins.

  • The figure below from the paper shows the training of the deep face recognition model with the proposed ArcFace loss \((K=1)\) and sub-center ArcFace loss (e.g. \(K=3\)). Based on an \(\ell_2\) normalization step on both the embedding feature \(\mathbf{x}_i \in \mathbb{R}^{512}\) and all sub-centers \(W \in \mathbb{R}^{512 \times N \times K}\), they get the subclass-wise similarity score \(\mathcal{S} \in \mathbb{R}^{N \times K}\) by a matrix multiplication \(W^T \mathbf{x}_i\). After a max pooling step, they get the class-wise similarity score \(\mathcal{S}^{\prime} \in \mathbb{R}^{N \times 1}\). Afterwards, they apply \(\arccos\) to the target similarity score to get the angle \(\theta_{y_i}\) between the feature \(\mathbf{x}_i\) and the ground truth center \(W_{y_i}\). Then, they add an angular margin penalty \(m\) on the target (ground truth) angle \(\theta_{y_i}\). After that, they calculate \(\cos \left(\theta_{y_i}+m\right)\) and multiply all logits by the feature scale \(s\). Finally, the logits go through the softmax function and contribute to the cross entropy loss.
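The pipeline in the last bullet can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `arcface_logits` and the default values of `s` and `m` are assumptions, and the output is meant to be fed to an ordinary softmax cross-entropy loss.

```python
import numpy as np

def arcface_logits(x, w, labels, s=64.0, m=0.5):
    """x: (B, D) features; w: (D, N, K) sub-centres (K=1 gives plain ArcFace); labels: (B,) ints.

    Returns (B, N) logits. The default s and m are illustrative assumptions.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)    # l2-normalise features
    w = w / np.linalg.norm(w, axis=0, keepdims=True)    # l2-normalise every sub-centre over D
    sub_cos = np.einsum("bd,dnk->bnk", x, w)            # subclass-wise similarity scores
    cos = np.clip(sub_cos.max(axis=2), -1.0, 1.0)       # max pooling over the K sub-centres
    rows = np.arange(len(labels))
    theta = np.arccos(cos[rows, labels])                # angle to the ground-truth centre
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta + m)            # additive angular margin on the target only
    return s * logits                                   # feature scale applied to all logits
```

Setting `m=0` recovers a scaled cosine softmax, which makes the effect of the margin easy to isolate in experiments.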

Do Vision Transformers See Like Convolutional Neural Networks?
  • Given the central role of convolutional neural networks in computer vision breakthroughs (leading to them being the de-facto model for visual data), it is remarkable that Transformer architectures (almost identical to those used in language) are capable of similar performance. For instance, recent work has shown that the Vision Transformer (ViT) model can achieve comparable or even superior performance on image classification tasks. This raises fundamental questions on whether these architectures work in the same way as CNNs: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations?
  • This paper by Raghu et al. from Google Brain in 2021 analyzes the internal representation structure of ViTs and CNNs on image classification benchmarks, and finds striking differences in the features and internal structures between the two architectures, such as ViT having more uniform representations across all layers. They explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information (“earlier global features”), and ViT residual connections, which offer representation propagation of features from lower to higher layers, while also revealing that some CNN properties, e.g. local information aggregation at lower layers, are important to ViTs, being learned from scratch at scale.
  • They also examine the potential for ViTs to be used beyond classification through a study of spatial localization, discovering ViTs successfully preserve input spatial information with CLS tokens, which is promising for future uses in object detection.
  • Finally, they investigate the effect of scale for transfer learning, finding larger ViT models develop significantly stronger intermediate representations through larger pretraining datasets. These results are also very pertinent to understanding recent architectures for vision such as the MLP-Mixer.
BEiT: BERT Pre-Training of Image Transformers
  • This paper by Bao et al. from Microsoft Research in 2021 introduces a self-supervised pre-trained representation model called BEiT, which stands for Bidirectional Encoder representations from Image Transformers. Following BERT developed in the natural language processing area, they propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in their pre-training: image patches (such as 16x16 pixels), whose embeddings are calculated as linear projections of the flattened patches, and visual tokens (i.e., discrete tokens). Before pre-training, they learn a discrete variational autoencoder (dVAE) that acts as an “image tokenizer” via autoencoding-style reconstruction, where the input image is tokenized into discrete visual tokens given by the latent codes of the discrete VAE (the one from DALL-E in Ramesh et al., 2021) according to the learned vocabulary.
  • They show that the proposed method is critical to make BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers. They also present the intriguing property of automatically acquired knowledge about semantic regions, without using any human-annotated data.
  • Similar to the masked language modeling pre-training task of BERT, BEiT randomly masks some image patches and feeds them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
  • After pre-training BEiT, they directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
  • Experimental results on image classification and semantic segmentation show that BEiT achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
  • Code and pretrained models are here.
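The masked image modeling objective can be sketched as follows. This is a minimal NumPy sketch under several assumptions: the visual token ids are taken as given (the tokenizer is not implemented), masking is uniform at random rather than any structured scheme, masked patches are replaced by a shared embedding `mask_emb`, and all function names are hypothetical.

```python
import numpy as np

def random_mask(num_patches, ratio, rng):
    """Boolean mask with int(ratio * num_patches) positions set, chosen uniformly (a simplification)."""
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=int(ratio * num_patches), replace=False)] = True
    return mask

def mask_patches(patch_emb, mask_emb, mask):
    """Replace the embeddings of masked patches with a shared mask embedding (assumed design)."""
    return np.where(mask[:, None], mask_emb[None, :], patch_emb)

def masked_token_loss(logits, token_ids, mask):
    """Mean cross-entropy between predicted and true visual tokens, over masked positions only.

    logits: (N, V) scores over a vocabulary of V visual tokens, e.g. the backbone output.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return nll[mask].mean()
```

In a full pipeline the corrupted embeddings from `mask_patches` would pass through the Transformer backbone, whose output logits are scored by `masked_token_loss`.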

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • This paper by Liu et al. from Microsoft Research in 2021 presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision by producing a hierarchical feature representation and offers a linear computational complexity with respect to input image size. The key element of the Swin Transformer is the shifted window based self-attention.
  • The Swin transformer aims to address the challenges in adapting Transformer from language to vision which arise due to differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, they propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
  • This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including ImageNet image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as COCO object detection (58.7 box AP and 51.1 mask AP on COCO testdev) and ADE20K semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
  • Code and pretrained models are here.
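The windowing scheme can be illustrated with a short NumPy sketch. The helper names are hypothetical, the shift is implemented as a cyclic roll by half a window (an assumption about how the shift is realized), and the attention mask that would stop wrapped-around regions from attending to each other is omitted.

```python
import numpy as np

def window_partition(x, ws):
    """(H, W, C) feature map -> (num_windows, ws*ws, C) non-overlapping windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, ws * ws, c)

def shifted_window_partition(x, ws):
    """Cyclically shift by ws // 2 before partitioning, so new windows straddle the old boundaries."""
    shift = ws // 2
    return window_partition(np.roll(x, (-shift, -shift), axis=(0, 1)), ws)

def attention_pairs(h, w, ws=None):
    """Number of query-key pairs: global attention if ws is None, windowed attention otherwise."""
    n = h * w
    return n * n if ws is None else n * ws * ws
```

`attention_pairs` makes the complexity claim concrete: for a fixed window size the count grows linearly with the number of tokens \(h \cdot w\), whereas global attention grows quadratically.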
CvT: Introducing Convolutions to Vision Transformers
  • This paper by Wu et al. from McGill and Microsoft in 2021 proposes the Convolutional vision Transformer (CvT), which improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs for image recognition tasks.
  • This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization).
  • They validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs.
  • In addition, performance gains are maintained when pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, the CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set.
  • Furthermore, their results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in their model, giving it a potential advantage for adaptation to a wide range of vision tasks requiring variable input resolution. Due to the built-in local context structure introduced by convolutions, CvT no longer requires a position embedding.
  • CvTs thus introduce convolutions into the Vision Transformer architecture to merge the benefits of Transformers with the benefits of CNNs and demonstrate that the introduced convolutional token embedding and convolutional projection, along with the multi-stage design of the network enabled by convolutions, enable CvT to achieve superior performance while maintaining computational efficiency.
  • Code and pretrained models are here.
An Empirical Study of Training Self-Supervised Vision Transformers
  • While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging.
  • This paper by Chen et al. from Facebook AI in ICCV 2021 studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT).
  • They go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. Their comparisons concern several aspects, including ViT vs. convolutional networks, supervised vs. self-supervised, and contrastive learning vs. masked auto-encoding.
  • They observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. They reveal that these results are indeed partial failures, and that they can be improved when training is made more stable.
  • They introduce “MoCo v3”, a framework which offers an incremental improvement of MoCo v1/2, and strives for a better balance between simplicity, accuracy, and scalability. The pseudocode of MoCo v3 is as below:

  • They benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. They discuss the currently positive evidence as well as challenges and open questions.
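As a rough illustration of the contrastive objective in a MoCo-style framework, the following NumPy sketch computes a symmetrized InfoNCE loss between two augmented views and a momentum update for the key encoder. It is a sketch under stated assumptions rather than the paper's pseudocode: the function names, the temperature `tau`, the momentum `m`, and the omission of any loss scaling constant are all choices made for this example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(q, k, tau=0.2):
    """Contrastive loss: q[i] should match k[i] and mismatch every other key in the batch."""
    q, k = l2_normalize(q), l2_normalize(k)
    logits = q @ k.T / tau                              # (N, N) similarity matrix
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # positives sit on the diagonal

def symmetric_loss(q1, q2, k1, k2, tau=0.2):
    """Queries of one view are contrasted against keys of the other view, in both directions."""
    return info_nce(q1, k2, tau) + info_nce(q2, k1, tau)

def momentum_update(key_params, query_params, m=0.99):
    """Key encoder parameters as an exponential moving average of the query encoder (assumed design)."""
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(key_params, query_params)]
```

Here `q1, q2` would come from the query encoder and `k1, k2` from the key encoder applied to the same two views; the keys are treated as constants when differentiating.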
Diffusion Models Beat GANs on Image Synthesis
  • This paper by Dhariwal and Nichol from OpenAI in 2021 shows that diffusion models, a class of likelihood-based models with a stationary training objective, can achieve image sample quality superior to the current state-of-the-art generative models.
  • They achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, they further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier.
  • These guided diffusion models can reduce the sampling time gap between GANs and diffusion models, although diffusion models still require multiple forward passes during sampling. Finally, by combining guidance with upsampling, they can further improve sample quality on high-resolution conditional image synthesis.
  • They achieve an FID of 2.97 on ImageNet \(128 \times 128\), 4.59 on ImageNet \(256 \times 256\), and 7.72 on ImageNet \(512 \times 512\), and match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
  • Finally, they find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet \(256 \times 256\) and 3.85 on ImageNet \(512 \times 512\).
  • Code.
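Classifier guidance can be sketched as a shift of each reverse-step mean along the gradient of a classifier's log-probability for the desired class. The NumPy sketch below uses a toy linear-softmax classifier so that the gradient is available in closed form; the classifier, the function names, and the diagonal-covariance assumption are all illustrative and not taken from the released code.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def classifier_log_prob_grad(x, w, y):
    """Gradient of log p(y | x) with respect to x for a toy classifier with logits w @ x."""
    probs = np.exp(log_softmax(w @ x))
    return w[y] - probs @ w                     # W_y minus the probability-weighted mean row

def guided_mean(mu, var, grad, scale=1.0):
    """Shift the reverse-step mean by scale * var * grad (var: scalar or per-dimension variance)."""
    return mu + scale * var * grad
```

A larger `scale` pushes samples harder toward the target class, which is the diversity-for-fidelity trade-off described above.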
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  • Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity.
  • This paper by Nichol et al. from OpenAI in 2021 explores diffusion models for the problem of text-conditional image synthesis and compares two different guidance strategies: CLIP guidance and classifier-free guidance.
  • They find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking.
  • Additionally, they find that their models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
  • Code.
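Classifier-free guidance needs no separate classifier: the same model is queried with and without the caption, and the two noise predictions are extrapolated. A minimal sketch, assuming the model outputs noise predictions `eps_uncond` and `eps_cond` (hypothetical names) as arrays of the same shape:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Move the prediction away from the unconditional output, toward the conditional one.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 extrapolates beyond it.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

For example, `classifier_free_guidance(e_u, e_c, 3.0)` amplifies the caption-dependent part of the prediction threefold before it is used in the sampling step.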
Multiscale Vision Transformers
  • This paper by Fan et al. from Facebook AI and UC Berkeley presents Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
  • Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
  • They evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters.
  • They further remove the temporal dimension and apply their model for image classification where it outperforms prior work on vision transformers.
  • The figure below from the paper shows that Multiscale Vision Transformers learn a hierarchy from dense (in space) and simple (in channels) to coarse and complex features. Several resolution-channel scale stages progressively increase the channel capacity of the intermediate latent sequence while reducing its length and thereby spatial resolution.

Score-Based Generative Modeling through Stochastic Differential Equations
  • Creating noise from data is easy; creating data from noise is generative modeling.
  • This paper by Song et al. from Stanford and Google Brain in ICLR 2021 presents a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise.
  • Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, they can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. They show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities.
  • In particular, they introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE.
  • They also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, they provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, they achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of \(1024 \times 1024\) images for the first time from a score-based generative model.
  • The figure below from the paper shows that solving a reverse-time SDE yields a score-based generative model. Transforming data to a simple noise distribution can be accomplished with a continuous-time SDE. This SDE can be reversed if they know the score of the distribution at each intermediate time step, \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\).
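For reference, the three processes discussed above can be written in a common notation. The symbols are assumptions of this sketch: \(\mathbf{f}\) is the drift, \(g\) the diffusion coefficient, \(\mathbf{w}\) a standard Wiener process, and \(\bar{\mathbf{w}}\) a Wiener process running backward in time.

```latex
% Forward SDE: slowly injects noise, taking data to the prior
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE: depends on the data only through the score \nabla_x log p_t(x)
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}

% Deterministic ODE with the same marginals p_t as the SDE
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t
```

In practice the score \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\) is replaced by a trained network, and samples are drawn by integrating the second or third equation from the prior back to \(t = 0\).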

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
  • Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged.
  • This paper by Zheng et al. in CVPR 2021 proposes SEgmentation TRansformer (SETR), which uses a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches, and aims to provide an alternative perspective on the segmentation problem by treating semantic segmentation as a sequence-to-sequence prediction task. Thanks to the Transformer self-attention architecture, which models global context in every layer, the encoder can be combined with a simple decoder to provide a powerful segmentation model.
  • Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, they achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
  • The figure below from the paper shows a schematic illustration of the proposed SETR model; (a) They first split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to the standard Transformer encoder. To perform pixel-wise segmentation, they introduce different decoder designs: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

Scaling Vision with Sparse Mixture of Experts
  • Almost all prevalent computer vision networks are “dense,” that is, every input is processed by every parameter.
  • This paper by Riquelme et al. from Google Brain introduces the Vision Mixture of Experts (V-MoE), a novel approach for scaling vision models. The V-MoE is a sparsely activated version of the Vision Transformer (ViT) that demonstrates scalability and competitiveness with larger dense networks in image recognition tasks.
  • The paper proposes a sparse variant of the Vision Transformer (ViT) that uses a mixture-of-experts architecture. This approach routes each image patch to a subset of experts, making it possible to scale up to 15B parameters while matching the performance of state-of-the-art dense models.
  • An innovative extension to the routing algorithm is presented, allowing prioritization of subsets of each input across the entire batch. This adaptive per-image compute leads to a trade-off between performance and computational efficiency during inference.
  • The figure below from the paper shows an overview of the architecture. V-MoE is composed of \(L\) ViT blocks. In some of them, the MLP is replaced with a sparsely activated mixture of MLPs. Each MLP (the expert) is stored on a separate device, and processes a fixed number of tokens. The communication of these tokens between devices is shown in this example, which depicts the case when \(k=1\) expert is selected per token. Here each expert uses a capacity ratio \(C=\frac{4}{3}\): the sparse MoE layer receives 12 tokens per device, but each expert has capacity for 16 \(\left(\frac{16 \cdot 1}{12}=\frac{4}{3}\right)\). Non-expert components of V-MoE such as routers, attention layers and normal MLP blocks are replicated identically across devices.

  • The V-MoE shows impressive scalability, successfully trained up to 15B parameters, and demonstrates strong performance, including 90.35% accuracy on ImageNet.
  • The paper explores the transfer learning abilities of V-MoE, showing its adaptability and effectiveness across different tasks and datasets, even with limited data.
  • A detailed analysis of the V-MoE’s routing decisions and the behavior of its experts is provided, offering insights into the model’s internal workings and guiding future improvements.
  • V-MoE models require less computational resources than dense counterparts, both in training and inference, thanks to their sparsely activated nature and the efficient use of the Batch Prioritized Routing algorithm.
  • The paper concludes with the potential of sparse conditional computation in vision tasks, emphasizing the environmental benefits due to reduced CO2 emissions and the promising directions for future research in large-scale multimodal or video modeling.
  • The paper represents a significant advancement in the field of computer vision, particularly in the development of scalable and efficient vision models.
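The routing and capacity mechanics can be sketched as follows. This NumPy sketch is illustrative only: the function names are hypothetical, `expert_capacity` assumes one expert per device as in the figure's example, and the `prioritize` option (ranking tokens by their largest routing weight before filling expert buffers) is an assumed simplification of batch-prioritized routing.

```python
import numpy as np

def expert_capacity(tokens_per_device, k, capacity_ratio):
    """Buffer slots per expert, assuming one expert per device."""
    return int(round(k * tokens_per_device * capacity_ratio))

def route_top_k(gate_logits, k, capacity, prioritize=False):
    """Send each token to its top-k experts, dropping assignments once an expert is full.

    gate_logits: (T, E). Returns a list of (token, expert, weight) triples.
    """
    z = gate_logits - gate_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # softmax routing weights
    top = np.argsort(-probs, axis=1)[:, :k]                     # top-k experts per token
    if prioritize:
        order = np.argsort(-probs.max(axis=1), kind="stable")   # most confident tokens first
    else:
        order = np.arange(len(probs))                           # plain batch order
    load = np.zeros(gate_logits.shape[1], dtype=int)
    assignments = []
    for rank in range(k):                   # all first choices are placed before second choices
        for t in order:
            e = top[t, rank]
            if load[e] < capacity:
                load[e] += 1
                assignments.append((int(t), int(e), float(probs[t, e])))
    return assignments
```

With the numbers from the figure, `expert_capacity(12, 1, 4 / 3)` gives 16 slots. Lowering the capacity at inference time drops more tokens, and the prioritized order determines which tokens survive.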


A ConvNet for the 2020s
  • This paper by FAIR and UC Berkeley seeks to refute the recent apparent superiority of Transformers by re-examining the design of ConvNets and testing their limitations. The proposed approach is based on gradually modifying a standard ResNet50, following design choices closely inspired by Vision Transformer, to propose a new family of pure ConvNets called ConvNeXt, which can perform as well as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks.
  • The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
  • However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
  • In this paper, the authors reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
  • The authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. They implement a series of design decisions starting with a ResNet50 trained with up-to-date techniques (extending the number of epochs, using AdamW optimizer, Stochastic Depth, Label Smoothing, and so on):
    • Macro Design: The authors considered two aspects of Swin Transformers’ macro design. The first is the number of blocks in each stage (stage compute ratio), which was adjusted from (3, 4, 6, 3) to (3, 3, 9, 3), following the Swin Transformer ratio of (1:1:3:1). The second is the stem cell configuration, which in the original ResNet50 consisted of a 7×7 convolution with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer which utilizes 4×4 non-overlapping convolutions with stride 4. These modifications improved the accuracy to 79.5%.
    • ResNeXt: In this part, the authors adopt two design choices of the popular ResNeXt: depthwise convolutions, which are interestingly similar to self-attention as they work on a per-channel basis, and a higher number of channels (from 64 to 96). These modifications improved the accuracy to 80.5%.
    • Inverted Bottleneck: An essential configuration of Transformers is the expansion-compression rate in the MLP block (the hidden dimension is 4 times higher than the input and output dimension). This feature was reproduced by adding the inverted bottleneck design used in ConvNets (where the input is expanded using \(1 \times 1\) convolutions and then shrunk through depthwise convolution and \(1 \times 1\) convolutions). This modification slightly improved the accuracy to 80.6%.
    • Large kernel sizes: The gold standard in ConvNets since the advent of VGG has been 3×3 kernels. Small kernels lead to the famous local receptive field, which, compared to the global self-attention, has a more limited area of focus. Although Swin Transformers reintroduced the concept of local attention, their window size has always been at least \(7 \times 7\). To explore larger kernels, the first step is to move the depthwise convolution up, before the \(1 \times 1\) expansion convolution, so that this expensive operation runs on fewer channels. This first modification resulted in a temporary degradation to 79.9%, but after experimenting with different kernel sizes, the authors recovered an accuracy of 80.6% with a \(7 \times 7\) kernel (larger values brought no further change in the results).
    • Micro Design: Finally, some micro design choices were added: GELU instead of ReLU, a single activation per block (the original Transformer block has just one activation, inside the MLP), fewer normalization layers, Batch Normalization replaced by Layer Normalization, and separate downsampling layers.
    • These modifications improved the accuracy to 82.0% and defined the final model, named ConvNeXt.
  • A comparison of this architecture with the Swin Transformer and ResNet is shown in the figure below.

  • Based entirely on convolutions, this model competed on par with Transformer-based architectures, achieving a top-1 accuracy of 87.8% on ImageNet classification. Equally excellent results were obtained in other tasks, such as object detection and segmentation on COCO and semantic segmentation on ADE20K.
  • The idea of modernizing ConvNets, adding all the concepts introduced over the past decade to a single model, is payback for convolutions, which have been ignored lately to the benefit of transformers. The authors suggest that ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. They believe the architecture choice should meet the needs of the task at hand while striving for simplicity and efficiency.
  • Code.
Natural Language Descriptions of Deep Visual Features
  • Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible?
  • This paper by Hernandez et al. from MIT, Northeastern and Allegheny College in 2022 proposes MILAN (mutual-information-guided linguistic annotation of neurons), which aims to generate open-ended, compositional, natural language descriptions of individual neurons in deep networks.
  • Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models.
  • They highlight three applications of natural language neuron descriptions.
    • First, they use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
    • Second, they use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
    • Finally, they use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
  • MarkTechPost link.
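  • The PMI-based search described above can be sketched as a re-ranking step over candidate descriptions. This is a minimal sketch, not the authors’ implementation: the candidate strings, their log-probabilities, and the weight `lam` are hypothetical placeholders standing in for the outputs of the two learned models.

```python
def pmi_score(log_p_desc_given_regions, log_p_desc, lam=1.0):
    """Weighted pointwise mutual information between a description and
    the image regions where a neuron is active (lam=1 gives plain PMI)."""
    return log_p_desc_given_regions - lam * log_p_desc


def best_description(candidates, lam=1.0):
    """candidates maps description -> (log p(d | regions), log p(d))."""
    return max(candidates, key=lambda d: pmi_score(*candidates[d], lam=lam))


# Hypothetical log-probabilities, for illustration only.
candidates = {
    "objects": (-1.0, -0.5),    # likely under the captioner, but generic
    "dog faces": (-2.0, -6.0),  # less likely, but far more specific
}
print(best_description(candidates))           # PMI favors the specific string
print(best_description(candidates, lam=0.0))  # likelihood alone favors the generic one
```

  • Subtracting the prior term is what steers the search away from generic descriptions that would fit almost any neuron.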
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
  • Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks.
  • This paper by Goyal et al. in 2022 from FAIR asks whether, using this ability, they can learn any salient and more representative information present in a diverse, unbounded set of images from across the globe. To do so, they train models on billions of random images without any data pre-processing or prior assumptions about what they want the model to learn. This is a very large-scale experiment in which a RegNet architecture scaled to a dense 10 billion parameters (to avoid underfitting on a large data size) is pre-trained using the SwAV self-supervised method on a large collection of 1 billion randomly selected public images from Instagram with a diversity of gender, ethnicity, cultures, and locations (all outside the EU because of GDPR).
  • They achieve state-of-the-art results on a majority of 50 transfer tasks, including fairness, robustness to distribution shift, geographical diversity, fine-grained classification, image copy detection and many image classification datasets. The resulting model not only captures semantic information well, but also captures information about artistic style and learns salient information such as geo-locations and multilingual word embeddings based on visual content only.
  • The key takeaway is that large-scale self-supervised pre-training yields more robust, fair, less harmful, and less biased results than supervised models or models trained on object centric datasets such as ImageNet.
Block-NeRF: Scalable Large Scene Neural View Synthesis
  • This paper by Tancik et al. from UC Berkeley, Waymo and Google Research in 2022 presents Block-NeRF, a variant of Neural Radiance Fields (NeRFs) that can reconstruct large-scale environments.
  • They demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs that can be optimized independently. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.
  • At such a scale, the data collected will necessarily have transient objects and variations in appearance, which they account for by modifying the underlying NeRF architecture to make NeRF robust to data captured over months under different environmental conditions. They add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined.
  • They demonstrate the method’s efficacy by building an entire neighborhood in San Francisco from 2.8M images using a grid of Block-NeRFs, forming the largest neural scene representation to date.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
  • Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation.
  • This paper by Bardes et al. from FAIR and NYU in ICLR 2022 introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
  • VICReg offers a simple approach to self-supervised learning based on a triple objective: learning invariance to different views with an invariance term, avoiding collapse of the representations with a variance preservation term, and maximizing the information content of the representation with a covariance regularization term.
  • VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks, but is not subject to the same limitations as most other methods, particularly because it does not require the embedding branches to be identical or even similar. In addition, they show that incorporating their new variance term into other methods helps stabilize the training and leads to performance improvements.
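  • The triple objective above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors’ code: the function name, the loss weights, the target standard deviation `gamma`, and the choice to average the invariance term over all elements are assumptions made for illustration.

```python
import numpy as np


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """z_a, z_b: (batch, dim) embeddings of two views of the same images."""
    n, d = z_a.shape

    # Invariance: the two views of an image should map to nearby embeddings.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalize any
        # dimension whose spread over the batch falls below gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Push off-diagonal covariance entries toward zero (decorrelation).
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return sim_w * invariance + var_w * variance + cov_w * covariance
```

  • With constant (collapsed) embeddings the invariance and covariance terms are zero, yet the variance term stays large. That is how the collapse is penalized explicitly rather than through architectural tricks.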
Masked Autoencoders Are Scalable Vision Learners
  • Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning. In this study, they observe on ImageNet and in transfer learning that an autoencoder (a simple self-supervised method similar to techniques in NLP) provides scalable benefits. Self-supervised learning in vision may thus now be embarking on a similar trajectory as in NLP.
  • This paper by He et al. from Facebook AI in 2022 shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
  • Their MAE approach is simple: they mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs.
  • First, they develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
  • Second, they note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. The closest analog of a word (or subword) in an image is a pixel, but decomposing the image into patches (as in ViT) reduces the quadratic computation cost of transformers compared to operating at the pixel level. However, ViT and its derived models are infamous for their data appetite and/or training slowness. Instead of attempting to remove objects, they remove random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables them to train large models efficiently and effectively: they accelerate training (by 3x or more) and improve accuracy.
  • As with other autoencoder-style pre-training, the decoder is discarded after pre-training and the encoder is fine-tuned for downstream tasks.
  • Their scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
  • Overall, they observe that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. They hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE.
  • Hugging Face docs
  • The following figure from the paper shows the MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images (full sets of patches) for recognition tasks.
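  • The random masking step that feeds only the visible patches to the encoder can be sketched as follows. This is a minimal NumPy sketch, not the authors’ code: the function name and the (num_patches, patch_dim) layout are assumptions made for illustration.

```python
import numpy as np


def random_masking(patches, mask_ratio=0.75, rng=None):
    """patches: (num_patches, patch_dim). Returns the visible patches,
    their indices, and a boolean mask (True = masked out)."""
    rng = np.random.default_rng() if rng is None else rng
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))

    # Shuffle patch indices and keep the first n_keep of them.
    keep_idx = np.sort(rng.permutation(n)[:n_keep])

    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask


# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
patches = np.zeros((196, 16 * 16 * 3))
visible, keep_idx, mask = random_masking(patches, rng=np.random.default_rng(0))
print(visible.shape, int(mask.sum()))  # 49 visible patches, 147 masked
```

  • Because the encoder sees only a quarter of the patches, most of the compute saving comes from this step. The mask tokens are added later, only for the lightweight decoder.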

The Effects of Regularization and Data Augmentation are Class Dependent
  • Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model’s complexity. Current Deep Networks heavily rely on regularizers such as data augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters.
  • This paper by Balestriero et al. from Facebook AI in 2022 demonstrates that regularization techniques such as DA or weight decay increase the average test performance at the cost of significant performance drops on some specific classes. In other words, regularization produces a model with a reduced complexity that is unfair across classes. By focusing on maximizing aggregate performance statistics, the field has produced learning mechanisms that can be potentially harmful, especially in transfer learning tasks. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes, e.g., on ImageNet with a ResNet50, the “barn spider” classification test accuracy falls from 68% to 46% only by introducing random crop DA during training. Even more surprising, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay.
  • Those results demonstrate that the search for ever-increasing generalization performance (averaged over all classes and samples) has left us with models and regularizers that silently sacrifice performance on some classes. In fact, they also observe that varying the amount of regularization employed during pre-training on a specific dataset impacts the per-class performance of that pre-trained model on different downstream tasks, e.g., an ImageNet pre-trained ResNet50 deployed on iNaturalist sees its performance fall from 70% to 30% on a particular class when introducing random crop DA during the ImageNet pre-training phase. Those results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
  • Here’s an intuitive explanation:
    • Some types of data augmentation and weight decay help some categories but hurt others.
    • Categories largely identifiable by color or texture (e.g., yellow bird, textured mushroom) are unaffected by aggressive cropping, while categories identifiable by shape (e.g., corkscrew) see a performance degradation with aggressive cropping that only contains part of the object.
    • Conversely, color jitter does not affect shape or texture-based categories (e.g., zebra), but affects color-based categories (e.g., basketball).
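  • The gap between average and per-class accuracy is easy to check in practice. The sketch below uses made-up toy labels and predictions (not results from the paper) to show how an improvement in overall accuracy can hide a drop on one class. The function name is an assumption made for illustration.

```python
import numpy as np


def per_class_accuracy(y_true, y_pred, n_classes):
    """Accuracy computed separately for each class (NaN if a class is absent)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.full(n_classes, np.nan)
    for c in range(n_classes):
        sel = y_true == c
        if sel.any():
            acc[c] = np.mean(y_pred[sel] == c)
    return acc


# Toy data, for illustration only: 6 samples of class 0, 4 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pred_baseline = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]     # e.g. no augmentation
pred_regularized = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # e.g. with augmentation

for name, pred in [("baseline", pred_baseline), ("regularized", pred_regularized)]:
    overall = np.mean(np.asarray(pred) == np.asarray(y_true))
    print(name, overall, per_class_accuracy(y_true, pred, n_classes=2))
```

  • In this toy case the overall accuracy rises from 0.7 to 0.8 while the accuracy on class 1 falls from 0.75 to 0.5. This is the kind of effect that aggregate statistics alone would miss.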
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
  • Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. Moreover, many graphics problems rely on task specific data structures to exploit the sparsity or smoothness of the problem at hand.
  • This paper by Müller et al. from Nvidia in 2022 proposes InstantNeRF, which reduces this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations. InstantNeRF offers near-instant training of neural graphics primitives on a single GPU for multiple tasks.
  • To this end, a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Multi-resolution hash encoding provides a practical learning-based alternative that automatically focuses on relevant detail, independent of task at hand. Its low overhead allows it to be used even in time-constrained settings like online training and inference.
  • They evaluate the encoding on four tasks. In the gigapixel image task, a neural network represents an image. In the SDF task, the network learns a signed distance function in 3D space whose zero level-set represents a 2D surface. NeRF uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. Lastly, the neural volume task learns a denoised radiance and density field directly from a volumetric path tracer. In all tasks, their encoding and its efficient implementation provide clear benefits: instant training, high quality, and simplicity. Their encoding is task-agnostic: they use the same implementation and hyperparameters across all tasks and only vary the hash table size, which trades off quality and performance.
  • The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. In the context of neural network input encodings, it is a drop-in replacement, for example speeding up NeRF by several orders of magnitude and matching the performance of concurrent non-neural 3D reconstruction techniques.
  • They leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations.
  • While slow computational processes in any setting, from lightmap baking to the training of neural networks, can lead to frustrating workflows due to long iteration times, they achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080. They have demonstrated that single-GPU training times measured in seconds are within reach for many graphics applications, allowing neural approaches to be applied where previously they may have been discounted.
  • Code.
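  • The multiresolution hash encoding can be sketched in 2D with NumPy. This is a simplified sketch under stated assumptions, not the fused CUDA implementation: the function names, the base resolution `n_min`, and the growth factor are illustrative, and every level is hashed here, including coarse levels where the grid would fit in the table directly.

```python
import numpy as np

# Per-dimension multipliers for the spatial hash (first one is 1).
PRIMES = np.array([1, 2654435761], dtype=np.uint64)


def hash_coords(ij, table_size):
    """Spatial hash of integer grid coordinates: XOR of coordinate * prime,
    taken modulo the table size."""
    ij = ij.astype(np.uint64)
    h = (ij[..., 0] * PRIMES[0]) ^ (ij[..., 1] * PRIMES[1])
    return (h % np.uint64(table_size)).astype(np.int64)


def encode(x, tables, n_min=4, growth=2.0):
    """x: (N, 2) points in [0, 1). tables: one (T, F) feature array per level.
    Returns (N, levels * F) features, bilinearly interpolated per level."""
    feats = []
    for level, table in enumerate(tables):
        res = int(np.floor(n_min * growth ** level))
        pos = x * res
        lo = np.floor(pos).astype(np.int64)
        w = pos - lo  # fractional position inside the grid cell
        out = np.zeros((x.shape[0], table.shape[1]))
        for dx in (0, 1):
            for dy in (0, 1):
                idx = hash_coords(lo + np.array([dx, dy]), table.shape[0])
                wx = w[:, 0] if dx else 1.0 - w[:, 0]
                wy = w[:, 1] if dy else 1.0 - w[:, 1]
                out += (wx * wy)[:, None] * table[idx]
        feats.append(out)
    return np.concatenate(feats, axis=1)


rng = np.random.default_rng(0)
tables = [rng.normal(scale=1e-2, size=(2 ** 10, 2)) for _ in range(4)]
features = encode(rng.random((5, 2)), tables)
print(features.shape)  # 4 levels * 2 features per level
```

  • In training, the table entries would be optimized by gradient descent together with the small network that consumes `features`. Collisions are not resolved explicitly, since the network sees all levels at once and can disambiguate them.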
Pix2seq: A Language Modeling Framework for Object Detection
  • This paper by Chen et al. from Google Brain in ICLR 2022 presents Pix2Seq, a simple yet generic framework for object detection. By casting object detection as a language modeling task conditioned on the observed pixel inputs, Pix2Seq largely simplifies the detection pipeline, removing most of the specialization in modern detection algorithms.
  • Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and they train a neural network to perceive the image and generate the desired sequence.
  • Pix2Seq is based mainly on the intuition that if a neural network knows about where and what the objects are, they just need to teach it how to read them out.
  • Beyond the use of task-specific data augmentations, their approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
  • Pix2Seq can be extended beyond object detection to solving a large variety of vision tasks where the output can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering).
  • A major limitation of Pix2Seq is that autoregressive modeling is expensive for long sequences (mainly during model inference). Practical measures to mitigate the issue include: 1) stopping inference when the ending token is produced (e.g., in the COCO dataset, there are, on average, 7 objects per image, leading to a relatively small number of ∼35 tokens), and 2) applying it to offline inference, or to online scenarios where the objects of interest are relatively sparse (e.g., locating a specific object from a language description).
  • However, future work is needed to make it faster for real-time object detection applications. Another limitation is that the current approach for training Pix2Seq is entirely based on human annotation, and by reducing such dependence and letting the model train using unlabeled data in an unsupervised fashion, they can enable far more applications in the vision domain.
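  • The way object descriptions become discrete tokens can be sketched as follows. This is a minimal sketch, not the authors’ code: the function names, the number of bins, the 0-based bin indexing, the coordinate order, and the vocabulary layout (class tokens after coordinate bins, then an end token) are assumptions made for illustration.

```python
def box_to_tokens(box, class_id, image_size, n_bins=1000):
    """Quantize a (ymin, xmin, ymax, xmax) box in pixels into n_bins
    discrete coordinate tokens, followed by one class token."""
    height, width = image_size

    def quantize(value, extent):
        return min(n_bins - 1, int(value / extent * n_bins))

    ymin, xmin, ymax, xmax = box
    return [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
        n_bins + class_id,  # class tokens live after the coordinate bins
    ]


def objects_to_sequence(objects, image_size, n_classes, n_bins=1000):
    """Concatenate per-object tokens and append an end-of-sequence token."""
    eos = n_bins + n_classes
    tokens = []
    for box, class_id in objects:
        tokens.extend(box_to_tokens(box, class_id, image_size, n_bins))
    return tokens + [eos]


print(box_to_tokens((0, 0, 50, 100), class_id=3, image_size=(100, 200), n_bins=10))
```

  • Each object costs five tokens here, which matches the estimate above of roughly 35 tokens for an image with 7 objects.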
An Improved One millisecond Mobile Backbone
  • Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device.
  • This paper by Vasu et al. from Apple in 2022 performs extensive analysis of different metrics by deploying several mobile friendly networks on a mobile device. They identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks.
  • To this end, they design an efficient backbone, MobileOne, with variants achieving an inference time under 1 ms on an iPhone 12 with 75.9% top-1 accuracy on ImageNet. They show that MobileOne achieves state-of-the-art performance among efficient architectures while being many times faster on mobile.
  • A MobileOne block has two different structures at train time and test time, inspired by RepVGG: Making VGG-style ConvNets Great Again. In the figure from the paper, the left panel shows the train-time MobileOne block with reparameterizable branches, and the right panel shows the MobileOne block at inference, where the branches are reparameterized. Either ReLU or SE-ReLU is used as activation. The trivial over-parameterization factor \(k\) is a hyperparameter which is tuned for every variant.

  • Their best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. MobileOne obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, they show that their model generalizes to multiple tasks – image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
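  • The train-time versus inference-time structure relies on the fact that parallel linear branches can be folded into a single convolution. The sketch below shows this for a RepVGG-style block with a 3×3 branch, a 1×1 branch, and an identity skip. It is a simplified illustration, not the MobileOne block itself: batch normalization folding, depthwise/pointwise splitting, and the \(k\) over-parameterized branches are omitted, and the function names are assumptions.

```python
import numpy as np


def conv2d(x, k):
    """Naive 'same' convolution (cross-correlation), stride 1.
    x: (C_in, H, W), k: (C_out, C_in, K, K) with odd K."""
    c_out, c_in, kh, kw = k.shape
    pad = kh // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(kh):
        for j in range(kw):
            # Contribution of kernel tap (i, j) to every output pixel.
            out += np.einsum("oc,chw->ohw", k[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out


def merge_branches(k3, k1):
    """Fold a 3x3 branch, a 1x1 branch and an identity skip into one 3x3 kernel."""
    merged = k3.copy()
    merged[:, :, 1, 1] += k1[:, :, 0, 0]     # 1x1 kernel sits at the center tap
    channels = np.arange(k3.shape[0])
    merged[channels, channels, 1, 1] += 1.0  # identity = centered delta per channel
    return merged


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6))
k3 = rng.normal(size=(4, 4, 3, 3))
k1 = rng.normal(size=(4, 4, 1, 1))

train_out = conv2d(x, k3) + conv2d(x, k1) + x  # three branches at train time
infer_out = conv2d(x, merge_branches(k3, k1))  # one convolution at inference
print(np.allclose(train_out, infer_out))
```

  • The merged kernel produces the same output with a single branch, which is why the extra train-time capacity adds no latency on the device.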
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  • This paper by Saharia et al. from Google Brain in 2022 presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen showcases the effectiveness of frozen large pretrained language models as text encoders for text-to-image generation using diffusion models.
  • Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. With these novel components, Imagen produces \(1024 \times 1024\) samples with unprecedented photorealism and alignment with text.
  • Their key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
  • Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.
  • To assess text-to-image models in greater depth, they introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, they compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
  • Google page with an overview of the results.
Swin Transformer V2: Scaling Up Capacity and Resolution
  • Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings.
  • This paper by Liu et al. from Microsoft Research in 2022 explores large-scale models in computer vision. They tackle three major issues in the training and application of large vision models: training instability, resolution gaps between pre-training and fine-tuning, and hunger for labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images.
  • Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to \(1,536 \times 1,536\) resolution.
  • By scaling up capacity and resolution, Swin V2 sets new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also, note that their training is much more efficient than that of Google’s billion-level visual models, consuming 40 times less labelled data and 40 times less training time.
  • The diagram below from the paper presents the techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to \(1,536 \times 1,536\) resolution, including the res-post-norm and scaled cosine attention, which make the model easier to scale up in capacity, as well as a log-spaced continuous relative position bias approach, which lets the model be transferred more effectively across window resolutions.
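  • Two of these techniques are compact enough to sketch in NumPy: scaled cosine attention, which replaces the dot product with a cosine similarity divided by a temperature, and log-spaced relative coordinates. This is a minimal sketch, not the authors’ code: the function names are assumptions, and the temperature `tau` (learnable in the model) is fixed here for illustration.

```python
import numpy as np


def scaled_cosine_logits(q, k, tau=0.1, bias=0.0):
    """Attention logits as cosine similarity / tau + relative position bias.
    q: (N, d), k: (M, d). Logits are bounded regardless of the norms of q, k."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return qn @ kn.T / tau + bias


def log_spaced(delta):
    """Map linear relative coordinates to log-spaced ones:
    sign(d) * log(1 + |d|), which compresses large offsets."""
    delta = np.asarray(delta, dtype=float)
    return np.sign(delta) * np.log1p(np.abs(delta))


rng = np.random.default_rng(0)
q, k = rng.normal(size=(3, 8)) * 1000.0, rng.normal(size=(5, 8))
print(np.abs(scaled_cosine_logits(q, k)).max() <= 1.0 / 0.1 + 1e-9)
print(log_spaced([-15, -7, 0, 7, 15]))
```

  • Bounding the logits keeps a few very large activations from dominating the softmax. The log mapping shrinks the range of offsets that must be extrapolated when the window size grows between pre-training and fine-tuning.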

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  • This paper by Yu et al. from Google Research in 2022 presents the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. In particular, Parti is able to represent a broad range of visual world knowledge, such as landmarks, specific years, makes and models of vehicles, pottery types, visual styles – and integrate these into novel settings and configurations.
  • Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes.
  • Their approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
  • Second, they achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO.
  • Their detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects.
  • They also provide an extensive discussion of the limitations, including a breakdown of many kinds of model errors and challenges, that they hope will be useful both for contextualizing what the model can do and for highlighting opportunities for future research.
  • Parti opens up opportunities to integrate scaled autoregressive models with diffusion models, starting with having an autoregressive model generate an initial low-resolution image and then iteratively refining and super-resolving images with diffusion modules. Furthermore, the authors suggest conducting more experiments and comparisons with both autoregressive and diffusion models in order to understand their relative capabilities, to address key questions of fairness and bias in both classes of models and strategies for mitigating them, and to identify optimal opportunities for combining their strengths.
  • Key takeaways:
    • One of the most exciting research fields nowadays is text-to-image modeling. OpenAI’s DALL-E 2 and Google’s Imagen are phenomenal models in this area. Both use a Transformer to encode the text and diffusion models to generate the image. Google’s Parti, in contrast, consists solely of (really big) Transformer modules:
      • Text encoder: as with previous works, encoding the text with a Transformer is a no-brainer.
      • Image tokenizer and de-tokenizer: instead of generating the entire image, Parti will generate one patch at a time. A ViT-based module is used to encode and decode those patches.
      • Conditional decoder: conditioned on the encoded text and the tokenized image patches generated so far, a Transformer is used to generate the next patch (with the help of the de-tokenizer from the previous step).
  • Google page.
  • Code.

Sequencer: Deep LSTM for Image Classification
  • In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision.
  • This paper by Tatsunami and Taki from Rikkyo University, Japan in NeurIPS 2022 proposes Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers.
  • They also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K.
  • Of note, the overall data appetite and time to converge were reported to be much better than those of ViT and its cousins, since CNNs and LSTMs have great sample efficiency. The paper also shows that Sequencer has good transferability and robust resolution adaptability across a double resolution band.

High-Resolution Image Synthesis with Latent Diffusion Models
  • The following paper summary has been contributed by Zhibo Zhang.
  • Diffusion models are known to be computationally expensive, given that they require many diffusion and denoising steps in possibly high-dimensional input feature spaces.
  • This paper by Rombach et al. from Ludwig Maximilian University of Munich & IWR, Heidelberg University and Runway ML in CVPR 2022 introduces diffusion models that operate on the latent space, aiming at generating high-resolution images with lower computation demands compared to those that operate directly on the pixel space.
  • In particular, the authors adopted an autoencoder that compresses the input images into a lower dimensional latent space. The autoencoder relies on either KL regularization or VQ regularization to constrain the variance of the latent space.
  • As shown in the illustration figure below by Rombach et al., in the latent space, the latent representation of the input image goes through a total of \(T\) diffusion operations to get the noisy representation. A U-Net is then applied on top of the noisy representation for \(T\) iterations to produce the denoised version of the representation. In addition, the authors introduced a cross attention mechanism to condition the denoising process on other types of inputs such as text and semantic maps.
  • In the final stage, the denoised representation will be mapped back to the pixel space using the decoder to get the synthesized image.
  • Empirically, the best performing latent diffusion model (with a carefully chosen downsampling factor) achieved competitive FID scores in image generation when comparing with a few other state-of-the-art generative models such as variations of generative adversarial nets on a few datasets including the CelebA-HQ dataset.
  • Code
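  • As a minimal sketch of the forward (noising) half of this pipeline, the snippet below applies the closed-form DDPM corruption to a latent tensor. The linear beta schedule, the latent shape, and the function names are illustrative assumptions; a random array stands in for the encoder output, and no U-Net or decoder is modeled.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def diffuse_latent(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) in closed form; returns (z_t, eps)."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(T=1000)
z0 = rng.standard_normal((4, 32, 32))   # stand-in for a latent produced by the encoder
z_t, eps = diffuse_latent(z0, t=999, alpha_bar=alpha_bar, rng=rng)
```

  • During training, a denoiser would be asked to predict `eps` from `z_t` (plus any conditioning such as text); that network is outside the scope of this sketch.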

Make-A-Video: Text-to-Video Generation without Text-Video Data
  • This paper by Singer et al. from Meta AI in 2022 proposes Make-A-Video – an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
  • Their intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models. They design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
  • First, they decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, they design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V.
  • In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
Grounded Language-Image Pre-training
  • This paper by Li et al. from UCLA, MS Research, UW, UW-Madison, etc. in 2022 presents a grounded language-image pretraining (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training.
  • The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In their experiments, they pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs.
  • The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.
  • When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
  • After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA.
  • When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.
  • Code.
Denoising Diffusion Implicit Models
  • Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample.
  • This paper by Song et al. in ICLR 2021 from Ermon’s lab at Stanford presents denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs to accelerate the sampling process.
  • In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. DDIMs construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from.
  • They empirically demonstrate that DDIMs can produce high quality samples 10x to 50x faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.
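  • The sampling update can be sketched as follows. This is a minimal NumPy illustration of one DDIM step, where `alpha_bar` denotes the cumulative noise-schedule product and `eps_pred` stands for the output of a trained noise predictor (here replaced by the true noise, purely for illustration). Function and variable names are assumptions, not the authors' code.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0, rng=None):
    """One DDIM update from timestep t to an earlier timestep.

    eta=0 gives the deterministic sampler; larger eta injects more noise.
    """
    # Predicted clean sample implied by the current noisy sample and noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    sigma = (eta * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
             * np.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))
    # Component pointing back towards x_t.
    direction = np.sqrt(1.0 - alpha_bar_prev - sigma**2) * eps_pred
    noise = 0.0
    if sigma > 0:
        rng = np.random.default_rng() if rng is None else rng
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(alpha_bar_prev) * x0_pred + direction + noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
a_t, a_prev = 0.2, 0.6                        # the two timesteps need not be adjacent
x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)     # oracle noise prediction
```

  • Because the update only needs the two `alpha_bar` values, a sampler can walk over a short sub-sequence of timesteps instead of all of them, which is where the speed-up comes from.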
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
  • This paper by Dong et al. from the University of Science and Technology of China, Microsoft Research Asia, and Microsoft Cloud in 2022 presents the CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
  • A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, they develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width.
  • They provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost.
  • They also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks.
  • Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting.
  • By further pretraining on the larger dataset ImageNet-21K, they achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU.
  • The following figure from the paper illustrates different self-attention mechanisms; CSWin differs fundamentally in two aspects. First, they split the multi-heads (\(\{h_1, \ldots, h_K\}\)) into two groups and perform self-attention in horizontal and vertical stripes simultaneously. Second, they adjust the stripe width according to the depth of the network, which can achieve a better trade-off between computation cost and capability.

  • The following figure from the paper shows: (Left) the overall architecture of their proposed CSWin Transformer; (Right) the illustration of CSWin Transformer block.

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
  • This paper by Li et al. from Facebook AI Research and UC Berkeley studies Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection.
  • They present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
  • They instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work.
  • They further compare MViTv2s’ pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.
  • The following figure from the paper shows the improved Pooling Attention mechanism, which incorporates decomposed relative position embedding, \(R_{p(i), p(j)}\), and residual pooling connection modules in the attention block.

  • The following figure from the paper illustrates MViTv2 as a multiscale transformer with state-of-the-art performance across three visual recognition tasks.


Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
  • Vision Transformers (ViTs) have dominated several tasks in computer vision. While architecturally simple, their accuracy and ability to scale make them still a popular choice today. Moreover, their simplicity unlocks the use of powerful pretraining strategies such as MAE, which make ViTs computationally and data efficient to train. However, this simplicity comes at a cost: by using the same spatial resolution and number of channels throughout the network, ViTs make inefficient use of their parameters. This is in contrast to prior “hierarchical” or “multi-scale” models (such as AlexNet or ResNet), which use fewer channels but higher spatial resolution in early stages with simpler features, and more channels but lower spatial resolution later in the model with more complex features.
  • While several domain-specific hierarchical vision transformers have been introduced (such as Swin or MViT), they have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts.
  • This paper by Ryali et al. from Facebook AI Research, Georgia Tech, and Johns Hopkins University argues that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), they can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy.
  • In the process, they create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. They evaluate Hiera on a variety of tasks for image and video recognition.
  • The following figure from the paper illustrates Hiera.

Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust
  • Watermarking the outputs of generative models is a crucial technique for tracing copyright and preventing potential harm from AI-generated content.
  • This paper by Wen et al. from the University of Maryland introduces a novel technique called Tree-Ring Watermarking that robustly fingerprints diffusion model outputs.
  • Unlike existing methods that perform post-hoc modifications to images after sampling, Tree-Ring Watermarking subtly influences the entire sampling process, resulting in a model fingerprint that is invisible to humans. The watermark embeds a pattern into the initial noise vector used for sampling.
  • These patterns are structured in Fourier space so that they are invariant to perturbations such as convolutions, crops, dilations, flips, and rotations. After image generation, the watermark signal is detected by inverting the diffusion process to retrieve the noise vector, which is then checked for the embedded signal.
  • They demonstrate that this technique can be easily applied to arbitrary diffusion models, including text-conditioned Stable Diffusion, as a plug-in with negligible loss in FID. Their watermark is semantically hidden in the image space and is far more robust than watermarking alternatives that are currently deployed.
  • The following figure from the paper illustrates the pipeline for Tree-Ring Watermarking. A diffusion model generation is watermarked and later detected through ring-patterns in the Fourier space of the initial noise vector.

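  • The embedding and checking of a ring pattern can be sketched in NumPy as below. This is only an illustration under strong simplifications: the key is a single constant written into one annulus of the spectrum, and detection compares against the noise directly instead of inverting a diffusion model. The key value, radii, and function names are all assumptions.

```python
import numpy as np

def ring_mask(size, r_in, r_out):
    """Boolean mask selecting an annulus around the centre of a shifted spectrum."""
    yy, xx = np.mgrid[:size, :size]
    r = np.hypot(yy - size // 2, xx - size // 2)
    return (r >= r_in) & (r < r_out)

def embed_key(noise, mask, key_value):
    """Overwrite the masked Fourier coefficients of the noise with a real constant key."""
    spec = np.fft.fftshift(np.fft.fft2(noise))
    spec[mask] = key_value
    # A real key on a point-symmetric mask keeps the spectrum Hermitian,
    # so the inverse transform is real up to rounding error.
    return np.fft.ifft2(np.fft.ifftshift(spec)).real

def key_distance(noise, mask, key_value):
    """Mean absolute deviation of the masked Fourier coefficients from the key."""
    spec = np.fft.fftshift(np.fft.fft2(noise))
    return float(np.abs(spec[mask] - key_value).mean())

rng = np.random.default_rng(0)
size, key = 64, 50.0
mask = ring_mask(size, r_in=4, r_out=10)
plain = rng.standard_normal((size, size))
marked = embed_key(plain, mask, key)
perturbed = marked + 0.1 * rng.standard_normal((size, size))

d_plain = key_distance(plain, mask, key)
d_marked = key_distance(marked, mask, key)
d_perturbed = key_distance(perturbed, mask, key)
```

  • A detector would threshold this distance: it stays small for the watermarked noise, even after a mild additive perturbation, and is large for unmarked noise.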
From Sparse to Soft Mixtures of Experts
  • Sparse Mixture of Experts (MoE) architectures scale model capacity without large increases in training or inference costs. MoE allows us to dramatically scale model sizes without significantly increasing inference latency. In short, each “expert” can separately attend to a different subset of tasks via different data subsets before they are combined via an input routing mechanism. Thus, the model can learn a wide variety of tasks, but still specialize when appropriate. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning.
  • This paper by Puigcerver et al. from Google DeepMind proposes Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs.
  • Some extra-large models use Sparse MoE under the hood (published examples from Google include Switch Transformer and GLaM, and OpenAI’s GPT-4 is rumored to as well). Sparse MoE suffers from training instabilities because its routing is not fully differentiable. Soft MoE replaces the non-differentiable expert routing with a differentiable layer. The end-to-end model is fully differentiable again, can be trained with ordinary SGD-like optimizers, and the training instabilities go away.
  • Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoE works, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity at lower inference cost.
  • The following figure from the paper illustrates the main differences between Sparse and Soft MoE layers. While the router in Sparse MoE layers (left) learns to assign individual input tokens to each of the available slots, in Soft MoE layers (right) each slot is the result of a (different) weighted average of all the input tokens. Learning to make discrete assignments introduces several optimization and implementation issues that Soft MoE sidesteps.

  • They propose a fully-differentiable sparse vision transformer (ViT) that addresses aforementioned challenges such as training instability, token dropping, and inefficient finetuning. In the context of visual recognition, Soft MoE greatly outperforms the standard ViT and popular MoE variants (Tokens Choice and Experts Choice). Soft MoE scales ViT models to >50B parameters with little effect on inference latency. For example, Soft MoE-Base/16 requires 10.5x lower inference cost (5.7x lower wall-clock time) than ViT-Huge/14 while matching its performance after similar training. Soft MoE also scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, while inference time cost grows by only 2%, and it performs substantially better.
  • The following figure from the paper illustrates the Soft MoE routing algorithm. Soft MoE first computes scores or logits for every pair of input token and slot, based on some learnable per-slot parameters. These logits are then normalized per slot (columns) and every slot computes a linear combination of all the input tokens based on these weights (in green). Each expert (an MLP in this work) then processes its slots (e.g. 2 slots per expert, in this diagram). Finally, the same original logits are normalized per token (i.e., by row) and used to combine all the slot outputs, for every input token (in blue). Dashed boxes represent learnable parameters.

  • The following infographic (source) presents an overview of their results:

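  • The routing steps described above (per-slot dispatch weights, per-expert processing of slots, per-token combine weights) can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: the experts are single tanh layers standing in for MLPs, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, Phi, experts, slots_per_expert):
    """X: (n_tokens, d) tokens; Phi: (d, n_slots) per-slot parameters;
    experts: list of callables mapping (slots_per_expert, d) -> (slots_per_expert, d)."""
    logits = X @ Phi                      # one score per (token, slot) pair
    dispatch = softmax(logits, axis=0)    # normalised per slot (over tokens, i.e. columns)
    combine = softmax(logits, axis=1)     # normalised per token (over slots, i.e. rows)
    slots = dispatch.T @ X                # each slot is a weighted average of all tokens
    outs = [f(slots[i * slots_per_expert:(i + 1) * slots_per_expert])
            for i, f in enumerate(experts)]
    slot_outputs = np.concatenate(outs, axis=0)
    return combine @ slot_outputs, dispatch, combine

rng = np.random.default_rng(0)
n_tokens, d, n_experts, slots_per_expert = 6, 8, 3, 2
X = rng.standard_normal((n_tokens, d))
Phi = rng.standard_normal((d, n_experts * slots_per_expert))
weights = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda s, W=W: np.tanh(s @ W) for W in weights]   # stand-in experts
Y, dispatch, combine = soft_moe_layer(X, Phi, experts, slots_per_expert)
```

  • Every operation here is a matrix product or a softmax, so gradients flow to `Phi` and to all experts; there is no discrete top-k selection and no token is dropped.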
Estimating Example Difficulty using Variance of Gradients
  • The paper titled “Estimating Example Difficulty using Variance of Gradients” by Chirag Agarwal, Daniel D’souza, and Sara Hooker, published on arXiv in June 2022, introduces the concept of Variance of Gradients (VoG) as a metric to determine the difficulty of examples for machine learning models. The main highlights and contributions of the paper are as follows:
  • The authors propose VoG as an efficient metric to rank data by difficulty, enabling the identification of the most challenging examples for more focused human-in-the-loop auditing. The metric is particularly adept at identifying data points that are difficult for the model to learn, often correlating with corrupted or memorized examples.
  • The following \(5 \times 5\) grid from the paper shows the top-25 Cifar-10 and Cifar-100 training-set images with the lowest and highest VoG scores in the Early (a) and Late (b) training stages, respectively, of two randomly chosen classes. Lower VoG images evidence uncluttered backgrounds (for both apple and plane) in the Late training stage. VoG also appears to capture a color bias present during the Early training stage for apple (red). The high-VoG images in the Late training stage present unusual vantage points, with images where the frame is zoomed in on the object of interest.

  • The study demonstrates the effectiveness of VoG across multiple architectures and datasets, including Cifar-10, Cifar-100, and ImageNet. VoG is shown to identify clusters of images with distinct semantic properties, where high VoG scores often align with images having cluttered backgrounds and atypical vantage points. The method also proves effective in surfacing memorized examples and provides insights into the learning cycle of the model.
  • An extensive evaluation of VoG’s utility as an auditing tool is conducted. This includes qualitative analysis of images with high and low VoG scores, demonstrating a correlation between VoG scores and the distinct visual properties of images. It’s observed that images with low VoG scores typically have uncluttered backgrounds and more prototypical views, whereas high VoG scores are associated with more challenging images. The study also shows that test set errors increase with higher VoG scores, especially in more complex datasets.
  • The stability of the VoG ranking is confirmed, which is crucial for building trust with users. The method produces consistent rankings across different training runs, demonstrating negligible deviation in VoG scores across samples. This stability is observed for both Cifar-10 and Cifar-100 datasets.
  • VoG’s role as an unsupervised auditing tool is highlighted, showing its capability to produce reliable rankings even without labels at test time. This feature is particularly valuable for datasets where obtaining labels for protected attributes is infeasible or intrusive.
  • The paper delves into VoG’s understanding of early and late training dynamics, revealing that VoG scores can capture different aspects of learning at various stages of training. For instance, during early training, higher VoG scores correlate with lower average error rates, while in the later stages, this trend reverses.
  • VoG is evaluated for its ability to identify memorized examples and its effectiveness as an Out-of-Distribution (OoD) detection tool. It is found to be discriminative in distinguishing between memorized and non-memorized examples. Additionally, when compared to other OoD detection methods, VoG outperforms most, highlighting its efficacy and scalability for large datasets and complex network architectures like ResNet-50.
  • In conclusion, the paper emphasizes the value of VoG in ranking data by difficulty and its utility in identifying challenging examples for auditing. Its domain-agnostic nature and ability to work with training and test examples make it a versatile tool in the realm of machine learning interpretability and model auditing.
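  • The scoring idea can be sketched as follows, assuming the per-example input gradients at several training checkpoints have already been computed and stacked into an array. The sketch uses the plain variance across checkpoints, averaged over pixels and then standardized within each class; the paper's exact normalization may differ, and the function name and array layout are assumptions.

```python
import numpy as np

def vog_scores(grads, labels):
    """grads: (K checkpoints, N examples, P pixels) input gradients; labels: (N,).

    Returns (raw, normalized): the raw per-example score and its class-standardized version.
    """
    per_pixel_var = grads.var(axis=0)     # variance across checkpoints, shape (N, P)
    raw = per_pixel_var.mean(axis=1)      # average over pixels, shape (N,)
    normalized = np.empty_like(raw)
    for c in np.unique(labels):
        idx = labels == c
        std = raw[idx].std()
        normalized[idx] = (raw[idx] - raw[idx].mean()) / (std if std > 0 else 1.0)
    return raw, normalized

# Synthetic gradients: example i fluctuates across checkpoints with amplitude scales[i].
rng = np.random.default_rng(0)
K, N, P = 5, 4, 64
scales = np.array([0.0, 0.5, 1.0, 2.0])
base = rng.standard_normal((N, P))
grads = base[None] + scales[None, :, None] * rng.standard_normal((K, N, P))
labels = np.array([0, 0, 1, 1])
raw, normalized = vog_scores(grads, labels)
ranking = np.argsort(raw)                 # easiest (most stable gradients) first
```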



Long Short-Term Memory
  • Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.
  • This paper by Hochreiter and Schmidhuber in Neural Computation 1997 briefly reviews Hochreiter’s (1991) analysis of this problem, then addresses it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.
  • LSTM is local in space and time; its computational complexity per time step and weight is \(O(1)\). Their experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations.
  • In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
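  • The constant error carousel can be illustrated with a minimal NumPy memory-cell step. The sketch follows the 1997 formulation in having only input and output gates (no forget gate) and a cell state with a fixed self-connection of weight 1; using tanh for the squashing functions, and all parameter names, are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm1997_step(x, h_prev, s_prev, p):
    """One step of a memory-cell layer with input and output gates only."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["W_in"] @ z + p["b_in"])     # input gate: may new content be written?
    o = sigmoid(p["W_out"] @ z + p["b_out"])   # output gate: may the cell be read?
    g = np.tanh(p["W_c"] @ z + p["b_c"])       # candidate cell input
    s = s_prev + i * g                         # carousel: state carried over with weight 1
    h = o * np.tanh(s)
    return h, s

rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
shape = (n_cell, n_in + n_cell)
params = {
    "W_in": np.zeros(shape), "b_in": np.full(n_cell, -50.0),   # input gate held shut
    "W_out": rng.standard_normal(shape), "b_out": np.zeros(n_cell),
    "W_c": rng.standard_normal(shape), "b_c": np.zeros(n_cell),
}
s0 = rng.standard_normal(n_cell)
h, s = np.zeros(n_cell), s0.copy()
for _ in range(1000):
    h, s = lstm1997_step(rng.standard_normal(n_in), h, s, params)
```

  • With the input gate closed, the cell state after 1000 noisy steps is still essentially `s0`, which is the storage behavior that lets error signals survive long time lags.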


A Neural Probabilistic Language Model
  • This paper by Bengio et al. from the University of Montreal in 2003 revolutionized statistical language modeling by replacing “tables of conditional probabilities” (n-gram language models) with more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables.
  • The traditional technique of learning the joint probability function of sequences of words in a language was intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
  • They propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential/combinatorial number of semantically neighboring sentences, which forms the main reason for the spectacular improvements the proposed approach offers. The model learns simultaneously (i) a distributed representation for each word along with (ii) the probability function for word sequences, expressed in terms of these representations.
  • Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.
  • They report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it can take advantage of longer contexts.
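  • A forward pass of such a model can be sketched as below: the context words are looked up in a shared embedding table, concatenated, and mapped to a softmax over the vocabulary through a tanh hidden layer plus direct connections. Sizes and parameter names are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

def nplm_probs(context_ids, C, H, d, U, W, b):
    """Next-word distribution given the ids of the previous words."""
    x = C[context_ids].reshape(-1)        # concatenated word feature vectors
    hidden = np.tanh(d + H @ x)
    logits = b + W @ x + U @ hidden       # W holds the optional direct connections
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
V, m, n_ctx, n_hidden = 10, 4, 2, 5       # vocab size, embedding dim, context length, hidden units
C = rng.standard_normal((V, m))
C[5] = C[3]                               # pretend words 3 and 5 learned the same representation
H = rng.standard_normal((n_hidden, n_ctx * m))
d = np.zeros(n_hidden)
U = rng.standard_normal((V, n_hidden))
W = rng.standard_normal((V, n_ctx * m))
b = np.zeros(V)

p_seen = nplm_probs([1, 3], C, H, d, U, W, b)
p_unseen = nplm_probs([1, 5], C, H, d, U, W, b)
```

  • Because words 3 and 5 share a representation, the two contexts yield identical predictions; this is the mechanism by which an unseen word sequence inherits probability mass from a similar seen one.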


ROUGE: A Package for Automatic Evaluation of Summaries
  • This paper by Lin in ACL 2004 introduces Recall-Oriented Understudy for Gisting Evaluation (ROUGE).
  • It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S), which are included in the ROUGE summarization evaluation package, along with their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
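  • Two of these measures can be sketched in plain Python for the single-reference case: ROUGE-N as clipped n-gram recall and ROUGE-L recall based on the longest common subsequence (LCS). This is a simplified illustration (whitespace tokenization, no stemming, one reference), not the official package; the function names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "police killed the gunman".split()
cand_a = "police kill the gunman".split()
cand_b = "the gunman kill police".split()

r2_a, r2_b = rouge_n_recall(cand_a, reference, 2), rouge_n_recall(cand_b, reference, 2)
rl_a, rl_b = rouge_l_recall(cand_a, reference), rouge_l_recall(cand_b, reference)
```

  • Both candidates share exactly one bigram with the reference, so ROUGE-2 cannot separate them, while ROUGE-L prefers `cand_a` because it preserves more of the reference word order.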


METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
  • This paper by Banerjee and Lavie in ACL 2005 introduces METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.
  • Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies.
  • Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
  • They evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality.
  • They compute the Pearson \(R\) correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets.
  • They perform segment-by-segment correlation, and show that METEOR gets an \(R\) correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement over simply using unigram-precision, unigram-recall, and their harmonic F1 combination. They also perform experiments to show the relative contributions of the various mapping modules.
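  • The scoring step can be sketched in Python as below, using the original formulation: a recall-weighted harmonic mean \(F_{mean} = 10PR/(R + 9P)\) scaled by a fragmentation penalty \(0.5 \cdot (\text{chunks}/\text{matches})^3\). The sketch is deliberately simplified: it matches surface forms only (no stemming or synonyms) and builds the alignment greedily rather than searching for the alignment with the fewest crossings, so it should be read as an approximation.

```python
def meteor_exact(hypothesis, reference):
    """Simplified METEOR-style score with exact unigram matching and greedy alignment."""
    used = [False] * len(reference)
    alignment = []                         # (hypothesis index, reference index) pairs
    for i, word in enumerate(hypothesis):
        for j, ref_word in enumerate(reference):
            if not used[j] and ref_word == word:
                used[j] = True
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a run of matches that is contiguous in both strings.
    chunks = 1 + sum(
        1 for (i0, j0), (i1, j1) in zip(alignment, alignment[1:])
        if not (i1 == i0 + 1 and j1 == j0 + 1)
    )
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

reference = "the cat sat down".split()
score_same = meteor_exact("the cat sat down".split(), reference)
score_shuffled = meteor_exact("sat down the cat".split(), reference)
score_none = meteor_exact("a dog barked".split(), reference)
```

  • The shuffled hypothesis has perfect unigram precision and recall but splits into two chunks, so the fragmentation penalty lowers its score relative to the identical sentence.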


Recurrent neural network based language model
  • This paper by Mikolov et al. from Khudanpur’s lab at JHU in Interspeech 2010 was the first to propose using a recurrent neural network-based language model (RNN LM) with applications to speech recognition.
  • The results indicate that it is possible to obtain around 50% reduction of perplexity (PPL) by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, and 12% even when the backoff model is trained on 5 times more data than the RNN model. For NIST RT05, they conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models, which are trained on hundreds of times more data.
  • They provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except for their high computational (training) complexity. Recurrent neural networks significantly outperformed state-of-the-art backoff models in all of the experiments, most notably even when the backoff models were trained on much more data than the RNN LMs.
  • The paper seeks to break the myth that language modeling is just about counting n-grams, and that the only reasonable way to improve results is by acquiring new training data.
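  • As a small illustration, the snippet below sketches a simple recurrent LM step (sigmoid state update, softmax output), the perplexity measure, and linear interpolation of two models' probabilities. The weights are random, the probability values are made up to show the effect, and all names are assumptions.

```python
import numpy as np

def rnn_lm_step(word_id, s_prev, U, W, V):
    """New hidden state from the current word and previous state, then a next-word softmax."""
    # Multiplying U by a one-hot word vector is the same as selecting a column of U.
    s = 1.0 / (1.0 + np.exp(-(U[:, word_id] + W @ s_prev)))
    logits = V @ s
    p = np.exp(logits - logits.max())
    return s, p / p.sum()

def perplexity(probs):
    """Perplexity from the probabilities a model assigned to the observed words."""
    return float(np.exp(-np.mean(np.log(probs))))

def interpolate(p_a, p_b, lam=0.5):
    """Linear mixture of two models' probabilities for the same words."""
    return lam * np.asarray(p_a) + (1 - lam) * np.asarray(p_b)

rng = np.random.default_rng(0)
vocab, hidden = 6, 4
U = rng.standard_normal((hidden, vocab))
W = rng.standard_normal((hidden, hidden))
V = rng.standard_normal((vocab, hidden))
state, next_word_probs = rnn_lm_step(2, np.zeros(hidden), U, W, V)

# Made-up probabilities: each model is confident on a different half of the words.
p_a = np.array([0.9, 0.1, 0.9, 0.1])
p_b = np.array([0.1, 0.9, 0.1, 0.9])
ppl_a, ppl_b = perplexity(p_a), perplexity(p_b)
ppl_mix = perplexity(interpolate(p_a, p_b))
```

  • Models that make complementary errors can reach a lower perplexity when mixed than either achieves alone, which is the intuition behind combining several LMs.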


Generating Text with Recurrent Neural Networks
  • Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems.
  • This paper by Sutskever et al. from UofT in ICML 2011 demonstrates the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so they introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next.
  • Having applied a modestly-sized standard RNN architecture to the character-level language modeling problem (where the target output at each time step is defined as the input character at the next time-step), they found the performance somewhat unsatisfactory, and that while increasing the dimensionality of the hidden state did help, the per-parameter gain in test performance was not sufficient to allow the method to be both practical and competitive with state-of-the-art approaches. They address this problem by proposing a new temporal architecture called the Multiplicative RNN (MRNN) which they argue is better suited to the language modeling task.
  • At first sight, modeling language at the character level seems unnecessarily difficult, since morphemes are the appropriate units for making semantic and syntactic predictions. However, converting large databases into sequences of morphemes is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. Their experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., “cryptoliation”, “homosomalist”, or “un-ameliary”). This is a desirable property which enables the MRNN to gracefully deal with real words that it nonetheless didn’t see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words, and this is so advantageous that some word-level language models actually make up binary “spellings” of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).
  • MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If much bigger MRNNs could be trained with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.
  • After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, they were able to surpass the performance of the best previous single method for character level language modeling – a hierarchical nonparametric sequence model. At this point, this represents the largest recurrent neural network application to date.
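  • The multiplicative connection can be sketched as a factored, input-dependent transition: the current character selects a diagonal of factor gains that sits between two shared matrices. The NumPy step below follows that idea; dimensions, parameter names, and random weights are assumptions, and no Hessian-free training is shown.

```python
import numpy as np

def mrnn_step(x, h_prev, p):
    """One multiplicative-RNN step with a factored hidden-to-hidden transition."""
    f = (p["W_fx"] @ x) * (p["W_fh"] @ h_prev)            # factors gated by the input
    h = np.tanh(p["W_hf"] @ f + p["W_hx"] @ x + p["b_h"])
    o = p["W_oh"] @ h + p["b_o"]                          # logits over the next character
    return h, o

rng = np.random.default_rng(0)
n_x, n_h, n_f = 5, 4, 6                                   # alphabet size, hidden units, factors
p = {
    "W_fx": rng.standard_normal((n_f, n_x)), "W_fh": rng.standard_normal((n_f, n_h)),
    "W_hf": rng.standard_normal((n_h, n_f)), "W_hx": rng.standard_normal((n_h, n_x)),
    "b_h": np.zeros(n_h), "W_oh": rng.standard_normal((n_x, n_h)), "b_o": np.zeros(n_x),
}
x = np.eye(n_x)[2]                                        # one-hot current character
h_prev = rng.standard_normal(n_h)
h, o = mrnn_step(x, h_prev, p)

# The same update written with an explicit transition matrix chosen by the input.
W_hh_x = p["W_hf"] @ np.diag(p["W_fx"] @ x) @ p["W_fh"]
h_explicit = np.tanh(W_hh_x @ h_prev + p["W_hx"] @ x + p["b_h"])
```

  • The factorization gives every character its own effective hidden-to-hidden matrix without storing a separate full matrix per character.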


Efficient Estimation of Word Representations in Vector Space
  • “You shall know a word by the company it keeps” — J. R. Firth.
  • This paper by Mikolov et al. from Google in 2013 proposes word2vec, which comprises two novel model architectures for computing continuous vector representations of words from very large data sets. They studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks involving word similarity, and the results are compared to the previously best performing techniques based on different types of neural networks. They observe large improvements in accuracy at much lower computational cost, i.e., it takes less than a day to learn high quality word vectors from a 1.6-billion-word data set.
  • Based on a two-layer MLP neural network (i.e., one hidden layer and output layer), they propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The Continuous Bag-of-Words (CBOW) model architecture predicts the current word based on the context, while the skip-gram model predicts surrounding/context words given the current word.
  • They observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.
  • Furthermore, they show that these vectors provide state-of-the-art performance on their test set for measuring syntactic and semantic word similarities.
  • Word2vec popularized the “King – Man + Woman = Queen” analogy.
  • Overall, two important learnings from Word2Vec were:
    • Embeddings of semantically similar words are close in cosine similarity.
    • Word embeddings support intuitive arithmetic properties. (An important consequence of this statement is that phrase embeddings can be obtained as the sum of word embeddings.)
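  • A minimal sketch of the ideas above, under stated assumptions: the helper names (`skipgram_pairs`, `cbow_examples`, `analogy`) and the hand-made three-dimensional `TOY_VECTORS` are illustrative only and are not taken from the paper or from any word2vec implementation. It shows how the two architectures frame their training examples, and how the analogy arithmetic reduces to a nearest-neighbor search under cosine similarity.

```python
import math

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (assumption): dimensions loosely mean royalty, male, female.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

print(skipgram_pairs(["the", "cat", "sat"], window=1))
print(analogy(TOY_VECTORS, "king", "man", "woman"))
```

  • Real embeddings are learned rather than hand-set, so the analogy only holds approximately in practice; the toy vectors are chosen so that the arithmetic is easy to follow.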
Distributed Representations of Words and Phrases and their Compositionality
  • This paper by Mikolov et al. from Google in NeurIPS 2013 builds on their earlier paper Efficient Estimation of Word Representations in Vector Space which proposed the Skip-gram model as an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. They present several extensions that improve both the quality of the vectors and the training speed.
  • They describe a simple alternative to the hierarchical softmax called negative sampling, packaged as Skip-gram with Negative Sampling (SGNS). Negative sampling is an extremely simple training method that learns accurate representations especially for frequent words. Furthermore, they propose subsampling of frequent words, which is shown to yield both faster training and significantly better representations of uncommon words.
  • An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, they present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
  • The techniques introduced in this paper can be used also for training the continuous bag-of-words model introduced in Efficient Estimation of Word Representations in Vector Space.
  • Owing to the training optimizations proposed in this paper and the computationally efficient model architecture, they successfully trained models on several orders of magnitude more data than the previously published models. This results in a great improvement in the quality of the learned word and phrase representations, especially for the rare entities.
  • The choice of the training algorithm and the hyper-parameter selection is a task specific decision, as different problems have different optimal hyperparameter configurations. In their experiments, the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.
  • A very interesting result of this work is that the word vectors can be somewhat meaningfully combined using just simple vector addition.
  • Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combining these two approaches gives a powerful yet simple way to represent longer pieces of text, while having minimal computational complexity. Their work can thus be seen as complementary to the existing approaches that attempt to represent phrases using recursive matrix-vector operations.
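  • The two training tricks described above can be sketched as follows. This is a minimal illustration, not the released word2vec code: the function names are hypothetical, the threshold `t=1e-5` and the 3/4 power for the noise distribution are the values I recall the paper reporting, and vectors are plain Python lists.

```python
import math

def discard_probability(freq, t=1e-5):
    """Subsampling: P(discard w) = 1 - sqrt(t / f(w)), clipped at 0 for rare words.

    freq is the word's relative frequency in the corpus (between 0 and 1).
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to a power, renormalized, for drawing negative samples."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair and k noise vectors.

    Pulls the true context vector toward the center vector and pushes the
    sampled noise vectors away from it.
    """
    loss = -math.log(sigmoid(dot(context, center)))
    for noise in negatives:
        loss -= math.log(sigmoid(-dot(noise, center)))
    return loss

# A word making up 0.1% of the corpus is dropped about 90% of the time.
print(discard_probability(1e-3))
print(noise_distribution({"the": 16, "zebra": 1}))
```

  • The design point is that each update touches only the true context word and a handful of sampled noise words, rather than normalizing over the full vocabulary.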


On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
  • This paper by Cho et al. from Bengio’s lab at Université de Montréal in 2014 first introduced Gated Recurrent Units (GRUs).
  • Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks in which models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation.
  • The paper focuses on analyzing the properties of neural machine translation using two types of neural networks that are able to process variable-length sequences (and differ in the choice of the encoder): (i) a recurrent neural network with gated hidden units, and (ii) the newly proposed gated recursive convolutional neural network. They show that neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase.
  • Furthermore, they find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
GloVe: Global Vectors for Word Representation
  • Word2vec relies only on local information of language. That is, the semantics learnt for a given word are only affected by the surrounding words.
  • This paper by Pennington et al. from Stanford in EMNLP 2014 proposed Global Vectors (GloVe), an unsupervised learning algorithm which captures both global statistics and local statistics of a corpus, in order to train word vectors. Training is performed on aggregated global word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
  • Contemporary methods focused considerable attention on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. At the time, prediction-based models garnered substantial support; for example, Baroni et al. (2014) argue that these models perform better across a range of tasks. Pennington et al. argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous.
  • After Tomas Mikolov et al. released word2vec, there was a boom of papers about word vector representations. GloVe was one such proposal, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurrence matrices. Note that GloVe does not use neural networks while word2vec does.
  • They construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
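  • To make the "global log-bilinear regression" concrete, here is a small sketch of the weighted least-squares objective as I understand it from the paper: each non-zero co-occurrence count \(X_{ij}\) contributes \(f(X_{ij})\,(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\). The function names and the dictionary-based data layout are assumptions made for illustration; `x_max=100` and `alpha=0.75` are the defaults I recall the paper using.

```python
import math

def weighting(x, x_max=100.0, alpha=0.75):
    """f(x): down-weights rare co-occurrences and caps the influence of frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooccurrence, word_vecs, context_vecs, word_bias, context_bias):
    """Sum of weighted squared errors over the non-zero entries of the count matrix.

    cooccurrence maps (i, j) index pairs to positive counts X_ij.
    """
    total = 0.0
    for (i, j), count in cooccurrence.items():
        prediction = sum(a * b for a, b in zip(word_vecs[i], context_vecs[j]))
        prediction += word_bias[i] + context_bias[j]
        total += weighting(count) * (prediction - math.log(count)) ** 2
    return total

# Toy example (assumption): two words with one-dimensional vectors.
counts = {(0, 1): 10.0, (1, 0): 10.0}
loss = glove_loss(counts, [[1.0], [0.5]], [[0.5], [1.0]], [0.0, 0.0], [0.0, 0.0])
print(loss)
```

  • Because the sum runs only over observed pairs, training cost scales with the number of non-zero counts rather than with the square of the vocabulary size.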
Sequence to Sequence Learning with Neural Networks
  • This paper by Sutskever et al. from Google in 2014 introduced seq2seq encoder-decoder learning to map sequences to sequences, a task that simple Deep Neural Networks (DNNs) cannot be used to accomplish.
  • They present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Their method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. They show that a large deep LSTM with a limited vocabulary can outperform a standard statistical machine translation (SMT)-based system whose vocabulary is unlimited on a large-scale MT task. The success of their simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • Their main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When they used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice.
  • They also find that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  • This paper by Cho et al. from Bengio’s lab in EMNLP 2014 introduced the seq2seq encoder-decoder model for neural machine translation. They propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNNs) that are together able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length. The encoder RNN encodes a sequence of symbols into a fixed-length vector representation, and the decoder RNN decodes the representation into another sequence of symbols.
  • The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence.
  • The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
  • Along with the new architecture, they propose a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence.
  • They evaluated the proposed model with the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, they were able to show that the new model is able to capture linguistic regularities in the phrase pairs well and also that the RNN Encoder–Decoder is able to propose well-formed target phrases.
  • The scores by the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. Also, they found that the contribution by the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that they can improve further the performance by using, for instance, the RNN Encoder–Decoder and the neural net language model together.
  • Qualitative analysis shows that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases at multiple levels, i.e., at the word level as well as the phrase level. This suggests that there may be more natural language related applications that may benefit from the proposed RNN Encoder–Decoder.
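  • The hidden unit with a reset gate and an update gate described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code: the parameter names in `params` are hypothetical, biases are omitted for brevity, and the update rule follows the convention I recall from this paper, where the update gate \(z\) weights the previous state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x, h_prev, params):
    """One step of a gated hidden unit.

    r (reset gate) decides how much of the previous state feeds the candidate;
    z (update gate) decides how much of the previous state is carried over.
    """
    pre_r = [a + b for a, b in zip(matvec(params["W_r"], x), matvec(params["U_r"], h_prev))]
    pre_z = [a + b for a, b in zip(matvec(params["W_z"], x), matvec(params["U_z"], h_prev))]
    r = [sigmoid(v) for v in pre_r]
    z = [sigmoid(v) for v in pre_z]
    reset_state = [ri * hi for ri, hi in zip(r, h_prev)]
    pre_h = [a + b for a, b in zip(matvec(params["W"], x), matvec(params["U"], reset_state))]
    candidate = [math.tanh(v) for v in pre_h]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, candidate)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is simply half of the previous state.
toy_params = {"W_r": zeros(2, 1), "U_r": zeros(2, 2),
              "W_z": zeros(2, 1), "U_z": zeros(2, 2),
              "W": zeros(2, 1), "U": zeros(2, 2)}
print(gru_step([1.0], [2.0, -4.0], toy_params))
```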


Neural Machine Translation by Jointly Learning to Align and Translate
  • This paper by Bahdanau et al. from Bengio’s lab in ICLR 2015 borrowed the attention mechanism from the field of information retrieval and introduced it within the context of NLP (commonly called Bahdanau attention or additive attention in the field).
  • This paper introduces an attention mechanism for recurrent neural networks (RNN) to improve long-range sequence modeling capabilities. This allows RNNs to translate longer sentences more accurately, which served as the motivation behind developing the original transformer architecture later.
  • The following diagram from the paper illustrates the proposed model trying to generate the \(t^{th}\) target word \(y_t\) given a source sentence \((x_1, x_2, \ldots , x_T)\).

  • Referring to the figure above, the architecture consists of a bidirectional RNN as an encoder and a decoder that emulates searching through a source sentence during decoding a translation.
  • Decoder:
    • In prior encoder-decoder approaches, the decoder defines a probability over the translation \(y\) by decomposing the joint probability into the ordered conditionals: \(p(\mathbf{y})=\prod_{t=1}^T p\left(y_t \mid\left\{y_1, \cdots, y_{t-1}\right\}, c\right)\)

      • where \(\mathbf{y}=\left(y_1, \cdots, y_{T_y}\right)\). With an RNN, each conditional probability is modeled as, \(p\left(y_t \mid\left\{y_1, \cdots, y_{t-1}\right\}, c\right)=g\left(y_{t-1}, s_t, c\right)\)
    • On the other hand, in the proposed model architecture, they define each conditional probability over the output translation \(y\) as, \(p\left(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}\right)=g\left(y_{i-1}, s_i, c_i\right)\)

      • where \(s_i\) is an RNN hidden state for time \(i\), computed by \(s_i=f\left(s_{i-1}, y_{i-1}, c_i\right) \text {. }\)
    • It should be noted that unlike the prior encoder-decoder approaches, here the probability is conditioned on a distinct context vector \(c_i\) for each target word \(y_i\). The context vector \(c_i\) depends on a sequence of annotations \(\left(h_1, \cdots, h_{T_x}\right)\) to which an encoder maps the input sentence. Each annotation \(h_i\) contains information about the whole input sequence with a strong focus on the parts surrounding the \(i^{th}\) word of the input sequence. More information on obtaining annotations in the section on Encoder below.
    • The context vector \(c_i\) is then computed as a weighted sum of these annotations \(h_j\): \(c_i=\sum_{j=1}^{T_x} \alpha_{i j} h_j\)

    • The weight \(\alpha_{i j}\) of each annotation \(h_j\) is computed by, \(\alpha_{i j}=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{T_x} \exp \left(e_{i k}\right)},\)

      • where \(e_{i j}=a\left(s_{i-1}, h_j\right)\) is an alignment model which scores how well the inputs around position \(j\) and the output at position \(i\) match. The score is based on the RNN hidden state \(s_{i-1}\) and the \(j^{th}\) annotation \(h_j\) of the input sentence.

  • Encoder:
    • The prior RNN architecture reads an input sequence \(\mathbf{x}\) in order, starting from the first symbol \(x_1\) to the last one \(x_{T_x}\). However, in the proposed scheme, they would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, they propose to use a bidirectional RNN, which has been successfully used recently in speech recognition.
    • A BiRNN consists of forward and backward RNNs. The forward RNN \(\vec{f}\) reads the input sequence as it is ordered (from \(x_1\) to \(x_{T_x}\)) and calculates a sequence of forward hidden states \(\left(\vec{h}_1, \cdots, \vec{h}_{T_x}\right)\). The backward RNN \(\overleftarrow{f}\) reads the sequence in the reverse order (from \(x_{T_x}\) to \(x_1\)), resulting in a sequence of backward hidden states \(\left(\overleftarrow{h}_1, \cdots, \overleftarrow{h}_{T_x}\right)\).
    • They obtain an annotation for each word \(x_j\) by concatenating the forward hidden state \(\vec{h}_j\) and the backward one \(\overleftarrow{h}_j\), i.e., \(h_j=\left[\vec{h}_j^{\top} ; \overleftarrow{h}_j^{\top}\right]^{\top}\). In this way, the annotation \(h_j\) contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation \(h_j\) will be focused on the words around \(x_j\). This sequence of annotations is used by the decoder and the alignment model later to compute the context vector per the equations in the Decoder section above.
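  • The decoder equations above can be traced in code. The sketch below computes the alignment scores \(e_{ij}\), the weights \(\alpha_{ij}\), and the context vector \(c_i\) for one decoding step. It assumes the additive form \(a(s_{i-1}, h_j)=v_a^\top \tanh(W_a s_{i-1}+U_a h_j)\), which is how I recall the paper parameterizing the alignment model; the function name and list-based layout are illustrative only.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for one target position i.

    s_prev      : previous decoder state s_{i-1}
    annotations : encoder annotations h_1 .. h_{T_x}
    """
    projected_state = matvec(W_a, s_prev)
    scores = []
    for h_j in annotations:
        projected_h = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v * u for v, u in zip(v_a, hidden)))  # e_ij
    alphas = softmax(scores)                                     # alpha_ij
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, context                                       # c_i

# Toy shapes (assumption): 1-d decoder state, 2-d annotations, 1 hidden unit.
alphas, context = additive_attention(
    s_prev=[1.0],
    annotations=[[1.0, 0.0], [0.0, 1.0]],
    W_a=[[0.0]], U_a=[[1.0, -1.0]], v_a=[1.0],
)
print(alphas, context)
```

  • A distinct context vector is produced at every target position because `s_prev` changes from step to step, which is the key difference from the fixed vector \(c\) of earlier encoder-decoder models.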
Effective Approaches to Attention-based Neural Machine Translation
  • Neural Machine Translation by Jointly Learning to Align and Translate proposed an attention mechanism to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
  • This paper by Luong et al. in EMNLP 2015 from Manning’s group at Stanford explores useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.
  • They demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, they achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout.
  • Their ensemble model using different attention architectures has established a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
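  • The difference between the global and local approaches can be illustrated with a small sketch. It assumes a plain dot-product score and a hard window of half-width `D` around an aligned position `center`; the paper also describes other score functions and a predicted alignment position, which are not reproduced here. All names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(target_state, source_states, center=None, D=None):
    """Dot-product attention weights keyed by source position.

    Global attention (center is None) scores every source position.
    Local attention scores only positions in [center - D, center + D].
    """
    if center is None:
        positions = range(len(source_states))
    else:
        positions = range(max(0, center - D), min(len(source_states), center + D + 1))
    positions = list(positions)
    scores = [sum(a * b for a, b in zip(target_state, source_states[p])) for p in positions]
    return dict(zip(positions, softmax(scores)))

source = [[0.1, 0.0], [0.9, 0.1], [0.2, 0.8], [0.0, 0.3], [0.5, 0.5]]
print(attention_weights([1.0, 0.0], source))                 # global: 5 weights
print(attention_weights([1.0, 0.0], source, center=2, D=1))  # local: positions 1..3
```

  • The local variant trades some flexibility for cost: the softmax is computed over at most `2 * D + 1` positions regardless of sentence length.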


Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
  • Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential.
  • This paper by Wu et al. from Google in 2016 presents GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Their model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections.
  • To improve parallelism and therefore decrease training time, their attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, they employ low-precision arithmetic during inference computations.
  • To improve handling of rare words, they divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system.
  • Their beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence.
  • On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
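  • The beam search rescoring described above can be sketched as follows. The formulas are the length penalty \(lp(Y)=\frac{(5+|Y|)^\alpha}{(5+1)^\alpha}\) and coverage penalty \(cp=\beta \sum_i \log \min(\sum_j p_{ij}, 1)\) as I recall them from the paper; the default values of `alpha` and `beta`, the `eps` clamp that avoids \(\log 0\), and the function names are assumptions made for this illustration.

```python
import math

def length_penalty(target_length, alpha=0.2):
    """Normalizer that grows with length, so longer hypotheses are not unfairly punished."""
    return ((5 + target_length) ** alpha) / ((5 + 1) ** alpha)

def coverage_penalty(attention, beta=0.2, eps=1e-9):
    """attention[j][i] is the attention of the j-th target word on the i-th source word.

    Source words whose total attention is below 1 contribute a negative term,
    which favors hypotheses that cover the whole source sentence.
    eps is a guard added here (not from the paper) to avoid log(0).
    """
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(row[i] for row in attention)
        total += math.log(min(max(covered, eps), 1.0))
    return beta * total

def beam_score(log_prob, attention, alpha=0.2, beta=0.2):
    """Score used to rank finished hypotheses in the beam."""
    return log_prob / length_penalty(len(attention), alpha) + coverage_penalty(attention, beta)

full = [[1.0, 0.0], [0.0, 1.0]]     # both source words fully attended
partial = [[1.0, 0.0], [1.0, 0.0]]  # second source word ignored
print(beam_score(-2.0, full), beam_score(-2.0, partial))
```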
Neural machine translation of rare words with subword units
  • Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary.
  • This paper by Sennrich et al. from the University of Edinburgh in ACL 2016 introduces a simpler and more effective approach based on Byte Pair Encoding (BPE), making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).
  • They discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.
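  • The merge-learning loop behind BPE segmentation can be sketched in a few lines. This is a simplified variant written for illustration (the paper includes its own short reference snippet, which this does not reproduce): words are stored as tuples of symbols with an end-of-word marker `</w>`, and the most frequent adjacent symbol pair is merged repeatedly. The toy word counts and function names are assumptions.

```python
import collections

def to_symbols(word_counts):
    """Split each word into characters and append an end-of-word marker."""
    return {tuple(word) + ("</w>",): count for word, count in word_counts.items()}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_counts, num_merges):
    vocab = to_symbols(word_counts)
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)
```

  • The number of merges is the only hyperparameter and directly controls the final vocabulary size; rare words stay split into smaller units while frequent words end up as single symbols.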
HyperNetworks
  • This paper by Ha et al. from Google Brain introduces an innovative approach where a smaller network, termed a “hypernetwork,” generates the weights for a larger network, referred to as the “main network.” This concept draws inspiration from evolutionary computing methods and aims to manage the large search spaces involved in weight parameters of neural networks. The hypernetwork approach is designed to be efficient and scalable, trained end-to-end with backpropagation.
  • Hypernetworks are a form of abstraction similar to the genotype-phenotype relationship in nature. They can be viewed as a relaxed form of weight-sharing across layers in neural networks, striking a balance between the flexibility of convolutional networks (which typically do not have weight-sharing) and the rigidity of recurrent networks (which do).
  • The following figure from the paper illustrates a hypernetwork generating weights for a feedforward network. Black connections and parameters are associated with the main network, whereas orange connections and parameters are associated with the hypernetwork.

  • For convolutional networks, hypernetworks generate weights for each convolutional layer. This method was shown to be effective for image recognition tasks with fewer learnable parameters, achieving respectable results on datasets like CIFAR-10.
  • For recurrent networks, such as LSTMs, hypernetworks can dynamically generate weights that vary across many timesteps. This approach has been demonstrated to be effective for a variety of sequence modeling tasks, including language modeling and handwriting generation, achieving near state-of-the-art results on datasets like Character Penn Treebank and Hutter Prize Wikipedia.
  • Hypernetworks have been shown to generate non-shared weights for LSTMs, outperforming standard LSTM versions in certain tasks. They are also beneficial in reducing the number of learnable parameters while maintaining competitive performance in tasks like image recognition and language modeling.
  • The paper reports experiments in different domains:
    • For image recognition, hypernetworks demonstrated effectiveness in generating filters for convolutional networks, tested on MNIST and CIFAR-10 datasets.
    • In language modeling, hypernetworks were applied to character-level prediction tasks on the Penn Treebank corpus and the Hutter Prize Wikipedia dataset (enwik8), showing competitive results.
    • For handwriting generation, the hypernetwork model was trained on the IAM handwriting database, outperforming several configurations of LSTM models.
    • In the neural machine translation task, HyperLSTM cells replaced LSTM cells in a wordpiece model architecture, improving performance on the WMT’14 English to French dataset.
  • The method presented in the paper is efficient, scalable, and works well with fewer parameters. Hypernetworks proved competitive or sometimes superior to state-of-the-art models in various applications like image recognition, language modeling, and handwriting generation.
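  • As a rough illustration of the weight-generation idea, the sketch below uses one shared linear map that turns a small per-layer embedding into that layer's full weight matrix. This is a deliberate simplification and an assumption on my part, not the paper's exact hypernetwork design; all names, sizes, and the random initialization are hypothetical.

```python
import math
import random

def init_projection(embed_dim, rows, cols, seed=0):
    """Shared hypernetwork parameters: maps a layer embedding to rows * cols weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(rows * cols)]

def generate_weights(projection, layer_embedding, rows, cols):
    """Produce one layer's weight matrix from its embedding."""
    flat = [sum(p * z for p, z in zip(row, layer_embedding)) for row in projection]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def forward(x, layer_embeddings, projection):
    """Main network: every layer's weights come from the same hypernetwork."""
    size = len(x)
    for embedding in layer_embeddings:
        weights = generate_weights(projection, embedding, size, size)
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
    return x

projection = init_projection(embed_dim=2, rows=3, cols=3)
layer_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one small vector per layer
print(forward([1.0, -1.0, 0.5], layer_embeddings, projection))
```

  • Only the embeddings differ between layers, which is what makes this a relaxed form of weight sharing. With a single linear map as used here, parameters are saved only when the number of layers is large relative to the embedding size.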


Attention Is All You Need
  • This paper by Vaswani et al. from Google in NeurIPS 2017 introduced Transformers (that are based on scaled dot-product multi-headed attention) which are prevalent in most NLP and CV areas today.
  • Please refer to the Transformer primer for a detailed discourse on Transformers.
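  • For a quick reference, scaled dot-product attention computes \(\operatorname{softmax}\left(Q K^{\top} / \sqrt{d_k}\right) V\). The pure-Python sketch below covers a single head without masking or learned projections; the function names and list-of-lists layout are assumptions chosen to keep it dependency-free.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """One attention head: each query returns a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Scaling by sqrt(d_k) keeps the dot products from growing with dimension.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0], [4.0]]
print(scaled_dot_product_attention([[0.0, 0.0], [5.0, 0.0]], keys, values))
```

  • Multi-head attention runs several such heads on different learned projections of the inputs and concatenates their outputs.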
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  • The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. Also, static neural network architectures apply the same function to every example. In contrast, input dependent models attempt to tailor the function to each example. While it is straightforward for a human to manually specify a single static architecture, it is infeasible to specify every input-dependent function by hand. Instead, the input-dependent function must be automatically inferred by the model, which introduces an extra level of complexity in optimization.
  • Given the need to automatically infer architectures for each example, a natural solution is to define a single large model (supernetwork) with numerous subnetworks (experts), and route examples through a path in the supernetwork. The figure below from Ramachandran and Le (2019) visualizes an example of a routing network. Intuitively, similar examples can be routed through similar paths and dissimilar examples can be routed through different paths. The example-dependent routing also encourages expert specialization, in which experts devote their representational capacity to transforming a chosen subset of examples.

  • Learning to route examples to well-matched experts is critical for good performance. Effective routing can be achieved by training another small neural network (router) that learns to route examples through the supernetwork. The router takes the example as input and outputs the next expert to use. The router can take advantage of the intermediate representations of the example produced in the supernetwork.
  • This paper by Shazeer et al. in ICLR 2017 addresses these challenges and finally realizes the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
  • They introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. In this per-example routing setup, different examples are processed by different subcomponents, or experts, inside a larger model, a.k.a. a supernetwork.
  • Specifically, the proposed MoE layer takes as an input a token representation \(x\) and then routes this to the best determined top-\(k\) experts, selected from a set \(\left\{E_i(x)\right\}_{i=1}^N\) of \(N\) experts. The router variable \(W_r\) produces logits \(h(x)=W_r \cdot x\) which are normalized via a softmax distribution over the available \(N\) experts at that layer. The gate-value for expert \(i\) is given by,
\[p_i(x)=\frac{e^{h(x)_i}}{\sum_{j=1}^N e^{h(x)_j}}\]
  • The top-\(k\) gate values are selected for routing the token \(x\). If \(\mathcal{T}\) is the set of selected top-\(k\) indices then the output computation of the layer is the linearly weighted combination of each expert’s computation on the token by the gate value,
\[y=\sum_{i \in \mathcal{T}} p_i(x) E_i(x)\]
  • They apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. They present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
  • The following diagram from the paper illustrates a Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.
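  • The routing equations above translate directly into code. The sketch follows the formulation given in this summary (softmax over all \(N\) router logits, then keep the top-\(k\) gate values); as I understand it, the paper's own gating additionally uses noise and load-balancing terms, which are left out here. Experts are plain Python callables and all names are illustrative.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, k=2):
    """Route token x to its top-k experts and mix their outputs by gate value.

    router_weights : one row of W_r per expert, so logits h(x) = W_r . x
    experts        : list of callables E_i mapping a vector to a vector
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    gates = softmax(logits)  # p_i(x)
    top_k = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    output = None
    for i in top_k:  # only the selected experts are ever evaluated
        scaled = [gates[i] * v for v in experts[i](x)]
        output = scaled if output is None else [a + b for a, b in zip(output, scaled)]
    return output, top_k

def make_scaling_expert(scale):
    """Toy expert (assumption): multiplies its input by a constant."""
    return lambda x: [scale * v for v in x]

toy_experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0)]
y, chosen = moe_layer([1.0], [[0.0], [1.0], [2.0]], toy_experts, k=2)
print(y, chosen)
```

  • Compute per token depends on `k`, not on the total number of experts, which is how capacity can grow without a proportional increase in computation.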

Using the Output Embedding to Improve Language Models
  • This paper by Press and Wolf from Tel-Aviv University in EACL 2017 proposes the concept of weight tying, by studying the topmost weight matrix of neural network language models.
  • They show that this matrix constitutes a valid word embedding. When training language models, they recommend tying the input embedding and this output embedding.
  • They analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model.
  • They also offer a new method of regularizing the output embedding.
  • Their methods lead to a significant reduction in perplexity, as they are able to show on a variety of neural network language models. Finally, they show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
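  • Weight tying can be pictured as a language model that uses one matrix both to look up input embeddings and to score output tokens. The toy class below is a sketch under that reading; the class name, random initialization, and the absence of any recurrent layer between input and output are simplifications of mine.

```python
import random

class TiedEmbeddingLM:
    """Toy model head whose input and output embeddings share a single matrix."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One (vocab_size x dim) matrix serves both roles.
        self.embedding = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                          for _ in range(vocab_size)]

    def embed(self, token_id):
        """Input side: look up the row for a token."""
        return self.embedding[token_id]

    def logits(self, hidden):
        """Output side: score every token by its dot product with the hidden state."""
        return [sum(e * h for e, h in zip(row, hidden)) for row in self.embedding]

    def num_embedding_parameters(self):
        return len(self.embedding) * len(self.embedding[0])

model = TiedEmbeddingLM(vocab_size=5, dim=3)
scores = model.logits(model.embed(2))
print(len(scores), model.num_embedding_parameters())
```

  • An untied model of the same shape would store two such matrices, so tying halves the embedding parameters, which usually dominate in models with large vocabularies.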
Enriching Word Vectors with Subword Information
  • Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words.
  • This paper by Bojanowski et al. from FAIR in TACL 2017 proposes fastText, a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations.
  • As the name suggests, fastText is fast, allowing models to be trained on large corpora quickly and enabling word representations to be computed for words that did not appear in the training data.
  • They evaluate fastText’s word representations on nine different languages, both on word similarity and analogy tasks. By comparing with recently proposed morphological word representations, they show that fastText’s vectors achieve state-of-the-art performance on these tasks.
  • Code.
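  • The bag-of-character-n-grams representation can be sketched as follows. The boundary markers `<` and `>` and the n-gram range of 3 to 6 match what I recall from the paper; the hash function (32-bit FNV-1a), the tiny bucket table, and the function names are assumptions for illustration and are not claimed to match the fastText library's internals.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers, plus the whole word."""
    token = "<" + word + ">"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    if token not in grams:
        grams.append(token)  # special sequence for the word itself
    return grams

def fnv1a_32(text):
    """Deterministic 32-bit FNV-1a hash, used to map n-grams to buckets."""
    value = 2166136261
    for byte in text.encode("utf-8"):
        value = ((value ^ byte) * 16777619) & 0xFFFFFFFF
    return value

def word_vector(word, bucket_table):
    """Sum the bucket vectors of the word's n-grams; works for unseen words too."""
    vector = [0.0] * len(bucket_table[0])
    for gram in char_ngrams(word):
        row = bucket_table[fnv1a_32(gram) % len(bucket_table)]
        vector = [a + b for a, b in zip(vector, row)]
    return vector

print(char_ngrams("where", n_min=3, n_max=3))
toy_table = [[float(i), 1.0] for i in range(8)]  # 8 buckets of 2-d vectors
print(word_vector("unseenword", toy_table))
```

  • Because a vector is assembled from subword pieces, a word that never appeared in training still receives a representation built from n-grams it shares with known words.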


Deep contextualized word representations
  • This paper by Peters et al. from Allen AI and UW in NAACL 2018 introduced LSTM-based Embeddings from Language Models (ELMo), an approach for learning high-quality deep context-dependent/context-sensitive word representations/embeddings from biLMs.
  • These deep contextualized word representations model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
  • ELMo’s word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis. They also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
  • Through ablations and other controlled experiments, they have confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance, enabling ELMo to show large improvements on a broad range of NLP tasks.
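  • The statement that ELMo vectors are learned functions of all biLM layers can be made concrete: for each token, the layer states are combined with softmax-normalized weights and scaled by a task-specific factor, \(\gamma \sum_j s_j h_{k, j}\). The sketch below assumes that form; the function name and the toy layer states are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layer_states, raw_weights, gamma=1.0):
    """Collapse the biLM layer states for one token into a single vector.

    layer_states : one vector per biLM layer for the same token
    raw_weights  : unnormalized per-layer scalars, learned by the downstream task
    gamma        : task-specific scale for the whole vector
    """
    s = softmax(raw_weights)
    dim = len(layer_states[0])
    return [gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layer_states))
            for d in range(dim)]

token_layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # three layers, toy values
print(elmo_mix(token_layers, raw_weights=[0.0, 0.0, 0.0], gamma=3.0))
```

  • Letting each downstream task learn its own weights is what allows it to emphasize whichever layers carry the most useful signal for that task.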
Improving Language Understanding by Generative Pre-Training
  • Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
  • This paper by Radford et al. from OpenAI in 2018 introduces a framework for achieving strong natural language understanding with a single task-agnostic model. It demonstrates that large gains on the aforementioned NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • In contrast to previous approaches, they make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
  • By pre-training on a diverse corpus with long stretches of contiguous text, their model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets and thus outperforming discriminatively trained models that use architectures specifically crafted for each task. For instance, they achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
  • Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Their work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach.
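  • The task-aware input transformations can be sketched as follows: structured inputs are flattened into one token sequence so that the same pre-trained model can consume them. This is a minimal sketch; the special-token strings and function names are hypothetical placeholders, not the identifiers used in the paper’s code.

```python
# Hypothetical special tokens marking the start, the separator, and the
# position whose final hidden state feeds the task-specific output layer.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def entailment_input(premise, hypothesis):
    # Premise and hypothesis are joined into one ordered sequence.
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(sentence_a, sentence_b):
    # Similarity has no inherent ordering, so both orderings are produced.
    return [
        [START, *sentence_a, DELIM, *sentence_b, EXTRACT],
        [START, *sentence_b, DELIM, *sentence_a, EXTRACT],
    ]

def multiple_choice_inputs(context, answers):
    # One sequence per candidate answer; each is scored independently.
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]

example = entailment_input(["it", "rains"], ["it", "is", "wet"])
```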
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
  • This paper by Kudo and Richardson in EMNLP 2018 describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which makes a purely end-to-end and language-independent system possible.
  • They perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences.
  • They also compare the performance of subword training and segmentation with various configurations.
  • Code.
Self-Attention with Relative Position Representations
  • Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs.
  • This paper by Shaw et al. in NAACL 2018 presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements.
  • The figure below from the paper shows example edges representing relative positions, or the distance between elements. They learn representations for each relative position within a clipping distance \(k\). The figure assumes \(2 \leq k \leq n - 4\). Note that not all edges are shown.

  • In the original Transformer paper, the authors hypothesized that in contrast to learned, absolute position representations, sinusoidal position encodings would help the model to generalize to sequence lengths unseen during training, by allowing it to learn to attend also by relative position. This property is shared by the relative position representations proposed here which, in contrast to absolute position representations, are invariant to the total sequence length.
  • On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, they observe that combining relative and absolute position representations yields no further improvement in translation quality.
  • They describe an efficient implementation of their method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.
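  • The clipping of relative positions can be sketched as below: the offset \(j - i\) between a query position \(i\) and a key position \(j\) is clipped to \([-k, k]\) and shifted to index a table of \(2k + 1\) learned embeddings. This is a minimal sketch; `relative_position_ids` is a hypothetical helper name, and the embedding lookup itself is omitted.

```python
def relative_position_ids(n, k):
    """Return an n x n matrix of indices into 2k + 1 relative-position embeddings.

    Entry [i][j] encodes the clipped offset j - i, shifted by k so that
    indices run from 0 (k or more positions to the left) to 2k (k or more
    positions to the right).
    """
    return [[max(-k, min(k, j - i)) + k for j in range(n)] for i in range(n)]

ids = relative_position_ids(5, 2)
```

  • Because every offset beyond the clipping distance maps to the same index, the table size depends only on \(k\) and not on the sequence length.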
Blockwise Parallel Decoding for Deep Autoregressive Models
  • Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, generation still remains an inherently sequential process.
  • This paper by Stern et al. from Google in NeurIPS 2018 seeks to overcome this limitation by proposing a novel blockwise parallel decoding scheme in which they make predictions for multiple time steps in parallel and then back off to the longest prefix validated by a scoring model. This allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel.
  • They verify their approach empirically through a series of experiments using state-of-the-art self-attention models for machine translation and image super-resolution, achieving iteration reductions of up to 2x over a baseline greedy decoder with no loss in quality, or up to 7x in exchange for a slight decrease in performance. In terms of wall-clock time, their fastest models exhibit real-time speedups of up to 4x over standard greedy decoding.
  • The following figure from the paper shows the three substeps of blockwise parallel decoding. In the predict substep, the greedy model and two proposal models independently and in parallel predict “in”, “the”, and “bus”. In the verify substep, the greedy model scores each of the three independent predictions, conditioning on the previous independent predictions where applicable. When using a Transformer or convolutional sequence-to-sequence model, these three computations can be done in parallel. The highest-probability prediction for the third position is “car”, which differs from the independently predicted “bus”. In the accept substep, \(\hat{y}\) is hence extended to include only “in” and “the” before making the next \(k\) independent predictions.

  • The following figure from the paper illustrates the fact that combining the scoring and proposal models allows us to merge the previous verification substep with the next prediction substep. This makes it feasible to call the model just once per iteration rather than twice, halving the number of model invocations required for decoding.
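  • The predict, verify, and accept substeps can be sketched with toy stand-ins for the models. This is a minimal sketch under stated assumptions: `blockwise_decode`, `toy_base_next`, and `toy_proposers` are hypothetical, the "models" are lookup functions rather than networks, and verification runs in a Python loop here whereas the paper’s point is that it can be done in parallel.

```python
EOS = "</s>"

def blockwise_decode(base_next, proposers, max_len):
    """Decode with predict / verify / accept substeps.

    base_next(prefix) returns the base model's greedy next token.
    proposers[i](prefix) guesses the token i positions past the prefix;
    proposers[0] is expected to be the base model itself.
    Returns the decoded tokens and the number of decoding iterations.
    """
    out, iterations = [], 0
    while len(out) < max_len and (not out or out[-1] != EOS):
        iterations += 1
        # Predict: every proposer guesses one future position from the same prefix.
        block = [propose(out) for propose in proposers]
        # Verify and accept: keep the longest block prefix that matches what
        # the base model would have produced greedily.
        accepted = []
        for token in block:
            if base_next(out + accepted) != token:
                break
            accepted.append(token)
            if token == EOS:
                break
        if not accepted:
            # Fallback so decoding always advances by at least one token.
            accepted = [base_next(out)]
        out.extend(accepted)
    return out[:max_len], iterations

# Toy setup: the base model "knows" the target; the third proposer is always wrong.
TARGET = ["the", "dog", "rode", "in", "the", "car", EOS]

def toy_base_next(prefix):
    return TARGET[min(len(prefix), len(TARGET) - 1)]

toy_proposers = [
    toy_base_next,
    lambda prefix: TARGET[min(len(prefix) + 1, len(TARGET) - 1)],
    lambda prefix: "bus",
]

decoded, steps = blockwise_decode(toy_base_next, toy_proposers, max_len=10)
```

  • In this toy run the output is identical to greedy decoding, but it is reached in fewer iterations because two tokens are accepted per step whenever the second proposer is right.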

Universal Language Model Fine-tuning for Text Classification
  • Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.
  • This paper by Howard and Ruder in ACL 2018 proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
  • The following figure from the paper shows that ULMFiT consists of three stages: a) The LM is trained on a general-domain corpus to capture general features of the language in different layers. b) The full LM is fine-tuned on target task data using discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (STLR) to learn task-specific features. c) The classifier is fine-tuned on the target task using gradual unfreezing, ‘Discr’, and STLR to preserve low-level representations and adapt high-level ones (shaded: unfreezing stages; black: frozen).

  • ULMFiT significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets.
  • Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data.
  • Code.
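  • The two learning-rate techniques named above can be sketched as follows. This is a minimal sketch, not the released code: the function names are hypothetical, and the default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`, per-layer decay of 2.6) are the settings as recalled from the paper and should be treated as assumptions.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    # ratio controls how much smaller the lowest rate is than lr_max.
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(top_lr, num_layers, decay=2.6):
    """Per-layer rates: the last layer gets top_lr, each lower layer gets less."""
    return [top_lr / decay ** (num_layers - 1 - layer) for layer in range(num_layers)]

peak = slanted_triangular_lr(100, 1000)
layer_lrs = discriminative_lrs(0.01, 3)
```

  • The schedule lets the model move quickly to a suitable region of parameter space and then refine slowly, while the per-layer rates change the general lower layers less than the task-specific upper ones.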
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
  • The paper by Vijayakumar et al. from Virginia Tech and Indiana University presents an alternative to the traditional Beam Search (BS) method, known as Diverse Beam Search (DBS). The paper is focused on enhancing the diversity in the solutions decoded from neural sequence models, addressing the issue that BS often results in sequences with minor variations and fails to capture the inherent ambiguity of complex AI tasks.
  • The paper introduces Diverse Beam Search (DBS), an algorithm that decodes a list of diverse outputs by optimizing a diversity-augmented objective. DBS divides the beam budget into groups and enforces diversity between these groups.
    • Comparing image captioning outputs decoded by BS and their method, DBS, they notice that BS captions are near-duplicates with similar shared paths in the search tree and minor variations at the end. In contrast, DBS captions are significantly more diverse and resemble the inter-human variability in describing images.
  • The following figure from the paper demonstrates that DBS finds better top-1 solutions compared to BS by controlling the exploration and exploitation of the search space. This implies that DBS is a superior search algorithm in terms of result diversity.

  • The authors also study the impact of the number of groups, the strength of diversity penalty, and various forms of diversity functions for language models. They explore various forms of the dissimilarity term used in DBS, such as Hamming Diversity, Cumulative Diversity, and n-gram Diversity, and their impact on model performance.
  • The paper provides empirical evidence through experiments on image captioning, machine translation, and visual question generation tasks. It uses both standard quantitative metrics and qualitative human studies to validate the effectiveness of DBS.
  • DBS shows significant improvements in diversity without compromising task-specific performance metrics. This is particularly evident in cases of complex images where diverse descriptions are more likely.
  • The paper discusses the role of diversity in image-grounded language generation tasks, highlighting that DBS consistently outperforms BS and previously proposed techniques for diverse decoding. DBS is shown to be robust over a wide range of parameter values and is general enough to incorporate various forms of the dissimilarity term.
  • Overall, the paper makes a significant contribution to the field of neural sequence modeling by proposing a novel approach to enhance the diversity of decoded solutions, demonstrating its efficacy across different applications and providing insights into the role of diversity in complex AI tasks.
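  • The group-wise diversity penalty can be sketched for a single time step with Hamming diversity. This is a simplified sketch, not the authors’ algorithm in full: it assumes one beam per group and a shared toy distribution, and `diverse_step` and `diversity_strength` are hypothetical names.

```python
def diverse_step(log_probs_per_group, diversity_strength):
    """Pick one token per group at a single time step.

    Groups are processed in order; each token's score is its log-probability
    minus a penalty proportional to how many earlier groups already chose it
    at this time step (Hamming diversity).
    """
    chosen = []
    for log_probs in log_probs_per_group:
        scores = {
            token: lp - diversity_strength * chosen.count(token)
            for token, lp in log_probs.items()
        }
        chosen.append(max(scores, key=scores.get))
    return chosen

# Toy next-token log-probabilities, shared by three groups.
toy_log_probs = {"a": -0.1, "the": -0.5, "an": -1.2}
without_penalty = diverse_step([toy_log_probs] * 3, diversity_strength=0.0)
with_penalty = diverse_step([toy_log_probs] * 3, diversity_strength=2.0)
```

  • With the penalty set to zero every group picks the same token, which mirrors the near-duplicate behavior of standard beam search; a positive penalty pushes later groups toward tokens the earlier groups did not use.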
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
  • This paper by Bajaj et al. from Microsoft AI & Research introduces the MS MARCO dataset for machine reading comprehension (MRC) and open-domain question answering (QA).
  • The MS MARCO dataset is a large-scale, real-world reading comprehension dataset, consisting of over 1 million anonymized questions derived from Bing’s search query logs, each paired with a human-generated answer. Additionally, it includes 182,669 completely human rewritten generated answers and 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing.
  • The dataset is designed to address shortcomings of existing MRC and QA datasets by using real user search queries, making it more representative of natural information-seeking behavior. This contrasts with other datasets where questions are often generated by crowd workers based on provided text spans or documents.
  • The dataset poses three distinct tasks with varying difficulty levels: predicting if a question is answerable given context passages and synthesizing an answer, generating a well-formed answer based on context passages, and ranking a set of retrieved passages given a question.
  • MS MARCO’s complexity and real-world relevance are intended to benchmark machine reading comprehension and question-answering models, especially in handling realistic, noisy, and problematic inputs.
  • The paper also discusses the unique features of the dataset, such as questions being real user queries issued to Bing, the presence of multiple or no answers for some questions, and the inclusion of a large set of passages for each question to mimic real-world information retrieval conditions.
  • It provides benchmarking results on the dataset, evaluating different machine learning models for their effectiveness in handling the dataset’s tasks. These results include assessments of generative and discriminative models using metrics like ROUGE-L and BLEU for answer quality.
  • In summary, the MS MARCO dataset represents a significant step forward in developing and benchmarking MRC and QA systems, offering a large-scale, realistic dataset derived from actual user queries and incorporating a variety of tasks to test different aspects of machine comprehension.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • This paper by Devlin et al. from Google in NAACL 2019 proposed BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language representation model that pre-trains bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is pre-trained using two unsupervised tasks: (i) masked language modeling (MLM) and (ii) next sentence prediction (NSP).
    • MLM is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.
    • NSP is needed because many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, they pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
  • Fine-tuning for the task at hand involves using an additional output layer, to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
  • BERT comes in two flavors: (i) BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; (ii) BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
  • BERT consumes a maximum of 512 input tokens. At its output, the token embeddings of BERT Base have 768 dimensions.
  • BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
  • BERT demonstrated that unsupervised pretraining is an integral part of many language understanding systems and enables even low-resource tasks to benefit from them.
  • Google Blog’s article that discusses using BERT for improving search relevance and ranking.
  • Also, here’s a brief timeline of NLP models from Bag of Words to the Transformer family from Fabio Chiusano:

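  • The MLM corruption step can be sketched as below. This is a minimal sketch, not BERT’s actual preprocessing code: it assumes the commonly cited recipe of selecting about 15% of tokens and then replacing 80% of those with a mask token, 10% with a random token, and leaving 10% unchanged; `mask_tokens` and its parameters are hypothetical names.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")            # usual case: mask it
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # sometimes: random token
            else:
                inputs.append(token)               # sometimes: keep as-is
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels

sentence = "the man went to the store to buy milk".split()
corrupted, targets = mask_tokens(sentence, vocab=sentence, mask_prob=0.5, seed=1)
```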
RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, while hyperparameter choices have significant impact on the final results.
  • This paper by Liu et al. from University of Washington and Facebook AI in 2019 carefully evaluates a number of design decisions when pretraining BERT models.
  • They present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. They find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. They find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
  • Their improved pretraining procedure, which they call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results highlight the importance of previously overlooked design choices, and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
  • Note that RoBERTa uses only the masked language model objective (and does not train using the next sentence prediction objective), and achieves better results than the original BERT.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  • This paper by Lewis et al. from Facebook AI in 2019 presented BART, a denoising autoencoder for pretraining sequence-to-sequence models that learns to map corrupted documents to the original. BART is trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
  • They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token.
  • Background: With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.

  • With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation. However words can only condition on leftward context, so it cannot learn bidirectional interactions.

  • BART applies noising schemes to an input document and thus corrupts it by replacing spans of text with mask symbols. In the diagram below, the corrupted document (left) is encoded with a bidirectional model, and then the likelihood of the original document (right) is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and they use representations from the final hidden state of the decoder. The advantage of using this scheme is that inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations.

  • BART is particularly effective when finetuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
  • BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining.
  • BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
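  • The in-filling scheme, where each span of text is replaced with a single mask token, can be sketched as below. This is a minimal sketch with hypothetical names; spans are passed in explicitly as non-overlapping `(start, length)` pairs to keep the example deterministic, whereas the paper samples span lengths randomly.

```python
def infill(tokens, spans, mask="<mask>"):
    """Replace each (start, length) span with a single mask token.

    A zero-length span inserts a mask without removing any token, so the
    model also has to predict how many tokens are missing from a span.
    Spans are assumed to be non-overlapping.
    """
    out, pos = [], 0
    for start, length in sorted(spans):
        out.extend(tokens[pos:start])  # copy the untouched tokens before the span
        out.append(mask)               # one mask stands in for the whole span
        pos = start + length
    out.extend(tokens[pos:])
    return out

original = ["A", "B", "C", "D", "E"]
corrupted_doc = infill(original, spans=[(1, 2), (4, 0)])
```

  • The corrupted sequence is shorter than the original here, which illustrates why the encoder input does not need to be aligned with the decoder output.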
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
  • This paper by Sanh et al. from Hugging Face, presented at the Energy Efficient Machine Learning and Cognitive Computing workshop at NeurIPS 2019, introduced DistilBERT, a smaller general-purpose pre-trained language representation model distilled from BERT. DistilBERT is 40% smaller and 60% faster than BERT, is cheaper to pre-train, and retains 97% of its language understanding capabilities. DistilBERT can be fine-tuned with good performance on a wide range of tasks, much like its larger counterparts.
  • While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that DistilBERT is a compelling option for edge applications.
  • To leverage the inductive biases learned by larger models during pretraining, they introduce a triple loss combining language modeling, distillation and cosine-distance losses.
  • The following graph shows the parameter counts of several recently released pretrained language models:

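  • The triple loss combining language modeling, distillation, and cosine-distance terms can be sketched for a single masked position. This is a minimal sketch in plain Python, not the released training code: the equal weights, the temperature of 2.0, and all function names are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the softened teacher and student distributions.
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

def mlm_loss(student_logits, target_index):
    # Standard cross-entropy against the true masked token.
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    # One minus cosine similarity: aligns the directions of hidden states.
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    w_distill, w_mlm, w_cos = weights
    return (w_distill * distillation_loss(student_logits, teacher_logits)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

loss = triple_loss([2.0, 0.5, 0.1], [2.5, 0.3, 0.2], 0, [1.0, 0.0], [0.9, 0.1])
```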
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
  • This paper by Dai et al. from CMU and Google Brain in 2019 proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence.
  • Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme that uses relative positional embeddings (compared to the absolute positional encoding in a vanilla Transformer architecture) which enable longer-context attention.
  • Transformer-XL not only captures longer-term dependency than RNNs and vanilla Transformers and achieves substantial speedup during evaluation, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
  • They improve the state-of-the-art results of BPC/Perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
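  • The segment-level recurrence can be sketched at the level of which tokens each segment is allowed to see. This is a simplified sketch with hypothetical names: it caches raw tokens, whereas the model caches hidden states from the previous segment and reuses them as extra context.

```python
def segment_contexts(tokens, seg_len, mem_len):
    """Return (context, segment) pairs for consecutive fixed-length segments.

    The context of a segment is the cached memory from earlier segments
    followed by the segment itself. With mem_len=0 this reduces to a
    vanilla fixed-length context, where every segment starts from nothing.
    """
    memory, pairs = [], []
    for start in range(0, len(tokens), seg_len):
        segment = tokens[start:start + seg_len]
        context = memory + segment
        pairs.append((context, segment))
        # Keep only the most recent mem_len items for the next segment.
        memory = context[-mem_len:] if mem_len > 0 else []
    return pairs

with_memory = segment_contexts(list(range(6)), seg_len=2, mem_len=2)
no_memory = segment_contexts(list(range(6)), seg_len=2, mem_len=0)
```

  • Without memory, the first token of each segment has no preceding context at all, which is the context fragmentation problem that the recurrence addresses.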
XLNet: Generalized Autoregressive Pretraining for Language Understanding
  • With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects the dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
  • This paper by Yang et al. from CMU and Google in 2019 proposes XLNet considering BERT’s aforementioned pros and cons, and offers a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (thereby proposing a new objective called Permutation Language Modeling), and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Put simply, XLNet is a generalized autoregressive pretraining method that uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoder methods.
  • Furthermore, the neural architecture of XLNet is developed to work seamlessly with the autoregressive objective, including integrating ideas from Transformer-XL, the state-of-the-art autoregressive model and the careful design of the two-stream attention mechanism. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
  • Code.
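  • The permutation language modeling objective can be sketched as an attention mask derived from a sampled factorization order: each position may only look at positions that come earlier in that order, while the tokens themselves stay in their original places. This is a minimal sketch with hypothetical names; it shows only the mask in which a position cannot see its own content, and omits the two-stream attention details.

```python
import random

def permutation_mask(order):
    """mask[i][j] is True when position i may attend to position j.

    order lists the positions in the sampled factorization order, so a
    position sees exactly those positions that precede it in that order.
    """
    rank = {position: r for r, position in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

# A fixed order for illustration: predict position 2, then 0, then 3, then 1.
mask = permutation_mask([2, 0, 3, 1])

# During training a new order would be sampled for each sequence.
sampled_order = random.Random(0).sample(range(4), 4)
sampled_mask = permutation_mask(sampled_order)
```

  • Averaged over many sampled orders, every position gets to condition on tokens from both its left and its right, which is how an autoregressive objective can still learn bidirectional context.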
Adaptive Input Representations for Neural Language Modeling
  • This paper by Baevski and Auli from Facebook AI in 2019 introduces adaptive input representations by varying the size of input word embeddings for neural language modeling. Adaptive input embeddings can improve accuracy while drastically reducing the number of model parameters.
  • There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units.
  • They perform a systematic comparison of popular choices for a self-attentional architecture.
  • Their experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters.
  • On the WIKITEXT-103 benchmark, they achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, they achieve 23.02 perplexity.
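  • The parameter savings from varying the embedding size can be sketched with a simple count. This is a back-of-the-envelope sketch under stated assumptions: the vocabulary is split into frequency bands, each band’s embedding width shrinks by a constant factor, and each band has a linear projection back to the model dimension; the cutoffs, vocabulary size, and names below are hypothetical.

```python
def adaptive_embedding_params(cutoffs, vocab_size, d_model, factor=4):
    """Return the per-band embedding widths and the total parameter count.

    Band 0 holds the most frequent words and uses the full width d_model;
    each later band divides the width by `factor`.
    """
    bounds = [0, *cutoffs, vocab_size]
    dims, total = [], 0
    for band in range(len(bounds) - 1):
        band_vocab = bounds[band + 1] - bounds[band]
        band_dim = d_model // factor ** band
        dims.append(band_dim)
        # Embedding table for the band plus its projection up to d_model.
        total += band_vocab * band_dim + band_dim * d_model
    return dims, total

dims, adaptive_total = adaptive_embedding_params([20_000, 60_000], 260_000, 1024)
fixed_total = 260_000 * 1024  # one full-width embedding for every word
```

  • Most of the vocabulary is rare words, so giving them narrow embeddings removes the bulk of the input-layer parameters while frequent words keep full capacity.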
Attention Interpretability Across NLP Tasks
  • This paper by Vashishth et al. from IISc and Google in 2019 seeks to empirically validate the hypothesis that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction.
  • Some works (Jain & Wallace, 2019; Vig & Belinkov, 2019) have demonstrated that attention weights are not interpretable and that altering them does not affect the model output, while several others have shown that attention captures several linguistic notions in the model. The authors extend the analysis of prior works to diverse NLP tasks and demonstrate that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction and cannot simply be reduced to a gating unit. Rather than taking a black-and-white position, the paper draws on previous literature that questioned the claim that “attentions are indicative of model predictions” and shows “when is attention interpretable and when it is not”.
  • The attention layer in a neural network model provides insights into the model’s reasoning behind its prediction, which are usually criticized for being opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights. Amid such confusion arises the need to understand attention mechanism more systematically. The paper attempts to fill this gap by giving a comprehensive explanation which justifies both kinds of observations (i.e., when is attention interpretable and when it is not). Through a series of experiments on diverse NLP tasks, they validate their observations and reinforce the claim of interpretability of attention through manual evaluation.
  • They find that in both single and pair sequence tasks, the attention weights in samples with original weights do make sense in general. However, in single sequence tasks, the attention mechanism learns to give higher weights to tokens relevant to both kinds of sentiment. They show that attention weights in single sequence tasks do not provide a reason for the prediction, whereas in pairwise tasks, attention does reflect the reasoning behind the model output.
  • Unrelated to the paper: To use attention visualization as a proxy for interpreting your predictions, use the BertViz library. The lib supports multiple views and supports a plethora of models (BERT, GPT-2, XLNet, RoBERTa, XLM, ALBERT, DistilBERT, BART etc.). The BertViz repo has some nice examples to get started.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • This paper by Selvaraju et al. from Parikh/Batra’s team at GATech in 2019 proposes a technique for producing ‘visual explanations’ for decisions from a large class of CNN-based models, making them more transparent and explainable.
  • Their approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
  • Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training.
  • They combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
  • In the context of image classification models, their visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias.
  • For image captioning and VQA, their visualizations show that even non-attention based models learn to localize discriminative regions of input image.
  • They devise a way to identify important neurons through Grad-CAM and combine it with neuron names to provide textual explanations for model decisions.
  • Finally, they design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
  • Code; CloudCV demo.
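  • The coarse localization map can be sketched as follows: each feature map of the final convolutional layer is weighted by the spatial average of the gradient of the target score with respect to it, the weighted maps are summed, and a ReLU keeps only regions with a positive influence. This is a minimal sketch on nested lists with hand-written toy values; in practice the activations and gradients would come from a deep learning framework, and `grad_cam` is a hypothetical name.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one H x W heatmap.

    activations[k][y][x]: feature map k of the final convolutional layer.
    gradients[k][y][x]: gradient of the target score w.r.t. that activation.
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for feature_map, grad_map in zip(activations, gradients):
        # Importance weight: global average pooling of the gradients.
        alpha = sum(sum(row) for row in grad_map) / (height * width)
        for y in range(height):
            for x in range(width):
                heatmap[y][x] += alpha * feature_map[y][x]
    # ReLU: keep only locations that increase the target score.
    return [[max(0.0, value) for value in row] for row in heatmap]

toy_activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

  • The heatmap has the spatial resolution of the final convolutional layer, which is why it is coarse and is upsampled to the image size for visualization.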
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
  • This paper by Artetxe and Schwenk from University of the Basque Country and FAIR introduces an architecture to learn joint multilingual sentence representations, called LASER (Language-Agnostic SEntence Representations), for 93 languages, belonging to more than 30 different families and written in 28 different scripts. The work focuses on universal language agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task. The motivations for such representations are multiple: the hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (typically English) to another, and the possibility to handle code-switching. To that end, they train a single encoder to handle multiple languages, so that semantically similar sentences in different languages are close in the embedding space.
  • Their system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables them to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification.
  • Their experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of their approach.
  • They also introduce a new test set of aligned sentences in 112 languages, and show that their sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
  • Code with the pretrained encoder and multilingual test set.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  • For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset.
  • This paper by Wang et al. from NYU, UW, and DeepMind in ICLR 2019 introduces the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. They further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models.
  • They evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
Parameter-Efficient Transfer Learning for NLP
  • Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task.
  • As an alternative, this paper by Houlsby et al. in ICML 2019 proposes transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing.
  • To demonstrate the effectiveness of adapters, they transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark.
  • Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, they attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
  • The following figure from the paper shows the architecture of the adapter module and its integration with the Transformer. Left: They add the adapter module twice to each Transformer layer: after the projection following multiheaded attention and after the two feed-forward layers. Right: The adapter consists of a bottleneck which contains few parameters relative to the attention and feedforward layers in the original model. The adapter also contains a skip-connection. During adapter tuning, the green layers are trained on the downstream data; these include the adapter, the layer normalization parameters, and the final classification layer (not shown in the figure).
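  • As a rough illustration of the bottleneck-plus-skip-connection design described above, the following NumPy sketch implements a single adapter and compares its parameter count with one dense layer of the same width. The hidden size (768), bottleneck size (64), GELU nonlinearity, and near-zero initialization are assumptions chosen for illustration, not values taken from this summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the choice of nonlinearity is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip-connection."""

    def __init__(self, d_model, bottleneck, rng):
        # Small random weights keep the module close to the identity at the start (assumed).
        self.w_down = rng.normal(0.0, 1e-3, size=(d_model, bottleneck))
        self.b_down = np.zeros(bottleneck)
        self.w_up = rng.normal(0.0, 1e-3, size=(bottleneck, d_model))
        self.b_up = np.zeros(d_model)

    def __call__(self, h):
        z = gelu(h @ self.w_down + self.b_down)   # project down to the bottleneck
        return h + z @ self.w_up + self.b_up      # project back up and add the skip-connection

    def num_params(self):
        return self.w_down.size + self.b_down.size + self.w_up.size + self.b_up.size

rng = np.random.default_rng(0)
d_model, bottleneck = 768, 64                     # hypothetical sizes
adapter = Adapter(d_model, bottleneck, rng)
h = rng.normal(size=(4, d_model))                 # 4 token representations
out = adapter(h)
dense_params = d_model * d_model + d_model        # one d_model x d_model dense layer, for comparison
print(out.shape, adapter.num_params(), dense_params)
```

  • Because only the adapter weights (plus layer normalization and the classifier) are trained, the per-task cost is governed by the bottleneck size rather than by the size of the frozen network.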

Cross-lingual Language Model Pretraining
  • Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding.
  • This paper by Lample and Conneau from FAIR extends this approach to multiple languages and shows the effectiveness of cross-lingual pretraining. They propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
  • They utilize a shared sub-word vocabulary by processing all languages with the same shared vocabulary created through Byte Pair Encoding (BPE). This greatly improves the alignment of embedding spaces across languages that share either the same alphabet or anchor tokens such as digits or proper nouns. They learn the BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora.
  • They re-balance low/high resource languages using multinomial sampling. Specifically, sentences are sampled according to a multinomial distribution (with \(\alpha=0.5\)) with probabilities \(\left\{q_i\right\}_{i=1 \ldots N}\), where:

    \[q_i=\frac{p_i^\alpha}{\sum_{j=1}^N p_j^\alpha} \quad \text { with } p_i=\frac{n_i}{\sum_{k=1}^N n_k}\]
  • Sampling with this distribution increases the number of tokens associated to low-resource languages and alleviates the bias towards high-resource languages. In particular, this prevents words of low-resource languages from being split at the character level.
  • They obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation.
  • On XNLI, XLM pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, they obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, they obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU.
  • The following figure from the paper shows the concept behind cross-lingual language model pretraining. The MLM objective is similar to the one of Devlin et al. (2018), but with continuous streams of text as opposed to sentence pairs. The TLM objective extends MLM to pairs of parallel sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is encouraged to align English and French representations. Position embeddings of the target sentence are reset to facilitate the alignment.
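  • The multinomial re-balancing formula in this entry can be sketched in a few lines. The sentence counts below are hypothetical; only the formula and \(\alpha=0.5\) come from the text.

```python
import numpy as np

def sampling_probs(sentence_counts, alpha=0.5):
    """Return (p, q): raw language proportions p_i and re-balanced probabilities q_i."""
    n = np.asarray(sentence_counts, dtype=float)
    p = n / n.sum()                 # p_i = n_i / sum_k n_k
    q = p**alpha
    return p, q / q.sum()           # q_i = p_i^alpha / sum_j p_j^alpha

# Hypothetical corpus: one high-resource and one low-resource language.
counts = [1_000_000, 10_000]
p, q = sampling_probs(counts)
print("raw share of low-resource language:     ", round(float(p[1]), 4))
print("sampled share of low-resource language: ", round(float(q[1]), 4))
```

  • With a 100:1 ratio of sentences, \(\alpha=0.5\) turns the ratio of sampling probabilities into \(\sqrt{100}:1 = 10:1\), so the low-resource language is sampled far more often than its raw share would suggest.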

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
  • A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms.
  • This paper by Zhao et al. in EMNLP 2019 proposes a new metric, called MoverScore, that shows a high correlation with human judgment of text quality.
  • They validate MoverScore on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Their findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use, they make their metrics available as a web service.
  • The following figure from the paper shows an illustration of MoverScore vs. BERTScore.

Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data
  • This paper by Popov et al. from Yandex introduces the Neural Oblivious Decision Ensembles (NODE) architecture for machine learning on tabular data.
  • NODE generalizes ensembles of oblivious decision trees, allowing for gradient-based optimization and multi-layer hierarchical representation learning. It’s designed to improve performance on tabular data, a domain where deep learning hasn’t outperformed gradient boosting decision trees (GBDT).
  • NODE uses differentiable oblivious decision trees, which are more efficient and less prone to overfitting compared to conventional decision trees. This architecture allows for end-to-end training and integrates smoothly into deep learning pipelines.
  • A key feature of NODE is the use of the entmax transformation, which enables differentiable split decision construction within the tree nodes. Entmax generalizes both sparsemax and softmax; it is able to learn sparse decision rules, but is smoother than sparsemax, being more appropriate for gradient-based optimization. Entmax is capable of producing sparse probability distributions and learning splitting decisions based on a small subset of data features.
  • The following figure from the paper shows a single oblivious decision tree (ODT) inside the NODE layer. The splitting features and the splitting thresholds are shared across all the internal nodes of the same depth. The output is a sum of leaf responses scaled by the choice weights.

  • The following figure from the paper shows an illustration of the NODE architecture, consisting of densely connected NODE layers. Each layer contains several trees whose outputs are concatenated and serve as input for the subsequent layer. The final prediction is obtained by averaging the outputs of all trees from all the layers.

  • The architecture was extensively compared to leading GBDT implementations like CatBoost and XGBoost across various datasets. NODE consistently outperformed these traditional methods, particularly in settings with default hyperparameters.
  • NODE’s design includes a multidimensional tree output for classification problems and a concatenation of outputs from multiple trees. This facilitates learning both shallow and deep decision rules.
  • The paper also presents an ablative analysis, demonstrating the influence of different architectural choices, like choice functions (e.g., softmax, entmax) and architecture depth on performance.
  • The authors highlight the potential of incorporating NODE into complex pipelines for multi-modal problems, suggesting future research directions in integrating tabular data with other data types like images or sequences.
  • Overall, NODE introduces an innovative deep learning architecture for tabular data, showcasing its effectiveness over traditional methods and opening new avenues for research in this domain.
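  • To make the "sum of leaf responses scaled by the choice weights" concrete, here is a minimal NumPy sketch of one differentiable oblivious tree. Note the substitutions: softmax stands in for entmax in the feature selection and a sigmoid stands in for the two-class entmax in the split decision, so this is a simplified relative of NODE rather than the architecture from the paper. All parameter values are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_oblivious_tree(x, feature_logits, thresholds, leaf_responses, tau=1.0):
    """x: (n_features,), feature_logits: (depth, n_features),
    thresholds: (depth,), leaf_responses: (2**depth,)."""
    feature_weights = softmax(feature_logits)      # one soft feature choice per depth level
    selected = feature_weights @ x                 # (depth,) soft-selected feature values
    go_right = 1.0 / (1.0 + np.exp(-(selected - thresholds) / tau))
    # Every node at the same depth shares one split, so the leaf weights are
    # the outer product of the per-level [left, right] probabilities.
    choice = np.ones(1)
    for c in go_right:
        choice = np.outer(choice, np.array([1.0 - c, c])).ravel()
    return float(choice @ leaf_responses), choice

x = np.array([0.2, 0.9])
feature_logits = np.array([[50.0, 0.0],            # level 0 looks (almost only) at feature 0
                           [0.0, 50.0]])           # level 1 looks (almost only) at feature 1
thresholds = np.array([0.5, 0.5])
leaf_responses = np.array([10.0, 20.0, 30.0, 40.0])

y_soft, w_soft = soft_oblivious_tree(x, feature_logits, thresholds, leaf_responses, tau=1.0)
y_hard, w_hard = soft_oblivious_tree(x, feature_logits, thresholds, leaf_responses, tau=1e-3)
print(y_soft, y_hard)
```

  • With a small temperature the soft tree approaches a hard one: the input goes left at level 0 and right at level 1, which selects the second leaf.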
Latent Retrieval for Weakly Supervised Open Domain Question Answering
  • This paper by Lee et al. from Google Research in ACL 2019 addresses the challenge of open domain question answering (QA) without relying on strong supervision of evidence or black-box information retrieval (IR) systems.
  • The authors introduce the Open-Retrieval Question Answering system (ORQA), which learns to retrieve evidence from an open corpus, supervised only by question-answer string pairs. This approach contrasts with traditional methods that either assume gold evidence or depend on black-box IR systems.
  • A central aspect of ORQA is its ability to retrieve any text from an open corpus, unlike traditional methods that rely on a closed set of evidence documents. This capability is enabled by pre-training the retriever using an unsupervised Inverse Cloze Task (ICT). In ICT, a sentence is treated as a pseudo-question, and its context is treated as pseudo-evidence, requiring the model to predict the context given the sentence.
  • The implementation of ORQA leverages the BERT (Bidirectional Encoder Representations from Transformers) architecture for both its retriever and reader components. This choice capitalizes on recent advances in transfer learning and the strong representational power of BERT.
    • Here are the key aspects of the ORQA model architecture:
      1. Retriever Component:
        • The retriever is the first key component of ORQA. It is responsible for selecting relevant document passages from a large corpus that may contain the information required to answer the input question.
        • This component is pre-trained using an unsupervised learning task called the Inverse Cloze Task (ICT). In ICT, the model is given a sentence (treated as a pseudo-question) and is tasked with identifying its surrounding context (treated as pseudo-evidence) from the corpus. This pre-training helps the model learn an effective strategy for document retrieval based on the context of questions.
      2. Reader Component:
        • Following the retrieval stage, the reader component takes over. It processes the passages retrieved by the retriever to generate a precise answer to the input question.
        • The reader, like the retriever, is based on the BERT model. It is fine-tuned to perform the question answering task, taking into account the context provided by the passages retrieved by the retriever.
      3. Integration of BERT:
        • Both the retriever and reader components are built on the BERT framework. BERT’s powerful bidirectional context understanding capabilities make it ideal for understanding the nuances in natural language questions and passages.
        • The use of BERT as a base model facilitates effective transfer learning, where the model, pre-trained on a large corpus, adapts to the specific requirements of question answering and document retrieval tasks through fine-tuning.
      4. End-to-End Training:
        • ORQA is unique in that it is trained end-to-end, meaning that both the retriever and reader are trained simultaneously. This approach allows the retriever to be optimized specifically for the types of questions and answers handled by the reader, leading to a more coherent and effective QA system.
    • In essence, ORQA’s architecture represents a significant advance in open-domain question answering systems, allowing it to handle a wide range of questions by effectively searching and interpreting a vast corpus of unstructured text.
  • The authors address the challenges of inference and learning in an open evidence corpus with a large search space and latent navigation requirements. They accomplish this by pre-training the retriever to provide a strong initialization, enabling dynamic yet fast top-k retrieval during fine-tuning.
  • The following figure from the paper shows an overview of ORQA. A subset of all possible answer derivations given a question \(q\) is shown. Retrieval scores \(S_{\text{retr}}(q, b)\) are computed via inner products between BERT-based encoders. Top-scoring evidence blocks are jointly encoded with the question, and span representations are scored with a multi-layer perceptron (MLP) to compute \(S_{\text{read}}(q, b, s)\). The final joint model score is \(S_{\text{retr}}(q, b)+S_{\text{read}}(q, b, s)\). Unlike previous work using IR systems for candidate proposal, they learn to retrieve from all of Wikipedia directly.

  • ORQA’s effectiveness is demonstrated through its performance on open versions of five QA datasets. Notably, on datasets where question writers are genuinely seeking information (as opposed to knowing the answer beforehand), ORQA significantly outperforms traditional IR systems like BM25.
  • The paper includes a comprehensive experimental setup, comparing ORQA with other retrieval methods on different datasets. These comparisons illustrate ORQA’s strengths, especially in scenarios where question askers do not already know the answer, highlighting the importance of learned retrieval in such contexts.
  • The authors also discuss the challenges and potential biases in the datasets used for evaluation, providing insights into the limitations and considerations for open-domain QA systems.
  • In summary, this paper presents a novel approach to open-domain question answering by introducing an end-to-end model that jointly learns retriever and reader components. This model significantly improves upon traditional methods in scenarios that reflect genuine information-seeking questions, marking a notable advancement in the field of natural language processing and QA systems.
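  • The scoring scheme in this entry (inner-product retrieval, top-k selection, then retrieval score plus reader score) can be sketched as follows. The vectors and reader scores are hypothetical fixed numbers standing in for the outputs of the BERT encoders and the span-scoring MLP; the function names are illustrative only.

```python
import numpy as np

def top_k_blocks(q_vec, block_vecs, k):
    """Retrieval scores are inner products between question and block vectors."""
    scores = block_vecs @ q_vec
    top = np.argsort(-scores)[:k]                  # indices of the k best-scoring blocks
    return top, scores[top]

def best_derivation(retr_scores, read_scores):
    """read_scores[i, s] is the reader score for span s in the i-th retrieved block.
    Returns (position in top-k list, span index, joint score)."""
    joint = retr_scores[:, None] + read_scores     # retrieval score + reader score
    i, s = np.unravel_index(np.argmax(joint), joint.shape)
    return int(i), int(s), float(joint[i, s])

q_vec = np.array([1.0, 0.0])                       # hypothetical question embedding
block_vecs = np.array([[0.9, 0.1],                 # hypothetical evidence block embeddings
                       [0.1, 0.9],
                       [0.5, 0.5]])
top, retr = top_k_blocks(q_vec, block_vecs, k=2)
read = np.array([[0.1, 0.2],                       # hypothetical reader scores, 2 spans per block
                 [1.0, 0.0]])
pos, span, score = best_derivation(retr, read)
print("best block:", int(top[pos]), "span:", span, "score:", score)
```

  • In this toy case the second-ranked block wins overall because its reader score outweighs its lower retrieval score, which is the behavior a joint score is meant to allow.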


Language Models are Few-Shot Learners
  • Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.
  • This paper by Brown et al. from OpenAI in 2020 introduces Generative Pretrained Transformer (GPT)-3 and shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
  • Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
  • GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
  • At the same time, they also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
  • They also present broader societal impacts of their findings and of GPT-3 in general.
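  • Since few-shot tasks are specified purely as text, a prompt can be assembled with plain string handling. The sketch below shows one possible layout; the "Input:"/"Output:" labels and the example task are hypothetical and are not the prompt format used in the paper.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K solved examples, and one unsolved query."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")                        # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    [("pplea", "apple"), ("nanaab", "banana")],    # K = 2 demonstrations
    "rgnaoe",
)
print(prompt)
```

  • No gradient update is involved: changing the task only means changing the description and the demonstrations placed in the context.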
Longformer: The Long-Document Transformer
  • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
  • This paper by Beltagy et al. from Allen AI in 2020 seeks to address this limitation, by introducing the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
  • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
  • The figure below from the paper compares the full self-attention pattern and the configuration of attention patterns in Longformer.

  • Following prior work on long-sequence transformers, they evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8.
  • In contrast to most prior work, they also pretrain Longformer and finetune it on a variety of downstream tasks.
  • Their pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. They finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
  • The figure below from the paper illustrates the runtime and memory of full self-attention and different implementations of Longformer’s self-attention; Longformer-loop is nonvectorized, Longformer-chunk is vectorized, and Longformer-cuda is a custom CUDA kernel implementation. Longformer’s memory usage scales linearly with the sequence length, unlike the full self-attention mechanism that runs out of memory for long sequences on current GPUs. Different implementations vary in speed, with the vectorized Longformer-chunk being the fastest.
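  • A small sketch can show why the combination of local windowed attention and a few global positions scales linearly. It builds the attention pattern as a dense boolean matrix purely for counting; a practical implementation would never materialize the full matrix. The window size and sequence lengths are arbitrary illustrative choices.

```python
import numpy as np

def longformer_mask(seq_len, window, global_positions=()):
    """mask[i, j] is True when position i may attend to position j."""
    idx = np.arange(seq_len)
    # Local attention: each token sees window // 2 tokens on either side of itself.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    # Global attention: chosen tokens attend everywhere and are attended to by everyone.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

for n in (1024, 2048):
    local = int(longformer_mask(n, window=8).sum())
    print(n, "local entries:", local, "full entries:", n * n)

small = longformer_mask(16, window=4, global_positions=[0])
```

  • Doubling the sequence length roughly doubles the number of attended pairs in the windowed pattern, whereas it quadruples the count for full self-attention.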

Big Bird: Transformers for Longer Sequences
  • The primary limitation of Transformer-based models is the quadratic complexity (mainly in terms of memory, but also computation) on the sequence length due to their full attention mechanism. BigBird by Zaheer et al. from Google, published in NeurIPS 2020, remedied this by proposing a sparse attention mechanism that reduces this quadratic complexity to linear.
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Although measuring held-out test-set accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Further, ML systems can run to completion without throwing any errors (indicating functional correctness) but can still produce incorrect outputs (indicating behavioral issues). Thus, it is important to test the behavioral aspects of a model to make sure it works as expected.
  • This paper by Ribeiro et al. from Microsoft, UW and UCI in 2020 introduces CheckList, a model-agnostic and task-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList tests individual capabilities of the model using three different test types.
  • CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. They illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models.
  • Tests created with CheckList can be applied to any model, making it easy to incorporate in current benchmarks or evaluation pipelines. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model that has “solved” existing benchmarks on three different tasks. They incorporated three distinct types of tests:
    • Minimum Functionality Test (MFT): A Minimum Functionality Test (MFT) uses simple examples to make sure the model can perform a specific task well. For example, they might want to test the performance of a sentiment model when dealing with negations.
    • Invariance Test: Besides testing the functionality of a model, they might also want to test if the model prediction stays the same when trivial parts of inputs are slightly perturbed. These tests are called Invariance Tests (INV).
    • Directional Expectation Test: In the Invariance Test, they expect the outputs after the perturbation to be the same. However, sometimes they might expect the output after perturbation to change in a known direction. That is when Directional Expectation Tests come in handy. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • Code.
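  • The three test types can be illustrated without any tooling. The sketch below does not use the CheckList library or its API; it runs hand-written checks against a hypothetical keyword-based sentiment scorer, and every function name and example sentence is made up for illustration.

```python
import re

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}
NEGATIONS = {"not", "never"}

def toy_sentiment(text):
    """Hypothetical stand-in model: returns a sentiment score in [-1, 1]."""
    score, flip = 0, 1
    for tok in re.findall(r"[a-z']+", text.lower()):
        if tok in NEGATIONS:
            flip = -1                              # flip the polarity of the next sentiment word
        elif tok in POSITIVE:
            score, flip = score + flip, 1
        elif tok in NEGATIVE:
            score, flip = score - flip, 1
    return max(-1.0, min(1.0, score / 2))

def run_mft(model, cases):
    """Minimum Functionality Test: simple labelled examples for one capability."""
    return [text for text, label in cases if (model(text) > 0) != (label == "pos")]

def run_inv(model, pairs):
    """Invariance test: a label-preserving perturbation must not change the output."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]

def run_dir(model, pairs):
    """Directional test: adding a negative phrase must not raise the score."""
    return [(a, b) for a, b in pairs if model(b) > model(a)]

mft_failures = run_mft(toy_sentiment, [("The food is not good.", "neg"),
                                       ("This is not bad.", "pos")])
inv_failures = run_inv(toy_sentiment, [("I love the flight to Chicago.",
                                        "I love the flight to Dallas.")])
dir_failures = run_dir(toy_sentiment, [("The service was good.",
                                        "The service was good. I hate it.")])
print(mft_failures, inv_failures, dir_failures)
```

  • Each function returns the failing cases, so an empty list means the model passed that test; a mislabelled expectation such as `("It is good.", "neg")` would be reported by `run_mft`.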
The Curious Case of Neural Text Degeneration
  • Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
  • This paper by Holtzman et al. from Choi’s lab at UW in ICLR 2020 provides a deep analysis of the properties of the most common decoding methods for open-ended language generation. It reveals surprising distributional differences between human text and machine text.
  • In addition, they find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. They show that likelihood maximizing decoding causes repetition and overly generic language usage, while sampling methods without truncation risk sampling from the low-confidence tail of a model’s predicted distribution. Their findings motivate Nucleus (or top-p) Sampling, a simple but effective method that captures the region of confidence of language models effectively to draw the best out of neural generation.
  • By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
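  • The truncation step behind nucleus sampling can be sketched with NumPy: keep the smallest set of highest-probability tokens whose cumulative mass reaches \(p\), renormalize, and sample from that set. The five-token distribution and the threshold of 0.85 are hypothetical.

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Zero out the low-probability tail and renormalize the remaining nucleus."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)                     # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def nucleus_sample(probs, top_p, rng):
    return int(rng.choice(len(probs), p=nucleus_filter(probs, top_p)))

next_token_probs = [0.5, 0.3, 0.1, 0.05, 0.05]     # hypothetical model output
filtered = nucleus_filter(next_token_probs, top_p=0.85)
rng = np.random.default_rng(0)
samples = [nucleus_sample(next_token_probs, 0.85, rng) for _ in range(200)]
print(filtered, sorted(set(samples)))
```

  • The size of the nucleus is dynamic: a peaked distribution keeps only a handful of tokens, while a flat one keeps many, which is the difference from truncating to a fixed number of tokens.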
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  • Pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
  • This paper by Clark et al. in 2020 from Manning’s lab at Stanford proposes replaced token detection, a more sample-efficient self-supervised pre-training task for language representation learning, as an alternative to BERT’s masked language modeling (MLM). Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the key idea is training a discriminative text encoder model to distinguish input tokens from high-quality negative samples produced by a small generator network.
  • Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
  • As a result, compared to MLM, their pre-training objective is more compute-efficient and results in better performance on downstream tasks. The contextual representations learned by their approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
  • The gains are particularly strong for small models; for example, they train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Their approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
  • Since ELECTRA works well even when using relatively small amounts of compute, the authors hope this will make developing and applying pre-trained text encoders more accessible to researchers and practitioners with less access to computing resources.
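  • The data side of replaced token detection can be sketched in plain Python. The generator is replaced here by hard-coded samples, so this only illustrates how the discriminator's inputs and labels are formed, not how either network is trained; the sentence and positions are illustrative.

```python
def corrupt_with_generator(tokens, masked_positions, generator_samples):
    """Replace the masked positions with generator samples and label every token.

    Label 1 means "replaced", 0 means "original". A sample that happens to equal
    the original token is labelled as original.
    """
    corrupted = list(tokens)
    for pos, sample in zip(masked_positions, generator_samples):
        corrupted[pos] = sample
    labels = [int(c != o) for c, o in zip(corrupted, tokens)]
    return corrupted, labels

tokens = ["the", "chef", "cooked", "the", "meal"]
masked_positions = [0, 2]                          # positions chosen for corruption
generator_samples = ["the", "ate"]                 # stand-in for samples from a small generator

corrupted, labels = corrupt_with_generator(tokens, masked_positions, generator_samples)
print(corrupted, labels)
print("positions with a training signal, MLM:", len(masked_positions),
      "replaced token detection:", len(labels))
```

  • The last line reflects the efficiency argument in this entry: the discriminator receives a label for every input token, while MLM only predicts the masked subset.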
TinyBERT: Distilling BERT for Natural Language Understanding
  • Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices.
  • This paper by Jiao et al. from Huazhong University of Science and Technology, Wuhan National Lab for Optoelectronics, and Huawei Noah’s Ark Lab in EMNLP 2020 proposes a novel Transformer distillation method, specially designed for knowledge distillation (KD) of Transformer-based models, to accelerate inference and reduce model size while maintaining accuracy. They also propose a two-stage framework for TinyBERT.
  • By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT.
  • Then, they introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT.
  • TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference.
  • Extensive experiments show that TinyBERT achieves competitive performance while significantly reducing the model size and inference time of BERTBASE, which provides an effective way to deploy BERT-based NLP models on edge devices. Specifically, TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERTBASE.
  • Code.
MPNet: Masked and Permuted Pre-training for Language Understanding
  • BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning.
  • This paper by Song et al. from Nanjing University and Microsoft Research in NeurIPS 2020 proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
  • MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence, thus reducing the position discrepancy (vs. PLM in XLNet).
  • They pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of downstream tasks (GLUE, SQuAD, etc.). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.
  • Code and pre-trained models.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • This paper by Raffel et al. from Google in JMLR delves into the domain of transfer learning in natural language processing (NLP). Published in 2020, it introduces a unified framework, named the Text-to-Text Transfer Transformer (T5), which reformulates all NLP tasks into a text-to-text format. This approach enables the use of a single model, loss function, and set of hyperparameters across various tasks such as translation, question answering, classification, summarization, and sentiment analysis.
  • The paper’s primary goal is not to propose new methods, but to offer a comprehensive view of the existing landscape in transfer learning for NLP. It includes a survey, exploration, and empirical comparison of existing techniques. The team scales up their models to up to 11 billion parameters to assess the current limits and achieve state-of-the-art results on numerous benchmarks.
  • The T5 model is built upon the Transformer architecture, which has become prevalent in recent NLP research. This architecture, originally designed for machine translation, has been effectively applied to various NLP tasks. The T5 model, in particular, employs an encoder-decoder structure with each component being of similar size and configuration to a BERTBASE stack, amounting to approximately 220 million parameters in total.
  • The following figure from the paper shows a diagram of their text-to-text framework. Every task they consider (including translation, question answering, and classification) is cast as feeding the model text as input and training it to generate some target text. This allows them to use the same model, loss function, hyperparameters, etc. across their diverse set of tasks. It also provides a standard testbed for the methods included in their empirical survey. “T5” refers to their model, which they dub the “Text-to-Text Transfer Transformer”.

  • For the training process, the T5 employs a denoising objective for pre-training, where a portion of the input tokens is randomly masked, and the model is trained to predict these missing tokens. This pre-training is conducted using unlabeled data, leveraging a dataset named “Colossal Clean Crawled Corpus” (C4), which is an extensive collection of clean and natural English text extracted and processed from the Common Crawl archive.
  • The model’s training uses the maximum likelihood objective with teacher forcing and a cross-entropy loss. SentencePiece is used for encoding text into WordPiece tokens, with a vocabulary size of 32,000 wordpieces, encompassing English, German, French, and Romanian languages.
  • In terms of architectural variants, the paper examines different attention mask patterns used in Transformer models. For instance, the encoder in T5 uses a fully-visible attention mask, enabling the self-attention mechanism to consider the entire input sequence. In contrast, the decoder employs a causal masking pattern, preventing each output element from depending on future input elements.
  • The methodology and findings in this paper provide valuable insights into the practical application of transfer learning in NLP, particularly in how large-scale pre-trained models can be effectively adapted to a wide range of language tasks.
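  • A minimal sketch of the denoising objective described above, assuming the span-corruption form in which each dropped span is replaced by a sentinel token and the target lists the dropped spans. The sentinel names (`<S0>`, `<S1>`, ...) and the `span_corrupt` helper are illustrative assumptions, and the spans are passed in by hand rather than sampled randomly:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    `spans` are sorted, non-overlapping, half-open index ranges chosen by the
    caller (a hypothetical interface; a real pipeline would sample them).
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<S{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<S{len(spans)}>")       # a final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

  • Both the corrupted input and the target are plain text sequences, which is what lets pre-training share the same text-to-text interface as the downstream tasks.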
Scaling Laws for Neural Language Models
  • This paper by Kaplan et al. from OpenAI studies empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range.
  • Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow them to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
  • The following figure from the paper shows that language modeling performance improves smoothly as they increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

  • In particular, they propose that 10x more compute should be spent on a 5.5x larger model and 1.8x more tokens (vs. Chinchilla’s recommendation that 10x more compute should be spent on a 3.2x larger model and 3.2x more tokens).
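  • As a rough worked example, the allocation factors quoted above can be turned into power-law exponents. The sketch assumes model size and token count each scale as a power of compute; the helper name `implied_exponents` is an illustrative assumption:

```python
import math


def implied_exponents(model_factor, token_factor, compute_factor=10.0):
    """Exponents a, b such that model size ~ C^a and tokens ~ C^b."""
    a = math.log(model_factor) / math.log(compute_factor)
    b = math.log(token_factor) / math.log(compute_factor)
    return a, b


kaplan = implied_exponents(5.5, 1.8)      # factors quoted for Kaplan et al.
chinchilla = implied_exponents(3.2, 3.2)  # factors quoted for Chinchilla

print(f"Kaplan:     model ~ C^{kaplan[0]:.2f}, tokens ~ C^{kaplan[1]:.2f}")
print(f"Chinchilla: model ~ C^{chinchilla[0]:.2f}, tokens ~ C^{chinchilla[1]:.2f}")
```

  • In both cases the two factors multiply to roughly 10 (5.5 × 1.8 ≈ 9.9 and 3.2 × 3.2 ≈ 10.2), so the exponents sum to about one. The recipes differ only in how the extra compute is split between parameters and data.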
Unsupervised Cross-lingual Representation Learning at Scale
  • This paper by Conneau et al. from Facebook AI in ACL 2020 shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
  • They train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.
  • Their model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER.
  • XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models.
  • They also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale.
  • Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks.
  • Facebook AI post.
SpanBERT: Improving Pre-training by Representing and Predicting Spans
  • This paper by Joshi et al. from the University of Washington, Princeton University, Allen Institute of Artificial Intelligence, and FAIR in TACL 2020 presents SpanBERT, a pre-training method that is designed to better represent and predict spans of text.
  • Their approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
  • SpanBERT consistently outperforms BERT and their better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution.
  • In particular, with the same training data and model size as BERT-large, SpanBERT obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively.
  • They also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.
  • The following figure from the paper offers an illustration of SpanBERT training. The span “an American football game” is masked. The span boundary objective (SBO) uses the output representations of the boundary tokens, \(\mathbf{x}_4\) and \(\mathbf{x}_9\) (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token “football” (in pink), which, as marked by the position embedding \(\mathbf{p}_3\), is the third token from \(\mathbf{x}_4\).
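  • A reconstruction of the loss the caption refers to, assuming the masked span occupies positions 5 to 8 so that “football” sits at position 7. The MLM term uses the token's own output representation \(\mathbf{x}_7\), while the SBO term uses only the two boundary representations and the relative position embedding:
  \[\mathcal{L}(\text{football}) = \mathcal{L}_{\text{MLM}}(\text{football}) + \mathcal{L}_{\text{SBO}}(\text{football}) = -\log P\left(\text{football} \mid \mathbf{x}_7\right) - \log P\left(\text{football} \mid \mathbf{x}_4, \mathbf{x}_9, \mathbf{p}_3\right)\]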

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
  • Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
  • This paper by Aghajanyan et al. from Facebook AI argues that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon.
  • They empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, they can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC.
  • Furthermore, they empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness.
  • Lastly, they connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
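  • A toy sketch of the low-dimensional reparameterization idea: the full parameter vector is written as frozen pre-trained weights plus a fixed random projection of a small trainable vector. The dense Gaussian projection, the toy sizes, and the quadratic objective are assumptions made for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 2_000, 20                         # toy full dimension and subspace dimension

theta_0 = rng.normal(size=D)             # frozen "pre-trained" parameters
P = rng.normal(size=(D, d)) / np.sqrt(D) # fixed random projection, never trained

# Toy target that is reachable from theta_0 inside the random subspace.
target = theta_0 + P @ rng.normal(size=d)


def full_params(theta_d):
    """Map the d trainable numbers back into the full D-dimensional space."""
    return theta_0 + P @ theta_d


def loss(theta_d):
    return 0.5 * np.sum((full_params(theta_d) - target) ** 2)


theta_d = np.zeros(d)                    # the only trainable parameters
initial_loss = loss(theta_d)
for _ in range(30):
    grad = P.T @ (full_params(theta_d) - target)  # chain rule through the projection
    theta_d = theta_d - 1.0 * grad
final_loss = loss(theta_d)
print(initial_loss, final_loss)
```

  • Only `theta_d` is updated, so the number of trainable parameters is `d` regardless of how large `D` is. The intrinsic dimension is then the smallest `d` for which this restricted optimization reaches a chosen fraction of full fine-tuning performance.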
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
  • This paper by Xiong et al. from Microsoft presents a novel approach to improve dense text retrieval (DR) efficiency and effectiveness.
  • The paper identifies a primary bottleneck in dense retrieval training: the use of uninformative negatives, which leads to slow learning convergence. These negatives are locally sampled in batches and yield diminishing gradient norms and large stochastic gradient variances.
  • To address this, the authors propose Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE). ANCE selects hard training negatives globally from the entire corpus using an asynchronously updated ANN index. This method aligns the distribution of negative samples in training with irrelevant documents in testing.
  • Concretely, an ‘Inferencer’ runs in parallel with training: it computes document encodings with a recent checkpoint from the DR model and refreshes the ANN index, keeping up with the model training.
  • The following figure from the paper shows the asynchronous training of ANCE. The Trainer learns the representation using negatives from the ANN index. The Inferencer uses a recent checkpoint to update the representation of documents in the corpus and once finished, refreshes the ANN index with most up-to-date encodings.

  • The effectiveness of ANCE was demonstrated in three text retrieval scenarios: standard web search, OpenQA (Open Domain Question Answering), and a commercial search engine’s retrieval system. ANCE showed significant improvements over traditional methods, nearly matching the accuracy of a BERT-based cascade IR pipeline while being 100x more efficient.
  • The authors empirically validated that the gradient norms on ANCE-sampled negatives are much bigger than those on local negatives, hence improving the convergence of dense retrieval models.
  • The paper also includes extensive experimental methodologies, evaluation results, and discussions on the convergence of dense retrieval training, highlighting the empirical analyses and theoretical foundations that underpin ANCE.
  • Overall, this paper presents a significant advancement in dense text retrieval by addressing the critical issue of ineffective negative sampling and demonstrating the efficiency and effectiveness of ANCE in various retrieval scenarios.
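  • A minimal sketch of global hard-negative selection, assuming dot-product similarity and using an exact brute-force search as a stand-in for the ANN index. The `hard_negatives` helper and the random toy embeddings are illustrative assumptions:

```python
import numpy as np


def hard_negatives(query_vec, doc_vecs, positive_ids, k):
    """Return the k highest-scoring documents that are not labeled relevant.

    Exact search over the whole corpus stands in for the ANN index lookup.
    """
    scores = doc_vecs @ query_vec
    ranked = np.argsort(-scores)  # most similar documents first
    return [int(i) for i in ranked if int(i) not in positive_ids][:k]


rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 32))               # toy corpus encodings
query_vec = doc_vecs[7] + 0.1 * rng.normal(size=32)  # query close to document 7
negs = hard_negatives(query_vec, doc_vecs, positive_ids={7}, k=5)
print(negs)
```

  • In the asynchronous setup described above, `doc_vecs` would come from the Inferencer's most recent checkpoint, so the negatives lag slightly behind the encoder being trained.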


Towards a Unified View of Parameter-Efficient Transfer Learning
  • Fine-tuning large pre-trained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connections among the various methods are poorly understood.
  • This paper by He et al. from Neubig’s lab at CMU in ICLR 2022 breaks down the design of state-of-the-art parameter-efficient transfer learning methods and presents a unified framework that establishes connections between them. Specifically, they re-frame them as modifications to specific hidden states in pre-trained models, and define a set of design dimensions along which different methods vary, such as the function to compute the modification and the position to apply the modification.
  • Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, they utilize the unified view to identify important design choices in previous methods. Furthermore, their unified framework enables the transfer of design elements across different approaches, and as a result they are able to instantiate new parameter-efficient fine-tuning methods that tune fewer parameters than previous methods while being more effective, achieving comparable results to fine-tuning all parameters on all four tasks.
  • The below figure from the paper offers a graphical illustration of existing methods and the proposed variants. “PLM module” represents a certain sublayer of the PLM (e.g., attention or FFN) that is frozen. “Scaled PA” denotes scaled parallel adapter.

BinaryBERT: Pushing the Limit of BERT Quantization
  • The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution.
  • This paper by Bai et al. from CUHK and Huawei Noah’s Ark Lab in 2021 proposes BinaryBERT, which pushes BERT quantization to the limit by weight binarization.
  • They find that a binary BERT is harder to train directly than a ternary counterpart due to its steep and complex loss landscape. Therefore, they propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network, followed by fine-tuning for further refinement.
  • The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting.
  • Their approach also supports adaptive splitting that can tailor the size of BinaryBERT based on the edge device constraints.
  • Empirical results show that BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
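  • A simplified illustration of why a ternary weight can be split “equivalently” into two binary weights. This covers only the quantized values (not the latent full-precision weights the paper also splits), and the `split_ternary` helper and the choice of half the ternary scale for each binary half are assumptions made for the sketch:

```python
import numpy as np


def split_ternary(w_ternary, alpha):
    """Split weights in {-alpha, 0, +alpha} into two tensors in {-alpha/2, +alpha/2}.

    Nonzero entries are shared equally between the halves; zero entries become
    a (+alpha/2, -alpha/2) pair that cancels when the halves are summed.
    """
    half = alpha / 2.0
    sign = np.sign(w_ternary)
    b1 = np.where(sign != 0, sign * half, half)
    b2 = np.where(sign != 0, sign * half, -half)
    return b1, b2


alpha = 0.4
w_t = alpha * np.array([1, 0, -1, 0, 1])
b1, b2 = split_ternary(w_t, alpha)
print(b1, b2, b1 + b2)
```

  • Because the two binary halves sum back to the ternary weights, the doubled-width binary network starts from the same function as the ternary one, which is the sense in which it inherits the ternary model's performance before further fine-tuning.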
Towards Zero-Label Language Learning
  • This paper by Wang et al. from Google in 2021 explores “zero-label” learning in NLP, whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. They show that language models (LMs) are also few-shot generators or example creators (rather than just few-shot learners as in the GPT-3 paper) in that they can be used to generate high-quality synthetic data in a fully unsupervised manner. In other words, they propose that labeled-data generation is easy with prompting, that LMs are great few-shot data generators, and that classic fine-tuning outperforms zero/few-shot prompting by a wide margin.
  • At the core of their framework is a novel approach for better leveraging the powerful pretrained LMs. Specifically, inspired by the recent success of few-shot inference on GPT-3, they present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations.
  • Their method enables zero-label learning as they train task-specific models solely on the synthetic data, yet they achieve better or comparable results to strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, their approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
  • The paper illustrates a promising direction for future transfer learning research in NLP.
  • Key takeaways:
    • Old idea (from OpenAI’s GPT3 paper):
      • Treat LMs as few-shot learners.
      • Create prompts with <sample, label> pair(s).
      • Ask the model to infer the label for a new sample.
      • The emphasis is on the inference.
    • New idea (from Google’s zero-label paper):
      • Treat LMs as few-shot generators (rather than few-shot learners).
      • Create prompts with <sample, label> pair(s).
      • Ask the model to generate more for the same label.
      • The emphasis is on the labelled data generation (rather than inference).
    • Learnings:
      • Old idea created a new wave of prompt programming, i.e. no need for conventional task specific fine-tuning.
      • However, prompting can solve only lower-order tasks, e.g., classification and NLI. Even with lower-order tasks it is not practical because you cannot build a human-in-the-loop system to continually improve the model.
      • The new idea is about generating more data and going with the conventional route.
      • This paper confirms all the above by introducing UDG using LMs, even for complex higher-order tasks and empirically shows classical fine-tuning with more data works better.
  • The diagram below from Prithvi Da summarizes the proposed approach.
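  • The contrast between the two prompting styles in the takeaways above can be sketched as two prompt builders. The templates and function names are hypothetical and follow the takeaways' description rather than the paper's exact prompt format:

```python
def inference_prompt(examples, new_sample):
    """Few-shot learner style: show <sample, label> pairs, ask for a label."""
    blocks = [f"Sample: {sample}\nLabel: {label}" for sample, label in examples]
    blocks.append(f"Sample: {new_sample}\nLabel:")
    return "\n\n".join(blocks)


def generation_prompt(examples, label):
    """Few-shot generator style: show samples of one label, ask for another."""
    blocks = [f"Label: {lab}\nSample: {sample}"
              for sample, lab in examples if lab == label]
    blocks.append(f"Label: {label}\nSample:")
    return "\n\n".join(blocks)


examples = [
    ("A moving, beautifully shot film.", "positive"),
    ("Two hours I will never get back.", "negative"),
    ("The cast is superb.", "positive"),
]
print(inference_prompt(examples, "Flat and forgettable."))
print(generation_prompt(examples, "positive"))
```

  • In the first prompt the LM's completion is the prediction itself. In the second, the completion is a new synthetic training example that can be added to a dataset for conventional fine-tuning.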

Improving Language Models by Retrieving from Trillions of Tokens
  • This paper by Borgeaud et al. from DeepMind in 2021 proposes Retrieval-Enhanced Transformer (RETRO) which enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. With a 2 trillion token database, RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters.
  • After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
  • The figure below from the paper shows the Retro architecture. Left: simplified version where a sequence of length \(n = 12\) is split into \(l = 3\) chunks of size \(m = 4\). For each chunk, we retrieve \(k = 2\) neighbors of \(r = 5\) tokens each. The retrieval pathway is shown on top. Right: Details of the interactions in the CCA operator. Causality is maintained as neighbors of the first chunk only affect the last token of the first chunk and tokens from the second chunk.

  • On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. They also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.
  • RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline pre-trained transformer models can be rapidly fine-tuned (“RETROfit with retrieval”) to obtain nearly the same performance as if trained from scratch.
  • They demonstrate at an unprecedented scale that improving semi-parametric language models through explicit memory can provide an orthogonal, more efficient approach than raw parameter scaling as they seek to build more powerful language models.
  • Related: The Illustrated Retrieval Transformer by Jay Alammar.
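  • A small sketch of the chunking in the architecture figure (\(n = 12\), \(m = 4\)), assuming that a token cross-attends directly to the neighbors of the most recent chunk that is fully contained in its past. The helper names are illustrative, and the index rule reflects one reading of the chunked cross-attention description:

```python
def split_into_chunks(tokens, m):
    """Split a sequence whose length is a multiple of m into chunks of size m."""
    assert len(tokens) % m == 0
    return [tokens[i:i + m] for i in range(0, len(tokens), m)]


def retrieval_chunk_for_position(p, m):
    """Index of the chunk whose neighbors token p cross-attends to, or None.

    Neighbors of chunk u are retrieved from chunk u's tokens, so the earliest
    position allowed to see them is the last token of chunk u (causality).
    """
    if p < m - 1:
        return None
    return (p - (m - 1)) // m


n, m = 12, 4
chunks = split_into_chunks(list(range(n)), m)
visible = [retrieval_chunk_for_position(p, m) for p in range(n)]
print(chunks)
print(visible)
```

  • With these sizes, the neighbors of the first chunk (index 0) are first used at position 3, the last token of that chunk, and then by the opening tokens of the second chunk, which matches the causality remark in the figure caption.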
WebGPT: Browser-assisted question-answering with human feedback
  • This paper by Nakano et al. from OpenAI in 2021 proposes WebGPT, a version of GPT-3 fine-tuned to more accurately answer open-ended questions using a text-based web browser. This allows them to directly optimize answer quality using general methods such as imitation learning and reinforcement learning.
  • Their prototype copies how humans research answers to questions online: it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy.
  • By setting up the task so that it can be performed by humans, they are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
  • They train and evaluate their models on ELI5, a dataset of questions asked by Reddit users. Their best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of their human demonstrators, and 69% of the time to the highest-voted answer from Reddit. While their best model outperforms humans on ELI5, it still struggles with out-of-distribution questions.
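  • Rejection sampling against a reward model amounts to best-of-n selection: sample several candidate answers and keep the one the reward model scores highest. A minimal sketch, in which `toy_reward` is a hypothetical stand-in for a learned reward model:

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate answer with the highest reward-model score."""
    return max(candidates, key=reward_fn)


def toy_reward(answer):
    # Hypothetical stand-in for a learned reward model: counts bracketed
    # citations such as [1]. A real reward model predicts human preferences.
    return answer.count("[")


candidates = [
    "The sky is blue because of scattering.",
    "The sky is blue because of Rayleigh scattering [1].",
    "Rayleigh scattering [1] affects short wavelengths most [2].",
]
best = best_of_n(candidates, toy_reward)
print(best)
```

  • No policy weights are updated in this step; the extra quality comes from spending more inference compute on sampling and scoring.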
The Power of Scale for Parameter-Efficient Prompt Tuning
  • This paper by Lester et al. introduces a simple yet effective method called prompt tuning, which learns soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.
  • Also, prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model.
  • The authors show that prompt tuning outperforms few-shot learning by a large margin, and becomes more competitive with scale.
  • This is an interesting approach that can help to effectively use a single frozen model for multi-task serving.
  • Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pretrained model. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, their tuned prompts would only require 20,480 parameters per task (a reduction of over five orders of magnitude), assuming a prompt length of 5 tokens.
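  • The parameter counts quoted above can be checked with a few lines of arithmetic. The embedding width is not stated in the text; it is inferred here from 20,480 parameters over 5 prompt tokens. The toy input length of 12 tokens is an arbitrary assumption:

```python
import math

import numpy as np

prompt_length = 5
prompt_params = 20_480
embedding_dim = prompt_params // prompt_length  # width implied by the quoted numbers
model_params = 11_000_000_000

reduction = model_params / prompt_params
orders_of_magnitude = math.log10(reduction)
print(embedding_dim, round(reduction), round(orders_of_magnitude, 2))

# The soft prompt is simply prepended to the (frozen) input token embeddings.
soft_prompt = np.zeros((prompt_length, embedding_dim))  # trainable, per task
token_embeddings = np.zeros((12, embedding_dim))        # toy input of 12 tokens
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(model_input.shape)
```

  • Since the prompt is just extra rows in the input, examples from different tasks can share one batch by prepending a different prompt to each example.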

Prefix-Tuning: Optimizing Continuous Prompts for Generation
  • Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task.
  • This paper by Li and Liang from Stanford proposes prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix).
  • Instead of adding a soft prompt to the model input, it prepends trainable parameters to the hidden states of all transformer blocks. During fine-tuning, the LM’s original parameters are kept frozen while the prefix parameters are updated.
  • Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.
  • The figure below from the paper shows that fine-tuning (top) updates all Transformer parameters (the red Transformer box) and requires storing a full model copy for each task. They propose prefix-tuning (bottom), which freezes the Transformer parameters and only optimizes the prefix (the red prefix blocks). Consequently, prefix-tuning only needs to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denotes transformer activations at one time step.

  • They apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. They find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training. A potential hypothesis is that training fewer parameters helped reduce overfitting on smaller target datasets.
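  • A single-head sketch of how tokens can attend to a trainable prefix as if it were virtual tokens: prefix keys and values are concatenated in front of the ones computed from the real tokens. The function names are illustrative, and causal masking, multiple heads, and the paper's reparameterization of the prefix are omitted for brevity:

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_prefix(q, k, v, prefix_k, prefix_v):
    """Scaled dot-product attention over [prefix; tokens] keys and values."""
    k_all = np.concatenate([prefix_k, k], axis=0)  # (P + T, d)
    v_all = np.concatenate([prefix_v, v], axis=0)  # (P + T, d)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])    # (T, P + T)
    return softmax(scores) @ v_all                 # (T, d)


rng = np.random.default_rng(0)
T, P, d = 6, 3, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))  # from the frozen LM
prefix_k = rng.normal(size=(P, d))                     # trainable
prefix_v = rng.normal(size=(P, d))                     # trainable
out = attention_with_prefix(q, k, v, prefix_k, prefix_v)
print(out.shape)
```

  • Only `prefix_k` and `prefix_v` would receive gradient updates; in the full method a separate prefix is learned for every transformer block.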
LoRA: Low-Rank Adaptation of Large Language Models
  • An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.
  • Powerful models with billions of parameters, such as GPT-3, are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains. LoRA proposes to freeze pre-trained model weights and inject trainable layers (rank-decomposition matrices) in each transformer block. This greatly reduces the number of trainable parameters and GPU memory requirements since gradients don’t need to be computed for most model weights. The researchers found that by focusing on the Transformer attention blocks of large language models, fine-tuning quality with LoRA was on par with full model fine-tuning while being much faster and requiring less compute.
  • This paper by Hu et al. from Microsoft in 2021 proposes Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
  • LoRA is a technique where weight updates are designed to be the product of two low-rank matrices. It was inspired by Aghajanyan et al. which showed that, when adapting to a specific task, pre-trained language models have a low intrinsic dimension and can still learn efficiently despite a random projection into a smaller subspace. Thus, LoRA hypothesized that weight updates \(\Delta W\) during adaptation also have low intrinsic rank.
  • Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
  • Similar to prefix tuning, they found that LoRA outperformed several baselines including full fine-tuning. Again, the hypothesis is that LoRA, thanks to its reduced rank, provides implicit regularization. In contrast, full fine-tuning, which updates all weights, could be prone to overfitting.
  • They also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA.
  • They release a package that facilitates the integration of LoRA with PyTorch models and provide their implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2.
  • The figure below from the paper shows LoRA’s reparametrization. They only train \(A\) and \(B\).
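  • A minimal NumPy sketch of the reparametrization, assuming a single weight matrix with update \(\Delta W = BA\), where \(B\) is initialized to zero and \(A\) to small random values. The toy dimensions are assumptions, and the scaling factor applied to \(\Delta W\) in practice is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8                    # toy layer size and LoRA rank

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random init
B = np.zeros((d, r))                     # trainable, zero init so BA = 0 at the start


def lora_forward(x):
    """h = W0 x + B A x; only A and B would receive gradients."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=k)
full_count = W0.size
lora_count = A.size + B.size
print(full_count, lora_count, full_count / lora_count)
```

  • Because \(BA\) has the same shape as \(W_0\), the product can be added into the frozen weight after training, which is why no extra inference latency is incurred.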

Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
  • Prevailing methods for mapping large generative language models to supervised tasks may fail to sufficiently probe models’ novel capabilities. Using GPT-3 as a case study, they show that 0-shot prompts can significantly outperform few-shot prompts. They suggest that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning. This analysis motivates rethinking the role of prompts in controlling and evaluating powerful language models.
  • This paper by Reynolds and McDonell discusses methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language. They explore techniques for exploiting the capacity of narratives and cultural anchors to encode nuanced intentions and techniques for encouraging deconstruction of a problem into components before producing a verdict.
  • Informed by this more encompassing theory of prompt programming, they also introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks. Finally, they discuss how these more general methods of interacting with language models can be incorporated into existing and future benchmarks and practical applications.
Muppet: Massive Multi-task Representations with Pre-Finetuning
  • This paper by Aghajanyan et al. from Meta AI proposes pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning.
  • Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. Pre-finetuning is performed on around 50 classification, summarization, question answering, and common sense reasoning tasks.
  • After pretraining, they prefinetune the model on each of the aforementioned tasks by attaching a task-specific head (MLP) to the output of the [CLS] token for each task that they prefinetune the model on. They use the output of this task-specific head as the model output for that prefinetuning task. While training, the overall loss function used is a convex combination of the seven task-specific losses.
  • They show, in particular, that standard multi-tasking schemes can be unstable and often fail to learn high quality representations. However, they introduce a new training scheme which uses loss scaling and task-heterogeneous batches so that gradient steps are more evenly balanced across multiple different competing tasks, greatly improving training stability and overall performance.
  • Accumulating gradients across tasks (i.e., the concept of “heterogeneous batches”) is important since the model is trying to optimize not a single objective but several potentially competing objectives to create a unified representation across several tasks during model training. During gradient descent, moving along the gradient of a single task may not be the optimal direction for the model to move to learn a single unified representation across tasks. To overcome this, they ensure each batch their model optimizes consists of several tasks. Each worker samples a random batch from its set of tasks and computes a gradient, which is accumulated for the final update. Empirically, they use 64 GPUs for pre-finetuning, resulting in each batch consisting of gradients across 64 sampled tasks. This strategy allows their model to arrive at a better representation for end-task fine-tuning.
  • As pre-finetuning optimizes several different types of tasks and datasets, each having its own output spaces, loss scaling becomes essential to ensure stable training. They attempted various forms of loss-scaling throughout initial experimentation, but the most effective was the novel method as follows.
    • Let us denote \(\mathcal{L}_i\left(x_i, y_i ; \theta\right)\) as the loss for datapoint \(i\) for a model parameterized by \(\theta\). Remember that the loss depends on the type of task (commonsense loss is different from binary classification). Furthermore let \(n: \mathbb{N} \rightarrow \mathbb{N}\) be a function which for each data-point returns the number of predictions \(\mathcal{L}\) operates over. For example, for binary classification, \(n\) would return two, while for generation, \(n\) would return the size of the vocabulary (since they average across loss per token generated). They scale data-point loss so that, if the class distribution and the model’s predictions were both uniform, all of the losses would have equivalent values.
    \[\mathcal{L}_i^{\text {scaled }}\left(x_i, y_i ; \theta\right)=\frac{\mathcal{L}_i\left(x_i, y_i ; \theta\right)}{\log n(i)}\]
    • They found that this static scaling worked incredibly well, outperforming other loss scaling methods in early experimentation.
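    • A short numeric check of the scaling rule, assuming the task loss is a cross-entropy in nats. The helper names and the vocabulary size of 50,000 are illustrative assumptions:

```python
import math


def scaled_loss(loss, n_predictions):
    """Divide a data-point loss by log n, as in the scaling formula above."""
    return loss / math.log(n_predictions)


def uniform_cross_entropy(n_predictions):
    """Cross-entropy of a uniform prediction over n_predictions classes."""
    return -math.log(1.0 / n_predictions)


for n in (2, 4, 50_000):  # binary task, 4-way task, toy vocabulary size
    raw = uniform_cross_entropy(n)
    print(n, raw, scaled_loss(raw, n))
```

    • A uniform guess costs \(\log n\) before scaling, so after dividing by \(\log n\) every task starts from the same value of one, whatever the size of its output space.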
  • They show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, MRC, etc.), while also significantly improving sample efficiency during fine-tuning. They also show that large-scale multi-tasking is crucial; pre-finetuning can hurt performance when few tasks are used up until a critical point (usually above 15) after which performance improves linearly in the number of tasks.
  • The figure below from the paper shows a plot of RoBERTa’s evaluation accuracy on five datasets: RTE, BoolQ, RACE, SQuAD, and MNLI, across various scales of multi-task learning measured in the number of datasets. They notice that, for all but one dataset, performance initially degrades until a critical point is reached in the number of datasets used by the MTL framework. Beyond this critical point, pre-finetuning improves over the original RoBERTa model.

Synthesizer: Rethinking Self-Attention in Transformer Models
  • The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required?
  • This paper by Tay et al. from Google in ICML 2021 investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models.
  • Via extensive experiments, they find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all.
  • To this end, they propose Synthesizer, a model that learns synthetic attention weights without token-token interactions.
  • The figure below from the paper shows the proposed Synthesizer model architecture.

  • In their experiments, they first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks.
  • When composed with dot product attention, they find that Synthesizers consistently outperform Transformers. Moreover, they conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that a simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, they show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
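  • To make "attention weights without token-token interactions" concrete, here is a minimal NumPy sketch of two variants: a random one, whose attention logits are a free matrix that never looks at the input, and a dense one, where each token predicts its own row of logits with a small MLP. It is single-head, has no masking, and the parameter names (`r`, `w1`, `w2`, `w_v`) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_synthesizer(x, r, w_v):
    """Attention weights come from r (seq_len x seq_len), not from x."""
    weights = softmax(r)
    return weights @ (x @ w_v), weights

def dense_synthesizer(x, w1, w2, w_v):
    """Each token predicts its own row of logits with a 2-layer MLP; no query-key product."""
    logits = np.maximum(x @ w1, 0.0) @ w2  # shape (seq_len, seq_len)
    weights = softmax(logits)
    return weights @ (x @ w_v), weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x_a = rng.normal(size=(seq_len, d_model))
x_b = rng.normal(size=(seq_len, d_model))
r = rng.normal(size=(seq_len, seq_len))
w_v = rng.normal(size=(d_model, d_model))
w1 = rng.normal(size=(d_model, d_model))
w2 = rng.normal(size=(d_model, seq_len))

out_a, weights_a = random_synthesizer(x_a, r, w_v)
out_b, weights_b = random_synthesizer(x_b, r, w_v)
dense_out, dense_weights = dense_synthesizer(x_a, w1, w2, w_v)
```

  • Note that `weights_a` and `weights_b` are identical even though the inputs differ: in the random variant only the values depend on the input sequence.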
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
  • The paper by Wang et al. from Salesforce Research Asia and NTU Singapore in EMNLP 2021 proposes CodeT5, a pre-trained encoder-decoder model for code understanding and generation tasks.
  • It builds on the T5 architecture and proposes two novel pre-training objectives:
    • Identifier tagging: Predict whether a token is an identifier.
    • Masked identifier prediction: Recover masked identifiers using sentinel tokens.
  • These objectives enable CodeT5 to leverage the semantic information from identifiers.
  • It also proposes a bimodal dual generation task using code-comment pairs to improve the alignment between natural language (NL) and programming language (PL).
  • CodeT5 significantly outperforms prior work on 6 tasks across 14 datasets in CodeXGLUE.
  • The figure below from the paper illustrates CodeT5 for code-related understanding and generation tasks.

  • The figure below from the paper illustrates the pre-training tasks of CodeT5. They first alternately train span prediction, identifier prediction, and identifier tagging on both unimodal and bimodal data, and then leverage the bimodal data for dual generation training.

  • Ablations show the identifier-aware pre-training and bimodal dual generation are effective. Identifier masking helps capture semantics while span masking focuses on syntax.
  • Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL, allowing for multi-task learning.
  • Overall, CodeT5 incorporates code structure via a novel identifier-aware pre-training and demonstrates strong performance on a diverse set of code intelligence tasks.
  • Code.
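  • The two identifier-aware objectives can be sketched on a toy Python snippet. The example below uses Python's standard `tokenize` module as a stand-in for a language-specific parser, tags every non-keyword name as an identifier, and then replaces each distinct identifier with a sentinel. The `<MASK0>`-style sentinel names and the choice to share one sentinel across all occurrences of an identifier are assumptions of this sketch, not a reproduction of the released code.

```python
import io
import keyword
import tokenize

SKIPPED = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER)

def identifier_tags(source):
    """Return (token, tag) pairs; tag is 1 for identifiers, 0 otherwise."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        is_identifier = tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)
        pairs.append((tok.string, int(is_identifier)))
    return pairs

def mask_identifiers(pairs):
    """Replace each distinct identifier with its own sentinel; return masked tokens and targets."""
    sentinel_for = {}
    masked = []
    for token, tag in pairs:
        if tag:
            if token not in sentinel_for:
                sentinel_for[token] = f"<MASK{len(sentinel_for)}>"
            masked.append(sentinel_for[token])
        else:
            masked.append(token)
    targets = {sentinel: name for name, sentinel in sentinel_for.items()}
    return masked, targets

source = "def add(a, b):\n    return a + b\n"
pairs = identifier_tags(source)
masked_tokens, targets = mask_identifiers(pairs)
```

  • Here `pairs` is the kind of label sequence an identifier-tagging head would be trained on, while `masked_tokens` and `targets` form an input/output pair for masked identifier prediction.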
Large Dual Encoders Are Generalizable Retrievers
  • This paper by Ni et al. from Google Research presents a study on the scalability of dual encoder models for retrieval tasks.
  • The team challenges the belief that the simple dot-product bottleneck in dual encoders limits out-of-domain generalization. They scale up the model size of dual encoders while keeping the bottleneck embedding size fixed.
  • Their approach, Generalizable T5-based dense Retrievers (GTR), significantly outperforms existing sparse and dense retrievers on the BEIR dataset for a variety of retrieval tasks, especially in out-of-domain generalization.
  • GTR utilizes dual encoders by leveraging the encoder part of T5. For effectively using the power of large models, they collect roughly two billion community question-answer pairs as generic pre-training data. By combining pre-training using generic training data and fine-tuning using MS Marco, they are able to train large-scale dual encoder retrieval models.
  • A major finding is that GTR models are data-efficient, requiring only 10% of MS Marco supervised data for optimal out-of-domain performance. The study includes a multi-stage training approach using a combination of web-mined corpus and high-quality search datasets for pre-training and fine-tuning respectively.
  • The figure below from the paper shows the architecture of Generalizable T5-based dense Retrievers. The research question they ask is: can scaling up dual encoder model size improve the retrieval performance while keeping the bottleneck layers fixed? Only the encoder is taken from the pre-trained T5 models, and the question tower and document tower of the dual encoder share parameters.

  • Results show that scaling up model size improves retrieval performance across all evaluated tasks, suggesting that larger models can better capture semantic nuances for effective retrieval.
  • The paper also discusses the data efficiency of large-scale models, demonstrating that GTR models can achieve comparable or superior performance with reduced training data.
  • Additional insights are provided through ablation studies, which highlight the importance of both the model scale and the nature of the training dataset.
  • The paper presents a significant advancement in the field of information retrieval, showcasing the potential of large-scale dual encoder models in improving generalizability and efficiency in retrieval tasks.
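  • The dual-encoder setup with a fixed-size dot-product bottleneck can be sketched as follows. This toy version replaces the T5 encoder with mean pooling over one-hot token embeddings and uses an in-batch softmax loss; the pooling choice, the temperature value, and all names are assumptions for illustration, and only the overall shape (one shared encoder for both towers, dot-product scoring) follows the description above.

```python
import numpy as np

def encode(token_ids, embedding_table):
    """Shared toy encoder: mean-pool token embeddings, then L2-normalize."""
    pooled = embedding_table[token_ids].mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def in_batch_softmax_loss(q_emb, d_emb, temperature=0.1):
    """Document i is the positive for question i; other in-batch documents act as negatives."""
    scores = q_emb @ d_emb.T / temperature
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

vocab_size = 30
embedding_table = np.eye(vocab_size)  # toy one-hot embeddings, not a trained model

questions = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
documents = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]

# The same encoder (same parameters) embeds both questions and documents.
q_emb = np.stack([encode(q, embedding_table) for q in questions])
d_emb = np.stack([encode(d, embedding_table) for d in documents])

scores = q_emb @ d_emb.T          # the dot-product bottleneck
retrieved = scores.argmax(axis=1)  # best-scoring document per question
loss = in_batch_softmax_loss(q_emb, d_emb)
```

  • Scaling the model in this picture means making `encode` more powerful while the size of the vectors entering `q_emb @ d_emb.T` stays fixed.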


Formal Mathematics Statement Curriculum Learning
  • This paper by Polu et al. from OpenAI in 2022 proposes a neural theorem prover using GPT-f that can successfully solve a curriculum of increasingly difficult problems out of a set of formal statements of sufficiently varied difficulty, including many high-school Math Olympiad problems. The prover uses a language model to find proofs of formal statements.
  • They explore the use of expert iteration in the context of language modeling applied to formal mathematics. They show that at the same compute budget, expert iteration, by which they mean proof search interleaved with learning, dramatically outperforms proof search only. They also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.
  • Finally, by applying this expert iteration to a manually curated set of problem statements, they achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
  • Their results suggest that the lack of self-play in the formal mathematics setup can be effectively compensated for by automatically as well as manually curated sets of formal statements, which are much cheaper to formalize than full proofs. The statement curriculum learning methodology presented in this work can help accelerate progress in automated reasoning, especially if scaled with automated generation and curation of formal statements in the future.
  • OpenAI link.
Survey of Hallucination in Natural Language Generation
  • While natural language generation (NLG) has improved exponentially in recent years thanks to the development of deep learning technologies such as Transformer-based language models, NLG based on large language models (LLMs) often produces false statements that are disconnected from reality because such models are not grounded in reality. Such generation includes hallucinated texts, which cause text generation performance to fall short of users’ expectations in many real-world scenarios owing to the lack of commonsense built from experiencing the real world.
  • This paper by Ji et al. from Pascale Fung’s group at Hong Kong University of Science and Technology in 2022 reviews studies in evaluation and mitigation methods of hallucinations that have been presented in various tasks.
  • They provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation.
Transformer Quality in Linear Time
  • This paper by Hua et al. from Cornell University and Google Brain in 2022 revisits the design choices in Transformers and proposes methods to address their weaknesses in handling long sequences by presenting FLASH, a novel efficient modification of the Transformer architecture. This is achieved by designing a performant layer (gated attention unit) and by combining it with an accelerator-efficient approximation strategy (mixed chunk attention).
  • Existing efficient attention methods often cause significant quality drop compared to full self-attention. At the same time they might be difficult to implement to fully leverage hardware accelerators. The authors introduce GAU (gated attention unit; a generalization of GLU - gated linear unit) that allows for better and more efficient approximation of multi-head attention than many other efficient attention methods by using a weaker single-head attention with minimal quality loss.
  • Next, complementary to this new layer, they propose mixed chunk attention - an efficient linear approximation method that combines the benefits from partial and linear attention mechanisms, which is accelerator-friendly and highly competitive in quality. The method works on chunks of tokens and leverages local (within chunk) and global (between chunks) attention spans.
  • The resulting model, named FLASH, when deployed on bidirectional and auto-regressive language modeling tasks, outperforms three baselines: vanilla Transformer, Performer and Combiner in terms of quality and efficiency. FLASH matches the quality (perplexity) of fully-augmented Transformers over both short (512) and long (8K) context lengths, while being substantially faster to train than the state-of-the-art - achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling. The differences are particularly pronounced for larger context sizes (4096-8192).
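  • The idea of mixing local (within-chunk) and global (between-chunk) attention can be sketched in NumPy. This is a simplified, non-causal illustration: it omits the gating of the GAU layer, reuses the same queries and keys for both parts, and picks its own normalization constants, all of which are assumptions of the sketch rather than details taken from the paper.

```python
import numpy as np

def relu_squared(x):
    return np.maximum(x, 0.0) ** 2

def mixed_chunk_attention(q, k, v, chunk_size):
    """Non-causal sketch: exact attention inside each chunk plus linear attention over all chunks."""
    n, d = q.shape
    assert n % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    g = n // chunk_size
    qc = q.reshape(g, chunk_size, d)
    kc = k.reshape(g, chunk_size, d)
    vc = v.reshape(g, chunk_size, -1)

    # Local part: quadratic attention restricted to each chunk (cost grows as n * chunk_size).
    local_scores = relu_squared(qc @ kc.transpose(0, 2, 1) / chunk_size)
    local = local_scores @ vc

    # Global part: summarize all keys/values once, then reuse the summary for every query.
    kv_summary = k.T @ v / n
    global_part = qc @ kv_summary

    return (local + global_part).reshape(n, -1), global_part.reshape(n, -1)

rng = np.random.default_rng(0)
n, d = 8, 4
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

out, global_part = mixed_chunk_attention(q, k, v, chunk_size=4)
quadratic_reference = (q @ k.T) @ v / n  # same global term, computed the expensive way
```

  • The global term matches `quadratic_reference` because matrix multiplication is associative; computing `k.T @ v` first is what avoids forming the full n-by-n score matrix.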
Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as arithmetic reasoning, math word problems, symbolic manipulation, and commonsense reasoning.
  • This paper by Wei et al. from Google in 2022 explores the ability of language models to generate a coherent chain of thought – a series of short sentences that mimic the reasoning process a person might have when responding to a question. The idea is strikingly simple: instead of being terse while prompting, show the model a few examples of a multi-step reasoning process (the likes of which a human would use). Couple this with LLMs (the larger the better) and magic happens! Check out the below image from the paper.

  • They have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. The superb results you can elicit via this method are an emergent property of model scale (surprise surprise) - bigger models benefit more from this, and the conclusion holds across models (LaMDA, GPT, PaLM).
  • Interestingly enough, the more complex the task of interest is (in the sense of requiring multi-step reasoning approach), the bigger the boost from the chain of thought prompting!
  • In order to make sure that the performance boost comes from this multi-step approach and not simply because of e.g. more compute, the authors have done a couple of ablations: (i) outputting a terse equation instead of a multi-step reasoning description, (ii) outputting the answer and only then the chain of thought, etc. None of these experiments yielded good results.
  • The method also proved to be fairly robust (always outperforms standard prompting) to the choice of exact few shot exemplars. Despite different annotators, different styles, etc. the method is always better than standard prompting.
  • Through experiments on arithmetic, symbolic, and commonsense reasoning, they find that chain of thought reasoning is an emergent property of model scale that can be induced via prompting and can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
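  • The difference between standard and chain of thought prompting is easiest to see in the prompt text itself. The sketch below builds both variants from a single exemplar patterned after the tennis-ball example associated with the paper; the `Q:`/`A:` layout, the helper name `build_prompt`, and the use of only one exemplar are assumptions for illustration, and no model is called.

```python
EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
EXEMPLAR_RATIONALE = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11."
)
EXEMPLAR_ANSWER = "The answer is 11."

def build_prompt(question, chain_of_thought):
    """Few-shot prompt with one exemplar; optionally include the worked reasoning."""
    if chain_of_thought:
        exemplar_answer = f"{EXEMPLAR_RATIONALE} {EXEMPLAR_ANSWER}"
    else:
        exemplar_answer = EXEMPLAR_ANSWER
    return (
        f"Q: {EXEMPLAR_QUESTION}\nA: {exemplar_answer}\n\n"
        f"Q: {question}\nA:"
    )

new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"
)
standard_prompt = build_prompt(new_question, chain_of_thought=False)
cot_prompt = build_prompt(new_question, chain_of_thought=True)
```

  • The only change between the two prompts is that the exemplar answer in `cot_prompt` spells out the intermediate steps before stating the result, which is what nudges a sufficiently large model to do the same for the new question.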
PaLM: Scaling Language Modeling with Pathways
  • This paper by Chowdhery et al. from Google in 2022 introduces Pathways Language Model (PaLM), a single 540 billion parameter dense Transformer language model, trained on 780B tokens of high-quality, diverse text, that generalizes across domains and tasks while being highly efficient. PaLM pushes the boundaries of scale for few-shot language understanding and generation.
  • Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application.
  • To further their understanding of the impact of scale on few-shot learning, they trained a 540-billion parameter, densely activated, Transformer language model, which they call Pathways Language Model (PaLM). They trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. They demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
  • On a number of these tasks, PaLM 540B achieves breakthrough few-shot performance on language, reasoning, and code tasks, achieving state-of-the-art results on 28 out of the 29 most widely evaluated English NLP tasks when compared to the best finetuned per-task result from any previous large language model. Their evaluation suite consists of multi-step reasoning tasks, and comparisons to average human performance on the recently released BIG-bench benchmark.
  • Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference. Their few-shot results match or exceed the finetuned state of the art across a number of different arithmetic and commonsense reasoning tasks. The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting, where the model is explicitly prompted to generate a natural language logical inference chain before making its prediction. They present a number of intriguing examples where PaLM was able to write explicit logical inference chains to both explain jokes and answer complex questions about scenarios. On BIG-bench, a recently developed benchmark containing 150+ challenging new language tasks, PaLM 5-shot achieves higher performance than the average performance score of humans who were asked to complete the same tasks. Additional state-of-the-art performance is demonstrated on source code understanding/generation, multilingual NLP, and machine translation.
  • From these results, they draw a number of conclusions.
    • First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued. When they compare results from PaLM 540B to their own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that they have not yet reached the apex point of the scaling curve. However, on a number of BIG-bench tasks the improvements from model scale are actually discontinuous, meaning that the improvements from 8B to 62B are very modest, but then steeply increase when scaling to 540B. This suggests that certain capabilities of language models only emerge when trained at sufficient scale, and there are additional capabilities that could emerge from future generations of models.
    • Second, the breakthrough performance on reasoning tasks has critical implications. It is obvious that a model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction. However, these results go far beyond that, demonstrating that prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves. In other words, the model’s generation (rather than just understanding) capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
  • Finally, although they achieved their goal of further pushing the boundaries of scale for few-shot language modeling, there are still many open questions about the ideal network architecture and training scheme for future generations of models. PaLM is only the first step in their vision towards establishing Pathways as the future of ML scaling at Google and beyond. To that end, they chose to demonstrate this scaling capability on a well-studied, well-established recipe: a dense, decoder-only, full-attention Transformer model, which is trained to perform autoregressive language modeling. However, their wider goal is to explore a diverse array of novel architectural choices and training schemes, and combine the most promising systems with the extreme scaling capabilities of Pathways.
  • They believe that PaLM demonstrates a strong foundation in their ultimate goal of developing a large-scale, modularized system that will have broad generalization capabilities across multiple modalities.
  • They additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale.
  • Finally, they discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
  • Google AI blog.

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
  • When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models.
  • This paper by Lu et al. in 2022 demonstrates that few-shot prompts suffer from order sensitivity, in that for the same prompt the order in which samples are provided can make the difference between state-of-the-art and random performance – essentially some permutations are “fantastic” and some not.
  • They analyze this phenomenon in detail, establishing that the problem is prevalent across tasks, model sizes (even for the largest current models), prompt templates, and numbers of training samples; that it is not related to a specific subset of samples; and that a given good permutation for one model is not transferable to another.
  • While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, to alleviate this problem, they introduce a novel probing method that exploits the generative nature of language models to construct an artificial development set. They identify performant permutations for prompts using entropy-based statistics over this set, which yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
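  • The entropy-based selection can be sketched as a ranking over permutations. In the example below, each ordering of the few-shot examples is scored by the entropy of the labels it predicts over a probing set, and orderings that collapse onto a single label rank last. The `toy_predict` function is a hypothetical stand-in for a language model with a recency bias, and the integer probe inputs are placeholders for the model-generated probing set; neither comes from the paper.

```python
import math
from itertools import permutations

def global_entropy(predicted_labels, num_labels):
    """Entropy of the predicted-label histogram over the probing set."""
    counts = [0] * num_labels
    for label in predicted_labels:
        counts[label] += 1
    total = len(predicted_labels)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def rank_orderings(examples, probe_inputs, predict, num_labels):
    """Score every ordering of the few-shot examples; higher entropy ranks first."""
    scored = []
    for order in permutations(examples):
        labels = [predict(order, x) for x in probe_inputs]
        scored.append((global_entropy(labels, num_labels), order))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

def toy_predict(order, probe_input):
    """Hypothetical model: if the last example is positive, it always predicts label 1."""
    if order[-1][1] == 1:
        return 1
    return probe_input % 2

examples = [("great movie", 1), ("awful plot", 0), ("loved it", 1)]
probe_inputs = list(range(8))
ranking = rank_orderings(examples, probe_inputs, toy_predict, num_labels=2)
best_entropy, best_order = ranking[0]
worst_entropy, worst_order = ranking[-1]
```

  • No gold labels are needed for the probing set: the statistic only asks whether an ordering makes the model's predictions degenerate, which is why it stays within the few-shot setting.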
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  • Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital to understand the present and near-future capabilities and limitations of language models.
  • This paper by Srivastava et al. from Google in 2022 addresses this challenge by introducing the Beyond the Imitation Game benchmark (BIG-bench), a benchmark that can measure progress well beyond the current state-of-the-art. BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions.
  • Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. They evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters.
  • In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
  • Code.
Training Compute-Optimal Large Language Models
  • Previous work in training LLMs offered a heuristic that given a 10x increase in computational budget, model size should increase 5.5x, and the number of tokens should only increase 1.8x.
  • This paper by Hoffmann et al. from DeepMind in 2022 challenges that assumption and shows that model size and data size should increase in equal proportion. Thus collecting high-quality datasets will play a key role in further scaling of LLMs. They investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
  • They find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
  • By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
  • They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data.
  • Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
  • This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
  • In particular, they propose that 10x more compute should be spent on a 3.2x larger model and 3.2x more tokens (vs. OpenAI’s Scaling Laws paper, which suggests 10x more compute should be spent on a 5.5x larger model and 1.8x more tokens).
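  • The arithmetic behind these numbers can be checked in a few lines. The sketch below assumes training compute is roughly proportional to the product of parameter count and token count (the common 6 * N * D rule of thumb) and that a compute increase is split with an exponent of 0.5 on each side; the function names are illustrative.

```python
def training_flops(params, tokens):
    """Rule-of-thumb estimate of training compute (an assumption of this sketch)."""
    return 6 * params * tokens

def compute_optimal_scaling(compute_multiplier, param_exponent=0.5):
    """Split a compute increase between model size and data; exponent 0.5 grows both equally."""
    param_factor = compute_multiplier ** param_exponent
    token_factor = compute_multiplier / param_factor
    return param_factor, token_factor

# 10x more compute: both factors come out as sqrt(10), about 3.2.
param_factor, token_factor = compute_optimal_scaling(10)

# Trading a 4x smaller model for 4x more tokens leaves the estimated compute unchanged.
gopher_params = 280e9
chinchilla_params = 70e9
baseline_tokens = 1.0  # arbitrary unit; only the ratio matters here
same_compute = training_flops(chinchilla_params, 4 * baseline_tokens) / training_flops(
    gopher_params, baseline_tokens
)
```

  • Under the same approximation, this is why a 70B model trained on 4x more data can sit at the same compute budget as the 280B Gopher.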
Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
  • The recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Along with natural language abilities, there has been a significant interest in understanding whether such models, trained on enormous amounts of data, exhibit reasoning capabilities. Hence there has been interest in developing benchmarks for various reasoning tasks and the preliminary results from testing LLMs over such benchmarks seem mostly positive. However, the current benchmarks are relatively simplistic and the performance over these benchmarks cannot be used as evidence to support the often outlandish claims being made about LLMs’ reasoning capabilities. As of right now, these benchmarks only represent a very limited set of simple reasoning tasks, and more sophisticated reasoning problems need to be examined in order to measure the true limits of such LLM-based systems.
  • This paper by Valmeekam et al. from ASU in 2022 proposes an extensible assessment framework motivated by the above gaps in current benchmarks to test the abilities of LLMs on a central aspect of human intelligence, which is reasoning about actions and change.
  • They provide multiple test cases that are more involved than any of the previously established reasoning benchmarks and each test case evaluates a certain aspect of reasoning about actions and change. Their initial results show that even on simple common-sense planning tasks, the base version of GPT-3 (Davinci) displays dismal performance.
OPT: Open Pre-trained Transformer Language Models
  • Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study.
  • This paper by Zhang et al. from Facebook AI introduces Open Pre-trained Transformers (OPT), a collection of auto-regressive/decoder-only pre-trained transformer-based language models ranging in size from 125M to 175B parameters, which they aim to fully and responsibly share with interested researchers.
  • Their goal is to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency.
  • They show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. They also release their logbook detailing the infrastructure challenges they faced, along with code for experimenting with all of the released models.
  • They believe that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.
  • Code.
Diffusion-LM Improves Controllable Text Generation
  • The following paper summary has been contributed by Zhibo Zhang.
  • Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure).
  • This paper by Li et al. from Stanford in NeurIPS 2022 seeks to address this challenge by introducing Diffusion-LM, a novel non-autoregressive language model based on continuous diffusion for controllable text generation, which enables new forms of complex fine-grained control tasks.
  • Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks.
  • Considering the discrete nature of text, the authors add an extra step on top of the Markov chain of standard diffusion models. As shown in the illustration figure, in the forward diffusion process, this extra step (embedding) is responsible for converting text into numerical embeddings. In the reverse process, this extra step (rounding) maps the continuous vectors back into text.
  • In order for the model to generate a vector that closely aligns with a word embedding in the reverse process, the authors did a re-parameterization such that the model directly predicts the word embedding state of the Markov chain at each term of the loss function.
  • In order to make the text generation process controllable, under a particular control objective, the conditional inference at each state of the Markov chain is decomposed into two parts:
    • The Markov transition probability between the latent variables of two consecutive time steps, which is used as fluency regularization.
    • The probability of the control objective given the latent variable of the current time step, which is used for controlling the text generation.
  • Empirically, the authors validated that Diffusion-LM significantly outperforms prior work by almost doubling the control success rate compared to the PPLM (Dathathri et al., 2020) and FUDGE (Yang et al., 2021) baselines on six fine-grained control tasks: Semantic Content, Parts-of-speech, Syntax Tree, Syntax Spans, Length, and Infilling.
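  • The extra embedding and rounding steps that bridge discrete text and the continuous diffusion space can be sketched with NumPy. In this toy version the embedding table is one-hot, the forward process is a single Gaussian corruption, and rounding picks the nearest embedding by squared distance; these simplifications and all names are assumptions of the sketch, and the learned denoiser and the control gradient are left out entirely.

```python
import numpy as np

def embed(token_ids, embedding_table):
    """Embedding step: map discrete tokens to continuous vectors."""
    return embedding_table[token_ids]

def add_noise(x, noise_std, rng):
    """One Gaussian corruption step (a simplified stand-in for the forward diffusion process)."""
    return x + noise_std * rng.normal(size=x.shape)

def round_to_tokens(x, embedding_table):
    """Rounding step: map each vector back to the nearest word embedding."""
    distances = ((x[:, None, :] - embedding_table[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=-1)

rng = np.random.default_rng(0)
vocab_size = 6
embedding_table = np.eye(vocab_size)  # toy, well-separated embeddings
token_ids = np.array([3, 1, 4, 1, 5])

x0 = embed(token_ids, embedding_table)
x_noisy = add_noise(x0, noise_std=0.05, rng=rng)
recovered = round_to_tokens(x_noisy, embedding_table)
```

  • Rounding only works when the vector produced at the end of the reverse process lies close to some word embedding, which is the motivation for the re-parameterization described above.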

DeepPERF: A Deep Learning-Based Approach For Improving Software Performance
  • Performance bugs may not cause system failure and may depend on user input, so detecting them can be challenging. They also tend to be harder to fix than non-performance bugs.
  • In recent years, a variety of performance bug detection approaches have emerged to help developers identify performance issues. However, a majority of existing performance bug detection approaches focus on specific types of performance problems and rely on expert-written algorithms or pre-defined set of rules to detect and fix issues. Building rule-based analyzers is a non-trivial task, as it requires achieving the right balance between precision and recall. Once developed, maintaining these rules can also be costly.
  • Transformer-based approaches have been shown to achieve state-of-the-art performance, not only in various NLP problems, but also in a variety of software engineering tasks such as code-completion, documentation generation, unit test generation, bug detection, etc. In this paper, the authors present an approach called DeepPERF that uses a large transformer model to suggest changes at application source code level to improve its performance. The authors first pretrain the model using masked language modelling (MLM) tasks on English text and source code taken from open source repositories on GitHub, followed by finetuning on millions of performance commits made by .NET developers.
  • This paper by Garg et al. from Microsoft in 2022 shows that their approach is able to recommend patches to provide a wide range of performance optimizations in C# applications. Most suggested changes involve modifications to high-level constructs like API/Data Structure usages or other algorithmic changes, often spanning multiple methods, which cannot be optimized away automatically by the C# compiler and could, therefore, lead to slow-downs on the user’s side.
  • Their evaluation shows that the model can generate the same performance improvement suggestion as the developer fix in ∼53% of the cases, getting ∼34% of them verbatim in their expert-verified dataset of performance changes made by C# developers. Additionally, the authors evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that the model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations.
No Language Left Behind: Scaling Human-Centered Machine Translation
  • Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages.
  • This paper by Costa-jussà et al. from Meta AI in 2022 explores what it takes to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind. In No Language Left Behind, they take on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers.
  • Furthermore, they created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, they developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages.
  • They propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, they evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
  • Their model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.
  • Facebook AI article; Code
  • They tackle three major tasks:
    • Automatic dataset construction for low-resource languages: They’ve solved this by investing in a teacher-student training procedure, making it possible to 1) extend LASER’s language coverage to 200 languages, and 2) produce a massive amount of data, even for low resource languages.
      • Specifically, to scale one model to hundreds of languages, as the first step, they built an appropriate data set. Meta created an initial model able to detect languages automatically, which they call their language identification system.
      • It then uses another Transformer-based language model to find sentence pairs in all the scraped data. These two models are only used to build the paired datasets covering 200 languages that are needed to train the final translation model, NLLB200.
    • Modeling 200 languages: They’ve developed a Sparse Mixture-of-Experts model that has a shared and specialized capacity, so low-resource languages without much data can be automatically routed to the shared capacity. When combined with better regularization systems, this avoids overfitting. Further, they used self-supervised learning and large-scale data augmentation through multiple types of back translation.
      • Specifically, the multi-language translation model is a Transformer based encoder-decoder architecture. This implies NLLB200 takes a text sentence, encodes it and then decodes it to produce a new text sentence, a translated version of the input.
      • What’s new is the set of modifications they’ve made so that the model scales to so many different languages instead of being limited to one. The first modification is adding a variable identifying the source language of the input, taken from the language identification system described above. This helps the encoder do a better job for the current input language. Then, they do the same thing with the decoder, telling it which language to translate to. Note that this conditioned encoding scheme is very similar to CLIP, which encodes images and text similarly. Here, in ideal conditions, it will encode a sentence similarly whatever the language.
      • They use Sparsely Gated Mixture of Experts models to achieve a better trade-off between cross-lingual transfer and interference and to improve performance for low-resource languages. Sparsely Gated Mixture of Experts models are basically regular models that activate only a subset of model parameters per input, instead of involving most, if not all, parameters every time. You can easily see how this kind of model suits this application. The Mixture of Experts is simply an extra step added in the Transformer architecture for both the encoder and decoder, replacing the feed-forward network sublayer with \(N\) feed-forward networks, each with input and output projections, and the Transformer model automatically learns which subnetwork to use for each language during training.
    • Evaluating translation quality: They’ve extended the coverage of FLORES, a human-translated evaluation benchmark, by 2x to now cover 200 languages. Through automatic metrics and human evaluation support, they are able to extensively quantify the quality of their translations.
Efficient Few-Shot Learning Without Prompts
  • Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are subject to high variability from manually crafted prompts, and typically require billion-parameter language models to achieve high accuracy.
  • This paper by Tunstall et al. from Hugging Face, Cohere, TU Darmstadt, and Intel Labs in 2022 addresses these shortcomings by proposing SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of text pairs, in a contrastive Siamese manner.
  • The resulting model is then used to generate rich text embeddings, which are used to train a classification head. Compared to other few-shot learning methods, SetFit has several unique features:
    • No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that’s suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
    • Fast to train: SetFit doesn’t require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
    • Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
    • Achieves high accuracy: SetFit achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. This is accomplished with orders of magnitude fewer parameters than existing techniques.
  • Their experiments show that SetFit obtains comparable results with PEFT and PET techniques, while being an order of magnitude faster to train. They also show that SetFit can be applied in multilingual settings by simply switching the ST body.
  • Code.
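  • The contrastive fine-tuning step above needs text pairs built from the few labeled examples. The following is a minimal sketch of that pairing idea only, not the SetFit library's API: `make_contrastive_pairs` is a hypothetical helper, and it enumerates every pair exhaustively, whereas the paper samples a fixed number of positive and negative pairs.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Return (text_a, text_b, target) triples for Siamese fine-tuning.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Assumption: exhaustive pairing; a real pipeline would sample pairs.
    """
    labeled = list(zip(texts, labels))
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(labeled, 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Illustrative few-shot data (made up for this sketch).
few_shot_texts = ["great phone", "love it", "broke in a day", "waste of money"]
few_shot_labels = ["pos", "pos", "neg", "neg"]
pairs = make_contrastive_pairs(few_shot_texts, few_shot_labels)
```

  • Even four labeled examples yield six training pairs, which is why pair-based fine-tuning can work with so little labeled data.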

Large language models are different
  • The following summary has been contributed by Zhibo Zhang.
  • Large language models have shown promising capability in various natural language tasks in recent years. This presentation by Wei from Google Brain in 2022 covers some of the recent works for large language models.
  • The motivation behind large language models is clear: It is ideal to have pre-trained models that can easily generalize to different downstream tasks rather than training a new model for each different task that will require a new dataset. In addition, pre-trained large language models only require a few labeled examples to learn from when it comes to a new task.
  • Training a large language model and doing inference on it typically contains the following components:
    • Pre-train a language model on a massive amount of data. The model size nowadays is huge, such as GPT-3 which contains 175 billion parameters (Brown et al., 2020) and PaLM which contains 540 billion parameters (Chowdhery et al., 2022). An important property of large language models is the emergent ability. That is, the performance of the model grows from near-random to well above-random after the model size reaches a certain threshold (Wei et al., 2022).
    • Perform in-context learning with a few examples. This step is typically done through prompting techniques, where a few example natural language tasks are provided in the form of input-label pairs, and the machine is expected to generalize the learning outcome to predict the label for an unseen input. Notice that the term “learning” here does not involve any optimization step of the model parameters.
  • Researchers have been trying to understand the property of prompting. In particular, Zhao et al., 2020 discusses three major biases introduced by the natural language prompts during in-context learning:
    • The majority label bias: the predictions largely depend on the majority label in the prompts.
    • The recency bias: the labels near the end of the prompts affect the predictions more.
    • Common token bias: the predictions are more likely to be high frequency words in the n-gram model.
  • The authors of the paper proposed to use affine transformation to calibrate the probabilistic output of the model for each specific prediction, named contextual calibration.
  • Min et al., 2022 pointed out that whether the demonstration prompts have the correct labels or not does not significantly affect the prediction. The input text distribution, the label space and the input-label pairing format have a larger impact on the predictions.
  • The speaker also mentioned other prompting techniques, such as chain-of-thought prompting (Wei et al., 2022).
  • In addition to prompting, Wei et al., 2021 shows that fine tuning language models on different datasets through instructions can improve the model performance when there are no demonstrations given for downstream tasks.
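  • As a rough illustration of the contextual calibration idea mentioned above, the sketch below applies an affine map \(W p + b\) to the label probabilities, with \(W\) chosen as the inverse of the model's probabilities on a content-free input (such as “N/A”) and \(b = 0\). The function name, the example numbers, and the final renormalization step are assumptions made for this sketch.

```python
def contextual_calibration(label_probs, content_free_probs):
    """Rescale label probabilities by the model's bias on a content-free input.

    Implements W = diag(content_free_probs)^-1 and b = 0, then renormalizes
    so the scores sum to one (the renormalization is an assumption here).
    """
    scores = [p / p_cf for p, p_cf in zip(label_probs, content_free_probs)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical numbers: the prompt biases the model towards label 0,
# which shows up as a skewed prediction on the content-free input.
raw = [0.7, 0.3]
content_free = [0.8, 0.2]
calibrated = contextual_calibration(raw, content_free)
```

  • In this made-up example the raw prediction favors label 0, but after dividing out the prompt-induced bias the calibrated prediction favors label 1.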
Solving Quantitative Reasoning Problems with Language Models
  • The following paper summary has been contributed by Zhibo Zhang.
  • Solving Quantitative Reasoning Problems with Language Models by Lewkowycz et al. from Google Research in NeurIPS 2022 introduces Minerva, a language model based on PaLM (Chowdhery et al., 2022) to solve quantitative reasoning problems.
  • Specifically, the authors used the pre-trained PaLM models with 8 billion, 62 billion and 540 billion parameters and fine-tuned them on a technical training dataset composed of web pages with mathematical content, arXiv papers and general natural language data.
  • At the inference stage, the authors utilized the following techniques to boost the performance of the model:
    • Selecting the most common answer based on a total of \(k\) sampled solutions.
    • Prompting the model with 4 examples when evaluating on the MATH dataset (Hendrycks et al., 2021) and with 5 examples when evaluating on the STEM (science, technology, engineering and mathematics) subset of the MMLU dataset (Hendrycks et al., 2021).
    • Chain-of-thought prompting when evaluating on the GSM8k dataset (Cobbe et al., 2021) and the subset of the MMLU dataset.
  • Empirically, under the same model scale, Minerva consistently outperformed the PaLM model on the evaluation datasets according to the paper. In addition, Minerva with 62 billion parameters and 540 billion parameters outperformed both OpenAI davinci-002 and published state-of-the-art on the MATH dataset.
  • Through additional validation, the authors concluded that there is little evidence that memorization contributes to the model performance.
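  • The first inference-time technique above, picking the most common answer among \(k\) sampled solutions, can be sketched as follows. This assumes the final answers have already been extracted from each sampled solution as strings; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent final answer among the sampled solutions."""
    counts = Counter(final_answers)
    answer, _count = counts.most_common(1)[0]
    return answer


# Final answers extracted from k = 5 sampled solutions (made-up values).
sampled_answers = ["12", "15", "12", "7", "12"]
selected = majority_vote(sampled_answers)
```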
AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
  • The following paper summary has been contributed by Zhibo Zhang.
  • AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning by Yang et al. from Sun Yat-sen University and Meta AI in NeurIPS 2022 proposes AD-DROP, an attribution-based dropout mechanism for self-attention modules.
  • The authors propose to generate attribution scores based on existing input gradient explanation methods. In particular, the attribution scores are generated for the attention map of each attention head with respect to the output logit of the Transformer for a particular class.
  • Following the above attribution methods, the authors empirically observed that dropping neurons with low attribution scores will lead to a larger degree of overfitting compared to random dropping, and dropping neurons with high attribution scores increases training loss but alleviates the overfitting problem.
  • Based on the above empirical finding, the authors proposed AD-DROP, as indicated in the illustration figure (below) from the paper: the attribution matrices are generated for the self-attention maps based on the logits from the forward pass. The mask matrices (that contain information about which position to drop) are then produced relying on the attribution scores and sampling. As a last step, an element-wise addition operation between the mask matrices and the original self-attention maps is done to produce the masked self-attention maps, which are then used to perform the forward propagation.
  • In addition, the authors proposed a cross-tuning algorithm that alternates between optimization without dropout (at odd-numbered epochs) and optimization with AD-DROP (at even-numbered epochs) during the training process.
  • The authors conducted experiments on eight tasks of the GLUE benchmark (Wang et al., 2019) using BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models as the base, observing that AD-DROP had the best average performance compared to several other regularization methods.
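  • The masking step described above can be sketched for a single row of attention logits. This is a simplified illustration, not the authors' implementation: the attribution scores are taken as given, the parameter names `p` (fraction of high-attribution candidate positions) and `q` (probability of dropping a candidate) are assumptions, and the mask uses \(-\infty\) so that dropped positions receive zero weight after the softmax.

```python
import math
import random


def ad_drop_row(attn_logits, attributions, p=0.3, q=0.5, rng=None):
    """Drop high-attribution positions from one row of attention logits."""
    rng = rng or random.Random(0)
    n = len(attn_logits)
    # Keep at least one position unmasked so the softmax stays defined.
    num_candidates = min(int(n * p), n - 1)
    by_attribution = sorted(range(n), key=lambda j: attributions[j], reverse=True)
    candidates = by_attribution[:num_candidates]

    # Mask: -inf at dropped positions, 0 elsewhere.
    mask = [0.0] * n
    for j in candidates:
        if rng.random() < q:
            mask[j] = -math.inf

    # Element-wise addition of mask and logits, followed by a softmax.
    masked = [a + m for a, m in zip(attn_logits, mask)]
    largest = max(masked)
    exps = [math.exp(v - largest) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


# With q=1.0 every candidate is dropped, so positions 0 and 2 get zero weight.
probs = ad_drop_row([1.0, 2.0, 3.0, 4.0], [0.9, 0.1, 0.8, 0.2], p=0.5, q=1.0)
```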

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
  • The following paper summary has been contributed by Zhibo Zhang.
  • In-Context Learning has been an effective strategy for adapting a pre-trained large language model to a new task by showing the model a few input-label pairs through prompts. How In-Context Learning works remains an active research topic.
  • This paper by Dai et al. from Peking University, Tsinghua University and Microsoft Research in 2022 proposes that In-Context Learning can be understood as performing implicit fine-tuning on the Transformer models.
  • In particular, Aizerman et al. and Irie et al. pointed out that linear attention is a dual form of linear layers with gradient descent optimization.
  • Based on the above finding and through relaxing the standard attention into linear attention, the authors demonstrate that it is possible to express the attention outcome as a linear expression of any new input query, where the weight matrix can be decomposed into two parts: the part based on the pre-trained model and the updates of the former part due to prompt demonstrations.
  • Empirically, the authors compared models adapted by fine-tuning with models adapted by In-Context Learning on 6 datasets, observing similarities between the two in terms of prediction capability, updates of the attention output (where the pre-trained model is used as a baseline when calculating the updates) as well as attention maps.
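  • The decomposition described above can be checked numerically on a toy example. The sketch below uses unnormalized linear attention and hypothetical names (`w_zero_shot`, `delta_w_icl`); keys and values are given directly rather than produced by projection matrices, which is a simplification of the paper's setup.

```python
def linear_attention(values, keys, query):
    """Unnormalized linear attention: sum_i v_i * (k_i . q)."""
    out = [0.0] * len(values[0])
    for v, k in zip(values, keys):
        score = sum(ki * qi for ki, qi in zip(k, query))
        out = [o + vi * score for o, vi in zip(out, v)]
    return out


def sum_outer(values, keys):
    """Weight matrix sum_i v_i k_i^T built from value/key pairs."""
    total = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for v, k in zip(values, keys):
        for r, vr in enumerate(v):
            for c, kc in enumerate(k):
                total[r][c] += vr * kc
    return total


def mat_vec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


# Toy tokens: one token from the query text and two demonstration tokens.
text_values, text_keys = [[1.0, 0.0]], [[0.5, 1.0]]
demo_values, demo_keys = [[0.0, 2.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 3.0]

# Attention over demonstrations plus query text ...
attended = linear_attention(text_values + demo_values, text_keys + demo_keys, query)

# ... equals (W_zero_shot + delta_W_icl) applied to the query.
w_zero_shot = sum_outer(text_values, text_keys)
delta_w_icl = sum_outer(demo_values, demo_keys)
combined = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(w_zero_shot, delta_w_icl)]
via_weights = mat_vec(combined, query)
```

  • Both routes give the same output, which is the sense in which the demonstrations act like a weight update on top of the zero-shot weights.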
Finetuned language models are zero-shot learners
  • This paper by Wei et al. from Google in ICLR 2022 introduces Finetuned LAnguage Net (FLAN), which utilizes a simple method for improving the zero-shot learning abilities of language models.
  • They show that instruction tuning – finetuning language models on a collection of datasets described via instructions – substantially improves zero-shot performance on unseen tasks.
  • They take a 137B parameter pretrained language model, namely LaMDA-PT, and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. They evaluate this instruction-tuned model, FLAN, on unseen task types. This process is illustrated below with a couple of examples:

  • FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that they evaluate.
  • FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
  • Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
  • The figure below from the paper shows instruction tuning as a simple method that combines appealing aspects of both the pretrain–finetune and prompting paradigms by using supervision via finetuning to improve language model’s responses to inference-time text interactions. Their empirical results demonstrate promising abilities of language models to perform tasks described purely via instructions.
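  • To make “verbalized via natural language instruction templates” concrete, here is a minimal sketch of turning one labeled example into several instruction-style training prompts. The templates and the `verbalize` helper are hypothetical and are not taken from the FLAN codebase.

```python
# Hypothetical instruction templates for a natural language inference dataset.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the sentence above, can we conclude that \"{hypothesis}\"?",
    "Read the following and decide if the hypothesis follows.\n{premise}\nHypothesis: {hypothesis}",
]


def verbalize(example, templates):
    """Render one example through every template, keeping the target unchanged."""
    fields = {k: v for k, v in example.items() if k != "target"}
    return [(t.format(**fields), example["target"]) for t in templates]


example = {
    "premise": "The dog is sleeping on the couch.",
    "hypothesis": "An animal is resting.",
    "target": "yes",
}
instruction_examples = verbalize(example, NLI_TEMPLATES)
```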

Learning to summarize from human feedback
  • As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about – summary quality.
  • This paper by Stiennon et al. from OpenAI in NeurIPS 2020 introduces Reinforcement Learning from Human Feedback (RLHF), a framework that shows that it is possible to significantly improve summary quality by training a model to optimize for human preferences.
  • They collect a large, high-quality dataset of human preferences/comparisons between summaries, train a reward model via supervised learning to predict the human-preferred summary, and use that reward model as the reward function for fine-tuning a summarization policy, initialized from a large pretrained model (they use GPT-3), with reinforcement learning. Specifically, they train a policy via reinforcement learning (RL) to maximize the score given by the reward model; the policy generates a token of text at each ‘time step’, and is updated using the proximal policy optimization (PPO) algorithm based on the reward model’s reward given to the entire generated summary. They can then gather more human data using samples from the resulting policy, and repeat the process.
  • Empirically, RLHF tends to perform better than supervised fine-tuning. This is because supervised fine-tuning uses a token-level loss (that can be summed or averaged over the text passage), and RLHF takes the entire text passage, as a whole, into account.
  • They apply the method to a version of the TL;DR dataset of Reddit posts and find that their models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone.
  • Their models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning.
  • They conduct extensive analyses to understand their human feedback dataset and fine-tuned models. They establish that their reward model generalizes to new datasets, and that optimizing their reward model results in better summaries than optimizing ROUGE according to humans.
  • The key takeaway here is to pay closer attention to how the training loss affects the model behavior that is actually desired.
  • The graph below from the paper shows the fraction of the time humans prefer summaries from variations of the trained models over the human-generated reference summaries on the TL;DR dataset.
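  • The reward model above is trained on comparisons: it should score the human-preferred summary higher than the rejected one. A common way to express this is the pairwise loss \(-\log \sigma(r_{\text{preferred}} - r_{\text{rejected}})\). The sketch below computes that loss for a single comparison from two scalar rewards; the function name is an assumption, and a real reward model would produce these scalars from a neural network.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Return -log(sigmoid(reward_preferred - reward_rejected)), computed stably."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


loss_agrees = pairwise_preference_loss(2.0, 0.0)     # model agrees with the human
loss_disagrees = pairwise_preference_loss(0.0, 2.0)  # model disagrees with the human
loss_tie = pairwise_preference_loss(1.0, 1.0)        # model is indifferent
```

  • The loss is small when the reward model agrees with the human label, equals \(\log 2\) when it is indifferent, and grows roughly linearly when it disagrees.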

Training language models to follow instructions with human feedback
  • Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.
  • This paper by Ouyang et al. from OpenAI in 2022 introduces InstructGPT, a model that aligns language models with user intent on a wide range of tasks by fine-tuning with human feedback.
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which they use to fine-tune GPT-3 using supervised fine-tuning (SFT). This process is referred to as “instruction tuning” by other papers such as Wei et al. (2022).
  • They then collect a dataset of rankings of model outputs, which they use to further fine-tune this supervised model using Reinforcement Learning from Human Feedback (RLHF).
  • In human evaluations on their prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.
  • Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, their results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
  • It is important to note that ChatGPT is trained using the same methods as InstructGPT (using SFT followed by RLHF), but is fine-tuned from a model in the GPT-3.5 series.
  • Furthermore, the fine-tuning process proposed in the paper isn’t without its challenges. First, we need a significant volume of demonstration data. For instance, in the InstructGPT paper, they used 13k instruction-output samples for supervised fine-tuning, 33k output comparisons for reward modeling, and 31k prompts without human labels as input for RLHF. Second, fine-tuning comes with an alignment tax, also described as “negative transfer”: the process can lead to lower performance on certain critical tasks. (There’s no free lunch after all.) The same InstructGPT paper found that RLHF led to performance regressions (relative to the GPT-3 base model) on public NLP tasks like SQuAD, HellaSwag, and WMT 2015 French to English. A potential workaround is to have several smaller, specialized models that excel at narrow tasks.
  • The figure below from the paper illustrates the three steps of training InstructGPT: (1) SFT, (2) reward model training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train the respective model in the diagram. In Step 2, boxes A-D are samples from the SFT model that get ranked by labelers.
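  • One way to turn a labeler's ranking in Step 2 into reward model training data is to expand it into pairwise comparisons. The sketch below assumes the outputs are listed best-first; `ranking_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand a best-first ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the earlier (better) item comes first.
    return list(combinations(ranked_outputs, 2))


# Four sampled outputs, already ordered by the labeler from best to worst.
comparison_pairs = ranking_to_pairs(["D", "C", "A", "B"])
```

  • A ranking of four outputs yields six comparisons, so ranking several outputs at once is a more efficient use of labeler time than collecting one comparison per prompt.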

Constitutional AI: Harmlessness from AI Feedback
  • As AI systems become more capable, we would like to enlist their help to supervise other AIs.
  • This paper by Bai et al. from Anthropic in 2022 experiments with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so they refer to the method as ‘Constitutional AI’.
  • The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase they sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, they sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences.
  • They then train with RL using the preference model as the reward signal, i.e. they use ‘RL from AI Feedback’ (RLAIF). As a result they are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
  • The figure below from the paper shows the basic steps of their Constitutional AI (CAI) process, which consists of both a supervised learning (SL) stage, consisting of the steps at the top, and a Reinforcement Learning (RL) stage, shown as the sequence of steps at the bottom of the figure. Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’. The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.

  • The graph below shows harmlessness versus helpfulness Elo scores (higher is better, only differences are meaningful) computed from crowdworkers’ model comparisons for all 52B RL runs. Points further to the right are later steps in RL training. The Helpful and HH models were trained with human feedback as in [Bai et al., 2022], and exhibit a tradeoff between helpfulness and harmlessness. The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness. The crowdworkers evaluating these models were instructed to prefer less evasive responses when both responses were equally harmless; this is why the human feedback-trained Helpful and HH models do not differ more in their harmlessness scores.
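  • The supervised stage described above (sample, self-critique, revise, then fine-tune on the revisions) can be sketched as a loop around a text generator. Everything here is illustrative: `generate` is a hypothetical callable standing in for the model, the prompt wording is made up, and the loop walks through the supplied principles in order, which is an assumption of this sketch.

```python
def critique_and_revise(generate, prompt, principles):
    """Produce a revised response to use as a supervised fine-tuning target."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return {"prompt": prompt, "response": response}


# Stub generator so the sketch runs without a model; it only records calls.
calls = []


def fake_generate(text):
    calls.append(text)
    return f"output-{len(calls)}"


sample = critique_and_revise(
    fake_generate, "How do I pick a lock?", ["Be harmless.", "Be non-evasive."]
)
```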

RoFormer: Enhanced Transformer with Rotary Position Embedding
  • Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence.
  • This paper by Su et al. from Zhuiyi Technology Co., Ltd. in 2022 first investigates various methods to integrate positional information into the learning process of transformer-based language models. Then, they propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
  • RoPE thus takes relative positions into account, which means that attention scores are affected by the distance between two tokens (rather than their absolute indices), and this acts as a decay: the larger the distance between two words, the smaller the effect. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding.
  • The following figure from the paper shows the implementation of Rotary Position Embedding (RoPE).

  • Also, RoPE is multiplicative in nature; so instead of “shifting” the word embedding by addition (similar to the “bias” term in neural networks, which has a shifting effect), it “scales” the effect due to rotation.
  • You can use RoPE with “linear attention” (a type of efficient attention that has \(O(N)\) complexity compared to \(O(N^2)\) in regular attention).
  • Finally, they evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Their experiments show that it consistently outperforms its alternatives. Furthermore, they provide a theoretical analysis to explain some experimental results.
  • Hugging Face docs.
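  • A minimal sketch of the rotation itself, assuming an even embedding dimension, adjacent dimensions paired together, and the frequency base of 10000 (the function names are hypothetical). It shows the property discussed above: the query-key dot product depends only on the relative offset between positions.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    dim = len(vector)  # assumed to be even
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at different absolute positions.
score_near_start = dot(rope(query, 3), rope(key, 1))
score_far_along = dot(rope(query, 10), rope(key, 8))
```

  • The two scores match up to floating-point error, even though the absolute positions differ.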
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
  • Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? They first show that extrapolation can be enabled by simply changing the position representation method, though they find that current methods do not allow for efficient extrapolation.
  • This paper by Press et al. from University of Washington and Facebook AI Research in ICLR 2022 introduces a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance.
  • They show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory.
  • ALiBi’s inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
  • The figure below from the paper shows that when computing attention scores for each head, their linearly biased attention method, ALiBi, adds a constant bias (right) to each attention score (\(\mathbf{q}_i \cdot \mathbf{k}_j\), left). As in the unmodified attention sublayer, the softmax function is then applied to these scores, and the rest of the computation is unmodified. \(m\) is a head-specific scalar that is set and not learned throughout training. They show that their method for setting \(m\) values generalizes to multiple text domains, models and training compute budgets. When using ALiBi, they do not add positional embeddings at the bottom of the network.
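  • The bias described above can be sketched as follows. The slope schedule is the geometric sequence the paper describes for head counts that are powers of two (for 8 heads: \(1/2, 1/4, \ldots, 1/2^8\)); the function names and the use of \(-\infty\) for future positions (causal masking) are assumptions of this sketch.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes m: a geometric sequence starting at 2^(-8/num_heads)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to q_i . k_j: -slope * (i - j) for j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias_first_head = alibi_bias(3, slopes[0])
```

  • Because the penalty depends only on the distance \(i - j\), the same bias pattern extends to sequence lengths that were never seen during training.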

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model – with outrageous numbers of parameters – but a constant computational cost.
  • This paper by Fedus et al. from Google in JMLR 2022 introduces the Switch Transformer which seeks to address the lack of widespread adoption of MoE which has been hindered by complexity, communication costs, and training instability.
  • They simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Their proposed training techniques help wrangle the instabilities and they show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
  • The guiding design principle for Switch Transformers is to maximize the parameter count of a Transformer model (Vaswani et al., 2017) in a simple and computationally efficient way. The benefit of scale was exhaustively studied in Kaplan et al. (2020) which uncovered powerlaw scaling with model size, data set size and computational budget. Importantly, this work advocates training large models on relatively small amounts of data as the computationally optimal approach. Heeding these results, they investigate a fourth axis: increase the parameter count while keeping the floating point operations (FLOPs) per example constant. Their hypothesis is that the parameter count, independent of total computation performed, is a separately important axis on which to scale. They achieve this by designing a sparsely activated model that efficiently uses hardware designed for dense matrix multiplications such as GPUs and TPUs. In their distributed training setup, their sparsely activated layers split unique weights on different devices. Therefore, the weights of the model increase with the number of devices, all while maintaining a manageable memory and computational footprint on each device.
  • Their switch routing proposal reimagines MoE. Shazeer et al. (2017) conjectured that routing to \(k > 1\) experts was necessary in order to have non-trivial gradients to the routing functions. The authors intuited that learning to route would not work without the ability to compare at least two experts. Ramachandran and Le (2018) went further to study the top-\(k\) decision and found that higher \(k\)-values in lower layers in the model were important for models with many routing layers. Contrary to these ideas, they instead use a simplified strategy where they route to only a single expert. They show this simplification preserves model quality, reduces routing computation and performs better. This \(k = 1\) routing strategy is later referred to as a Switch layer.
  • The following figure from the paper illustrates the Switch Transformer encoder block. The dense feed-forward network (FFN) layer present in the Transformer is replaced with a sparse Switch FFN layer (light blue). The layer operates independently on the tokens in the sequence. They diagram two tokens (\(x_1\) = “More” and \(x_2\) = “Parameters” below) being routed (solid lines) across four FFN experts, where the router independently routes each token. The Switch FFN layer returns the output of the selected FFN multiplied by the router gate value (dotted line).

  • They design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where they measure gains over the mT5-Base version across all 101 languages.
  • Finally, they advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model.
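  • A minimal sketch of the \(k = 1\) routing step described above, written in plain Python for illustration. The function names (`softmax`, `switch_layer`), the toy experts and the router weights are assumptions made for this example and are not taken from the paper's implementation; load balancing, capacity limits and distributed placement are omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(tokens, router_weights, experts):
    """Route each token to exactly one expert (k = 1) and scale by the gate value."""
    outputs, choices = [], []
    for x in tokens:
        # Router logits: one score per expert, computed as a dot product with x.
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
        probs = softmax(logits)
        best = max(range(len(experts)), key=lambda i: probs[i])
        # Only the selected expert runs; its output is scaled by the router probability.
        outputs.append([probs[best] * v for v in experts[best](x)])
        choices.append(best)
    return outputs, choices

# Toy experts (hypothetical stand-ins for FFNs): each scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
tokens = [[2.0, 0.5], [0.1, 3.0]]
outputs, choices = switch_layer(tokens, router_weights, experts)
```

  • Because every token activates a single expert, adding experts increases the parameter count while the per-token compute stays roughly constant, which is the scaling axis the paper studies.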
Locating and Editing Factual Associations in GPT
  • This paper by Meng et al. from MIT CSAIL, Northeastern University, and Technion in NeurIPS 2022 analyzes the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations.
  • They first develop a causal intervention for identifying neuron activations that are decisive in a model’s factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. Specifically, they perform the following steps to locate factual retrieval:
    • To identify decisive computations, they introduce a method called Causal Tracing. By isolating the causal effect of individual states within the network while processing a factual statement, one can trace the path followed by information through the network.

    • Causal traces work by running a network multiple times, introducing corruptions to frustrate the computation, and then restoring individual states in order to identify the information that restores the results. Tracing can be used to test any individual state or combinations of states. They use carefully designed traces to identify a specific small set of MLP module computations that mediate retrieval of factual associations.
    • Then they check this finding by asking: can the MLP module computations be altered to edit a model’s belief in a specific fact?
  • To test their hypothesis that these computations correspond to factual association recall, they modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). Specifically, they perform the following steps to edit factual storage:
    • To modify individual facts within a GPT model, they introduce a method called ROME, or Rank-One Model Editing. It treats an MLP module as a simple key-value store: for example, if the key encodes a subject and the value encodes knowledge about the subject, then the MLP can recall the association by retrieving the value corresponding to the key. ROME uses a rank-one modification of the MLP weights to directly write in a new key-value pair.

    • The figure above illustrates a single MLP module within a transformer. The D-dimensional vector at (b) acts as the key that represents a subject to know about, and the H-dimensional output at (c) acts as the value that encodes learned properties about the subject. ROME inserts a new association by making a rank-one change to the matrix (d) that maps from keys to values.
    • Note that ROME assumes a linear view of memory within a neural network rather than an individual-neuron view. This linear perspective sees individual memories as rank-one slices of parameter space. Experiments confirm this view: when a rank-one update is applied to an MLP module in the computational center identified by causal tracing, associations of individual facts can be updated in a way that is both specific and generalized.
  • They find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, they also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another.
  • Their results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing.
  • Project page.
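  • A minimal sketch of the key-value view behind a rank-one edit, under a simplifying assumption that is not the paper's full method: the matrix is updated as \(W' = W + (v^* - W k^*)\,k^{*\top} / (k^{*\top} k^*)\), which forces \(W' k^* = v^*\) and leaves keys orthogonal to \(k^*\) untouched. The published method additionally accounts for the statistics of other keys and solves for the target value; the function names and toy numbers below are hypothetical.

```python
def matvec(W, k):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def rank_one_edit(W, k_star, v_star):
    """Return W' = W + (v* - W k*) k*^T / (k*^T k*), so that W' k* = v*."""
    residual = [v - wk for v, wk in zip(v_star, matvec(W, k_star))]
    norm_sq = sum(x * x for x in k_star)
    return [
        [w + r * x / norm_sq for w, x in zip(row, k_star)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # toy map from 3-dim keys to 2-dim values
k_star = [1.0, 2.0, 0.0]                 # key standing in for the edited subject
v_star = [5.0, -3.0]                     # new value to associate with that key
W_new = rank_one_edit(W, k_star, v_star)

k_other = [2.0, -1.0, 4.0]               # orthogonal to k_star, so its value is unchanged
```

  • Here `matvec(W_new, k_star)` returns the new value, while `matvec(W_new, k_other)` matches `matvec(W, k_other)`, a small-scale picture of an edit that is specific to one key.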
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
  • There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact.
  • This paper by Tay et al. from Google Research and DeepMind in ICLR 2022 presents scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear whether this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm.
  • The key findings of this paper are as follows:
    1. They show that aside from only the model size, model shape matters for downstream fine-tuning;
    2. Scaling protocols operate differently at different compute regions;
    3. Widely adopted T5-base and T5-large sizes are Pareto-inefficient.
  • To this end, they present improved scaling protocols whereby their redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model.
  • In terms of scaling recommendations, they recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally be more efficient compared to a large model. They generally find that, regardless of size, even if absolute performance might increase as layers continue to be stacked, the relative gain in Pareto-efficiency diminishes as the number of layers increases, converging at 32 to 36 layers.
  • They publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
Holistic Evaluation of Language Models
  • Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
  • This technical report by Liang et al. from Stanford’s Center for Research on Foundation Models (CRFM) presents Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
  • First, HELM taxonomizes the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. They then select a broad subset based on coverage and feasibility, noting what’s missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness).
  • The figure below from the paper shows the importance of the taxonomy to HELM. Previous language model benchmarks (e.g. SuperGLUE, EleutherAI LM Evaluation Harness, BIG-Bench) are collections of datasets, each with a standard task framing and canonical metric, usually accuracy (left). In comparison, HELM takes a top-down approach of first explicitly stating what is to be evaluated (i.e. scenarios and metrics) by working through their underlying structure. Given this stated taxonomy, they make deliberate decisions on what subset to implement and evaluate, which makes explicit what is missed (e.g. coverage of languages beyond English).

  • Second, HELM adopts a multi-metric approach: they measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don’t fall by the wayside, and that trade-offs are clearly exposed. They also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation).
  • The figure below from the paper shows HELM’s multi-metric measurement for each use case. In comparison to most prior benchmarks of language technologies, which primarily center accuracy and often relegate other desiderata to their own bespoke datasets (if at all), HELM takes a multi-metric approach. This foregrounds metrics beyond accuracy and allows one to study the trade-offs between the metrics.

  • Third, HELM conducts a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation.
  • Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. HELM improves this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions.
  • HELM’s evaluation surfaces 25 top-level findings. For full transparency, they release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. They intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
  • Project page.
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
  • In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection.
  • This paper by Laban et al. from UC Berkeley and Microsoft in TACL 2022 proposes \(SummaC\) and revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence level) and inconsistency detection (document level).
  • The figure below from the paper shows an example document with an inconsistent summary. When running each sentence pair \((D_i, S_j)\) through an NLI model, \(S_3\) is not entailed by any document sentence. However, when running the entire (document, summary) at once, the NLI model incorrectly predicts that the document highly entails the entire summary.

  • They provide a highly effective and light-weight method called \(SummaC_{Conv}\) that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences.
  • The figure below from the paper shows a diagram of the \(SummaC_{ZS}\) (top) and \(SummaC_{Conv}\) (bottom) models. Both models utilize the same NLI Pair Matrix (middle) but differ in their processing to obtain a score. \(SummaC_{ZS}\) is Zero-Shot, and does not have trained parameters. \(SummaC_{Conv}\) uses a convolutional layer trained on a binned version of the NLI Pair Matrix.

  • On their newly introduced benchmark called \(SummaC\) (Summary Consistency), consisting of six large inconsistency detection datasets, \(SummaC_{Conv}\) obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5 percentage point improvement compared to prior work.
  • Code.
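  • A minimal sketch of how an NLI Pair Matrix can be turned into a summary-level score, assuming the zero-shot variant takes the best-supporting document sentence for each summary sentence and then averages, and that the learned variant consumes a binned histogram of each column. The function names, the bin count and the toy scores are illustrative assumptions, and the trained convolutional layer itself is not shown.

```python
def summac_zs(pair_matrix):
    """pair_matrix[i][j]: NLI score of summary sentence j given document sentence i."""
    n_summary = len(pair_matrix[0])
    # For each summary sentence, keep the score of its best-supporting document sentence.
    per_sentence = [max(row[j] for row in pair_matrix) for j in range(n_summary)]
    # Average over summary sentences to get one score for the whole summary.
    return sum(per_sentence) / n_summary, per_sentence

def binned_column(pair_matrix, j, n_bins=5):
    """Histogram of column j over equal-width bins on [0, 1] (fixed-size input for a learned layer)."""
    counts = [0] * n_bins
    for row in pair_matrix:
        idx = min(int(row[j] * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Toy matrix: 3 document sentences (rows) by 3 summary sentences (columns).
pair_matrix = [
    [0.90, 0.10, 0.05],
    [0.20, 0.80, 0.10],
    [0.05, 0.30, 0.15],
]
score, per_sentence = summac_zs(pair_matrix)
```

  • In this toy matrix the third summary sentence has a low maximum score, mirroring the case where a summary sentence is not entailed by any document sentence.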
InCoder: A Generative Model for Code Infilling and Synthesis
  • Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined.
  • This paper by Fried et al. in ICLR 2023 from FAIR, UW, UC Berkeley, TTI-Chicago, and CMU introduces InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling).
  • InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context.
  • InCoder is the first generative model that is able to directly perform zero-shot code infilling, which they evaluate on challenging tasks such as type inference, comment generation, and variable re-naming.
  • The figure below from the paper shows that during training time (top), InCoder’s causal masking objective samples one or more spans of code in training documents (in the upper left figure, a single span) and moves these spans to the end of the document, with their original location denoted by special mask sentinel tokens. An autoregressive language model is trained to produce these entire masked documents, allowing it to learn to generate insertion text conditioned on bidirectional context. At inference time (bottom), InCoder can perform a variety of code editing and infilling tasks in a zero-shot fashion by inserting mask tokens at desired locations and allowing the model to generate code to insert there. All examples shown are real outputs from the InCoder-6.7B model, with the regions inserted by the model highlighted in orange.

  • They find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale.
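  • A minimal sketch of the causal masking transformation described above for a single span, operating on a list of tokens. The sentinel strings (`<Mask:0>`, `<EOM>`) and the helper names are assumptions for illustration; span sampling and tokenization are omitted.

```python
def causal_mask(tokens, start, end, mask_id=0):
    """Move tokens[start:end] to the end of the document, leaving a sentinel in its place."""
    sentinel = f"<Mask:{mask_id}>"
    left, span, right = tokens[:start], tokens[start:end], tokens[end:]
    return left + [sentinel] + right + [sentinel] + span + ["<EOM>"]

def infill(masked_tokens, mask_id=0):
    """Invert causal_mask: put the span that follows the second sentinel back in place."""
    sentinel = f"<Mask:{mask_id}>"
    first = masked_tokens.index(sentinel)
    second = masked_tokens.index(sentinel, first + 1)
    eom = masked_tokens.index("<EOM>", second)
    span = masked_tokens[second + 1:eom]
    return masked_tokens[:first] + span + masked_tokens[first + 1:second]

doc = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked = causal_mask(doc, start=3, end=6)
restored = infill(masked)
```

  • At inference time the same layout applies: the text before and after the insertion point forms the prefix, and a left-to-right model generates the span after the second sentinel, so it sees context on both sides of the gap.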
Large Language Models are Zero-Shot Reasoners
  • This paper by Kojima et al. from University of Tokyo and Google Brain in NeurIPS 2022 explores the zero-shot reasoning capabilities of large language models (LLMs) through a simple technique called Zero-shot Chain of Thought (Zero-shot-CoT).
  • Zero-shot-CoT adds the prompt “Let’s think step by step” before the answer to elicit multi-step reasoning from LLMs, without requiring any task-specific examples like prior work in Chain of Thought (CoT) prompting.
  • The figure below from the paper shows example inputs and outputs of GPT-3 with (a) standard Few-shot ([Brown et al., 2020]), (b) Few-shot-CoT ([Wei et al., 2022]), (c) standard Zero-shot, and (d) Zero-shot-CoT. Similar to Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (blue text) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT, which uses step-by-step reasoning examples per task, Zero-shot-CoT does not need any examples and just uses the same prompt “Let’s think step by step” across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).

  • Experiments across 12 diverse reasoning tasks (arithmetic, symbolic, commonsense, logical) show Zero-shot-CoT substantially improves over standard zero-shot prompting. For example, on MultiArith accuracy increases from 17.7% to 78.7% for InstructGPT.
  • Zero-shot-CoT also shows improvements with scaling model size, akin to few-shot CoT prompting, suggesting the single prompt unlocks latent multi-step reasoning capabilities inside LLMs.
  • The simplicity and versatility of Zero-shot-CoT across tasks, compared to careful per-task prompt engineering in prior work, highlights the surprisingly broad cognitive capabilities hidden in LLMs.
  • The authors suggest Zero-shot-CoT serves as a strong zero-shot baseline and encourages further analysis of the multi-task, broad cognitive abilities of LLMs before crafting specialized prompts or datasets.
  • Code.
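  • A minimal sketch of how the two zero-shot prompts differ. The `Q:`/`A:` template and the function names are assumptions for illustration; the call to a language model and the extraction of the final answer from the generated reasoning are not shown.

```python
TRIGGER = "Let's think step by step."

def zero_shot_prompt(question):
    # Standard zero-shot: the model is asked for the answer directly.
    return f"Q: {question}\nA:"

def zero_shot_cot_prompt(question):
    # Zero-shot-CoT: the same trigger sentence is placed before the answer for every task.
    return f"Q: {question}\nA: {TRIGGER}"

question = "A juggler has 16 balls and half of them are golf balls. How many golf balls are there?"
plain = zero_shot_prompt(question)
cot = zero_shot_cot_prompt(question)
```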
An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
  • This paper by Wu et al. from UCL, Harbin Institute of Technology, and University of Edinburgh in EMNLP 2022 proposes Efficient Memory-Augmented Transformer (EMAT), which augments Transformer models with an efficient key-value memory module to leverage external knowledge, outperforming baselines on knowledge-intensive NLP tasks such as open-domain QA.
  • EMAT encodes external knowledge (e.g. question-answer pairs from PAQ) into dense key and value representations to build the memory. Keys are encoded questions, values are encoded answers.
  • At inference time, EMAT produces a query to retrieve relevant keys and values from memory using fast maximum inner product search. The retrieved representations are integrated into the Transformer encoder to inform generation.
  • EMAT requires only a single inference pass through the Transformer, allowing memory access to run concurrently for efficiency.
  • The figure below from the paper shows the architecture of the proposed Efficient Key-Value Memory Augmented Transformers (EMAT): factual knowledge is stored in a key-value memory where keys and values correspond to questions and answers, respectively; during inference, the model retrieves information from the memory via MIPS and uses it to condition the generation process.

  • EMAT introduces pre-training objectives for learning informative key-value representations and an implicit strategy to integrate multiple memory slots.
  • Experiments on open-domain QA, dialogue, and long-form QA show EMAT significantly outperforms vanilla Transformers while retaining high throughput. It also outperforms retrieval-augmented models on some tasks while being much faster.
  • Ablations demonstrate the importance of the pre-training objectives. Qualitative analysis shows EMAT retrieves useful information but does not just copy from memory.
  • Main limitations are the need for weak supervision to train the retriever, and large memory requirements.
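  • A minimal sketch of the key-value lookup described above, using exhaustive inner-product scoring as a stand-in for a fast MIPS index. The function name, the toy vectors and the `top_k` value are assumptions for illustration; how the retrieved values are integrated into the encoder is not shown.

```python
def mips(query, keys, top_k=2):
    """Return indices of the top_k keys with the largest inner product with the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Toy memory: keys stand in for encoded questions, values for encoded answers.
memory_keys = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
memory_values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

query = [0.9, 0.3]  # stand-in for the query vector produced by the model
hits = mips(query, memory_keys)
retrieved_values = [memory_values[i] for i in hits]
```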
Unsupervised Dense Information Retrieval with Contrastive Learning
  • This paper by Izacard et al., published in Transactions on Machine Learning Research (TMLR) in August 2022, presents a contrastive learning approach to training dense retrievers without supervision.
  • The study focuses on overcoming the limitations of dense retrievers that utilize neural networks, which perform well on large training datasets but struggle with new applications lacking specific training data. Traditional methods like BM25, based on term-frequency, often outperform these dense retrievers in unsupervised settings.
  • The authors explore the application of contrastive learning for training unsupervised dense retrievers. This approach, inspired by successful applications in computer vision, is examined to see if it can match or exceed the performance of term-frequency methods like BM25.
    • Contrastive learning is an approach that relies on the fact that every document is, in some way, unique. This signal is the only information available in the absence of manual supervision. A contrastive loss is used to learn by discriminating between documents. This loss compares either positive (from the same document) or negative (from different documents) pairs of document representations. Formally, given a query \(q\) with an associated positive document \(k_{+}\), and a pool of negative documents \(\left(k_i\right)_{i=1, \ldots, K}\), the contrastive InfoNCE loss is defined as:
    \[\mathcal{L}\left(q, k_{+}\right)=-\log \frac{\exp \left(s\left(q, k_{+}\right) / \tau\right)}{\exp \left(s\left(q, k_{+}\right) / \tau\right)+\sum_{i=1}^K \exp \left(s\left(q, k_i\right) / \tau\right)},\]
    • where \(\tau\) is a temperature parameter. This loss encourages positive pairs to have high scores and negative pairs to have low scores. Another interpretation of this loss function is the following: given the query representation \(q\), the goal is to recover, or retrieve, the representation \(k_{+}\) corresponding to the positive document, among all the negatives \(k_i\). In the following, we refer to the left-hand side representations in the score \(s\) as queries and the right-hand side representations as keys.
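  • A minimal sketch of the InfoNCE loss for a single query, written as the negative log of the softmax probability assigned to the positive key. The function name, the toy scores and the default temperature are assumptions for illustration.

```python
import math

def info_nce(pos_score, neg_scores, tau=0.05):
    """Negative log-probability of the positive among the positive and the negatives."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# The loss is small when the positive outscores every negative, and large otherwise.
easy = info_nce(pos_score=0.9, neg_scores=[0.1, 0.2, 0.0])
hard = info_nce(pos_score=0.2, neg_scores=[0.1, 0.9, 0.0])
```

  • When all scores are equal the loss reduces to \(\log(K + 1)\), the value for a uniform guess among the positive and the \(K\) negatives.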
  • A critical ingredient for this training paradigm is to obtain positive pairs from a single text document, which is done as follows:
    • A crucial element of contrastive learning is how to build positive pairs from a single input. In computer vision, this step relies on applying two independent data augmentations to the same image, resulting in two “views” that form a positive pair. While we consider similar independent text transformation, we also explore dependent transformations designed to reduce the correlation between views.
    • Inverse Cloze Task is a data augmentation that generates two mutually exclusive views of a document, introduced in the context of retrieval by Lee et al. (2019). The first view is obtained by randomly sampling a span of tokens from a segment of text, while the complement of the span forms the second view. Specifically, given a sequence of text \(\left(w_1, \ldots, w_n\right)\), ICT samples a span \(\left(w_a, \ldots, w_b\right)\), where \(1 \leq a \leq b \leq n\), uses the tokens of the span as the query and the complement \(\left(w_1, \ldots, w_{a-1}, w_{b+1}, \ldots, w_n\right)\) as the key. In the original implementation by Lee et al. (2019), the span corresponds to a sentence, and is kept in the document 10% of the time to encourage lexical matching. The Inverse Cloze Task is closely related to the Cloze task, which uses the span complement \(\left(w_1, \ldots, w_{a-1}, w_{b+1}, \ldots, w_n\right)\) as the query.
    • Independent cropping is a common independent data augmentation used for images where views are generated independently by cropping the input. In the context of text, cropping is equivalent to sampling a span of tokens. This strategy thus samples independently two spans from a document to form a positive pair. As opposed to the inverse Cloze task, in cropping both views of the example correspond to contiguous subsequences of the original data. A second difference between cropping and ICT is the fact that independent random cropping is symmetric: both the queries and documents follow the same distribution. Independent cropping also leads to overlap between the two views of the data, hence encouraging the network to learn exact matches between the query and document, in a way that is similar to lexical matching methods like BM25. In practice, we can either fix the length of the span for the query and the key, or sample them.
    • Additional data augmentation. Finally, we also consider additional data augmentations such as random word deletion, replacement or masking. We use these perturbations in addition to random cropping.
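  • A minimal sketch contrasting the two ways of building a positive pair described above, on a whitespace-tokenized toy document. The function names, span lengths and random seed are assumptions for illustration; note that the code uses 0-indexed, half-open spans rather than the 1-indexed notation of the text.

```python
import random

def inverse_cloze(tokens, a, b):
    """ICT: the span tokens[a:b] is the query and its complement is the key."""
    return tokens[a:b], tokens[:a] + tokens[b:]

def random_crop(tokens, length, rng):
    # Sample one contiguous span of the requested length.
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]

def independent_crops(tokens, query_len, key_len, rng):
    """Two spans sampled independently from the same document; they may overlap."""
    return random_crop(tokens, query_len, rng), random_crop(tokens, key_len, rng)

rng = random.Random(0)
doc = "contrastive learning builds positive pairs from a single text document".split()

ict_query, ict_key = inverse_cloze(doc, 2, 5)
crop_query, crop_key = independent_crops(doc, 4, 6, rng)
```

  • The ICT views never share a token position, whereas the two crops are both contiguous and can overlap, which is the property the text links to learning lexical matches.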
  • An important aspect of contrastive learning is to sample a large set of negatives. Most standard frameworks differ from each other in terms of how the negatives are handled, and we briefly describe two of them, in-batch negative sampling and MoCo, that we use in this work.
    • Negatives within a batch. A first solution is to generate the negatives by using the other examples from the same batch: each example in a batch is transformed twice to generate positive pairs, and we generate negatives by using the views from the other examples in the batch. We will refer to this technique as “in-batch negatives”. In that case, the gradient is back-propagated through the representations of both the queries and the keys. A downside of this approach is that it requires extremely large batch sizes to work well, with prior work reporting improvements in the context of information retrieval with up to 8192 negatives. This method has been widely used to train information retrieval models with supervised data and was also considered when using ICT to pre-train retrievers.
    • Negative pairs across batches. An alternative approach is to store representations from previous batches in a queue and use them as negative examples in the loss (Wu et al., 2018). This allows for a smaller batch size but slightly changes the loss by making it asymmetric between “queries” (one of the views generated from the elements of the current batch) and “keys” (the elements stored in the queue). The gradient is only backpropagated through the “queries”, and the representations of the “keys” are considered fixed. In practice, the features stored in the queue from previous batches come from previous iterations of the network. This leads to a drop in performance when the network changes rapidly during training. Instead, He et al. (2020) proposed to generate representations of keys from a second network that is updated more slowly. This approach, called MoCo, considers two networks: one for the keys, parametrized by \(\theta_k\), and one for the queries, parametrized by \(\theta_q\). The parameters of the query network are updated with backpropagation and stochastic gradient descent, similarly to when using in-batch negatives, while the parameters of the key network, or momentum encoder, are updated from the parameters of the query network by using an exponential moving average: \(\theta_k \leftarrow m \theta_k+(1-m) \theta_q\) where \(m\) is the momentum parameter that takes its value in \([0,1]\).
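  • A minimal sketch of the two MoCo ingredients described above: the exponential moving average update of the key encoder and a fixed-size queue of keys from previous batches. Parameters are plain lists of floats, the queue holds string placeholders instead of key representations, and the step size, momentum and queue length are assumptions for illustration.

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

queue = deque(maxlen=4)  # the oldest keys are evicted once the queue is full
theta_q = [1.0, -2.0]    # query encoder parameters (updated by gradient descent)
theta_k = [0.0, 0.0]     # key encoder parameters (updated only by the moving average)

for step in range(3):
    # Stand-in for a gradient step on the query encoder.
    theta_q = [p + 0.1 for p in theta_q]
    # The key encoder follows the query encoder slowly.
    theta_k = momentum_update(theta_k, theta_q, m=0.9)
    # Keys from this batch become negatives for later batches.
    queue.extend([f"key_{step}_0", f"key_{step}_1"])
```

  • A momentum close to 1 keeps the key encoder nearly constant between steps, so keys stored in the queue stay consistent with the keys computed for the current batch.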
  • They use a transformer network to embed both queries and documents. Alternatively, two different encoders can be used to encode queries and documents respectively as in DPR. Empirically, they observed that using the same encoder, such as in Xiong et al. (2020) and Reimers & Gurevych (2019), generally improves robustness in the context of zero-shot transfer or few-shot learning, while having no impact on other settings.
  • Significant contributions of the paper include demonstrating that contrastive learning can lead to competitive unsupervised retrievers. The model, named Contriever, shows promising results on the BEIR benchmark, outperforming BM25 on 11 out of 15 datasets for Recall@100. The model benefits from a few training examples and achieves better results than models transferred from large datasets like MS MARCO. Ablation studies highlighted that cropping is a more effective approach than the inverse Cloze task for building positive pairs.
  • The implementation details of Contriever reveal that it employs MoCo with random cropping for contrastive learning. This is a deviation from the Inverse Cloze Task (ICT) approach. The training data includes a mix of documents from Wikipedia and CCNet. The model shows strong performance against established benchmarks such as NaturalQuestions and TriviaQA, even in fully unsupervised settings without fine-tuning on MS MARCO or other annotated data.
  • The paper also delves into the realm of multilingual retrieval, a significant area where large labeled datasets are typically scarce, especially for lower-resource languages. The multilingual model, mContriever, demonstrates strong performance in both fully unsupervised settings and when fine-tuned on English data. This model is capable of effective cross-lingual retrieval, a significant advancement over traditional lexical matching methods.
  • In summary, the paper introduces Contriever, an unsupervised dense retriever trained using contrastive learning, which effectively handles tasks in information retrieval, including multilingual and cross-lingual retrieval. This approach marks a notable advancement in the field, particularly in settings where large annotated datasets are unavailable.
Implicit Relation Linking for Question Answering over Knowledge Graph
  • This paper by Zhao et al. from Nanjing University and Alibaba Group in ACL 2022 addresses the challenge of linking implicit relations in natural language to knowledge graphs for question answering systems.
  • The authors introduce ImRL, a novel method that links natural language relation phrases to relation paths in knowledge graphs. This approach is significant as it deals with the ambiguity of natural language and the incompleteness of knowledge graphs.
  • The figure below from the paper shows an example of relation linking (RL) to DBpedia. There is no explicit relation between dbr:Dragonaut:_The_Resonance and dbr:Japan. The goal is to implicitly link the phrase “from” to an indirect relation path dbp:publisher \(\rightarrow\) dbo:country.

  • ImRL incorporates a unique path ranking model that aligns textual information in word embeddings with structural information in knowledge graph embeddings. This model is designed to capture the correlation between single relations and relation paths, effectively addressing relation phrases with vague meanings.
  • To enhance the model’s performance, the authors integrate external paraphrase dictionaries using a gated mechanism with attention. This feature injects prior knowledge into the model, aiding in the disambiguation of relation phrases.
  • The figure below from the paper shows an overview of ImRL. The method has two parts: (1) Path generation parses the input question and finds the relation path candidates in the KG, by entity linking, relation identification and candidate generation. (2) Path ranking encodes the relation phrase in the question and path candidates in the KG in the BERT embedding space and RotatE embedding space, utilizes a ranking model to rank those candidates, and takes the one with the highest similarity score as answer. It also leverages a gated mechanism with attention to inject prior knowledge from external dictionaries to help relation disambiguation.

  • The paper presents a comprehensive evaluation using two benchmark datasets and a newly created dataset, demonstrating that ImRL significantly outperforms existing state-of-the-art methods, particularly in scenarios involving implicit relation linking.
  • The authors’ experiments and results highlight ImRL’s effectiveness in dealing with the inherent challenges of knowledge-based question answering systems, such as handling incomplete knowledge graphs and interpreting ambiguous natural language expressions.


ReAct: Synergizing Reasoning and Acting in Language Models
  • While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g., chain-of-thought prompting) and acting (e.g., action plan generation) have primarily been studied as separate topics.
  • This paper by Yao et al. from Princeton and Google Brain in ICLR 2023 proposes ReAct, an approach that Reasons and Acts by exploring the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information.
  • They apply ReAct to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components.
  • Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces.
  • On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.
  • The figure below from the paper shows (1) a comparison of four prompting methods, (a) standard, (b) Chain-of-Thought (CoT, Reason Only), (c) Act-only, and (d) ReAct (Reason+Act), solving a HotpotQA question, and (2) a comparison of (a) Act-only and (b) ReAct prompting to solve an AlfWorld game. Note that in both domains, in-context examples are omitted from the prompt, and only the task-solving trajectories generated by the model (Act, Thought) and the environment (Obs) are shown.
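  • A minimal sketch of the interleaved Thought/Action/Observation loop described above. The "model" here is a scripted stand-in rather than an LLM, the `search` tool is a dictionary lookup rather than a Wikipedia API, and all names (`react_loop`, `scripted_model`, `FACTS`) are hypothetical.

```python
def react_loop(question, model, tools, max_steps=5):
    """Alternate model-generated thoughts and actions with tool observations until 'finish'."""
    trajectory = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = model(trajectory)
        trajectory.append(f"Thought: {thought}")
        trajectory.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, trajectory
        # The environment's response is appended so the next thought can use it.
        trajectory.append(f"Observation: {tools[action](argument)}")
    return None, trajectory

# Hypothetical stand-ins for an LLM and an external knowledge source.
FACTS = {"Eiffel Tower": "The Eiffel Tower is located in Paris."}

def scripted_model(trajectory):
    if not any(line.startswith("Observation:") for line in trajectory):
        return "I should look up the Eiffel Tower.", "search", "Eiffel Tower"
    return "The observation says it is in Paris.", "finish", "Paris"

tools = {"search": lambda query: FACTS.get(query, "No result found.")}
answer, trajectory = react_loop("Where is the Eiffel Tower?", scripted_model, tools)
```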

LLaMA: Open and Efficient Foundation Language Models
  • This paper by Touvron et al. from Meta AI in 2023 introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
  • They train LLaMA models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
  • In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. They release all their models to the research community.
  • Please refer to the LLaMA primer for an article on LLaMA.
Alpaca: A Strong, Replicable Instruction-Following Model
  • Stanford’s Alpaca 7B is a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. In the authors’ preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce.
Transformer models: an introduction and catalog
  • In the past few years, we have seen the meteoric appearance of dozens of models of the Transformer family, all of which have funny, but not self-explanatory, names. The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovation in Transformer models.
  • Spreadsheet tabulation of the paper.
  • The following plot from the paper shows the transformers family tree with prevalent models:

  • And, the plot below from the paper shows the timeline for prevalent transformer models:

  • Lastly, the plot below, again from the paper, shows the timeline vs. number of parameters for prevalent transformer models:

Learning to Compress Prompts with Gist Tokens
  • Prompting is now the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and re-encoding the same prompt is computationally inefficient.
  • Finetuning and distillation methods allow for specialization of LMs without prompting, but require retraining the model for each task.
  • This paper by Mu et al. from Stanford in 2023 avoids this trade-off entirely by presenting gisting, which trains an LM to compress prompts into smaller sets of “gist” tokens which can be reused for compute efficiency.
  • Gist models can be easily trained as part of instruction finetuning via a restricted attention mask that encourages prompt compression.
  • On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, storage savings, and minimal loss in output quality.
  • The figure below from the paper shows prompting (top), which retains the multitask capabilities of LMs, but is computationally inefficient. Finetuning/distillation (middle) removes the dependence on prompts, but requires training a model for each task. Gisting (bottom) compresses prompts into a smaller set of gist tokens, saving compute while also generalizing to novel prompts during deployment.
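  • The restricted attention mask can be illustrated with a small Python sketch for the decoder-only case. It assumes the sequence is laid out as prompt tokens, then gist tokens, then the remaining input/output tokens; the function name and the boolean-matrix representation are illustrative choices, not the paper’s implementation.

```python
def gist_attention_mask(n_prompt, n_gist, n_rest):
    """Return mask[q][k] == True where query position q may attend to key position k.

    Layout (an assumption of this sketch): [prompt | gist | rest].
    Starts from a causal mask, then hides the prompt from every position
    after the gist tokens, so the prompt must be compressed into the gist.
    """
    n = n_prompt + n_gist + n_rest
    gist_end = n_prompt + n_gist
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            prompt_hidden = q >= gist_end and k < n_prompt
            row.append(causal and not prompt_hidden)
        mask.append(row)
    return mask


mask = gist_attention_mask(n_prompt=3, n_gist=1, n_rest=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

  • In the printed grid, the gist row still sees the prompt columns, while the rows after it see only the gist column and later tokens.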

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  • This paper by Zhang et al. from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and UCLA presents LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.
  • Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs.
  • Specifically, they adopt a set of learnable adaption prompts, and prepend them to the input text tokens at higher transformer layers. Then, a zero-init attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserving its pre-trained knowledge.
  • With efficient training, LLaMA-Adapter generates high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Furthermore, their approach can be simply extended to multi-modal input, e.g., images, for image-conditioned LLaMA, which achieves superior reasoning capacity on ScienceQA.
  • Code.
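  • A rough Python sketch of the zero-gating idea for a single query position follows. It assumes the attention scores over the adaption prompts and over the ordinary tokens are normalized separately, and that the prompt part is scaled by a learnable gate initialized to zero; the function names are illustrative, and the released code may differ in details such as how the gate is parameterized.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_gated_weights(prompt_scores, token_scores, gate):
    """Attention weights for one query: gated prompt part + ordinary token part."""
    prompt_part = [gate * w for w in softmax(prompt_scores)]
    token_part = softmax(token_scores)
    return prompt_part + token_part


prompt_scores = [0.3, -1.2]      # scores against 2 adaption prompts
token_scores = [1.0, 0.5, -0.5]  # scores against 3 ordinary tokens

at_init = zero_gated_weights(prompt_scores, token_scores, gate=0.0)
later = zero_gated_weights(prompt_scores, token_scores, gate=0.4)
print(at_init)
print(later)
```

  • With the gate at zero, the adaption prompts contribute nothing and the weights over ordinary tokens are unchanged, which is the sense in which pre-trained behavior is preserved at the start of fine-tuning.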
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  • How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4.
  • This paper by Zhang et al. from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and Rutgers University presents LLaMA-Adapter V2, a parameter-efficient visual instruction model.
  • Specifically, they first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, they propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters.
  • This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
  • During inference, they incorporate additional expert models (e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, their LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions.
  • The figure below from the paper shows the training pipeline of LLaMA-Adapter V2. They introduce several strategies to enhance the capability of LLaMA-Adapter, which enable a parameter-efficient visual instruction model with superior multi-modal reasoning.

LIMA: Less Is More for Alignment
  • Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences.
  • This paper by Zhou et al. from Meta AI, Carnegie Mellon University, University of Southern California, and Tel Aviv University in 2023 measures the relative importance of these two stages by training LIMA, a 65B parameter LLaMA language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.
  • They define the Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples. To that end, they collect a dataset of 1,000 prompts and responses, where the outputs (responses) are stylistically aligned with each other, but the inputs (prompts) are diverse. Specifically, they seek outputs in the style of a helpful AI assistant. They curate such examples from a variety of sources, primarily split into community Q&A forums and manually authored examples. They also collect a test set of 300 prompts and a development set of 50.
  • LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.
  • Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
  • The figure below from the paper shows (left) the human preference evaluation, comparing LIMA to 5 different baselines across 300 test prompts; (right) preference evaluation using GPT-4 as the annotator, given the same instructions provided to humans.

QLoRA: Efficient Finetuning of Quantized LLMs
  • This paper by Dettmers et al. from UW presents QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. Put simply, QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
  • QLoRA operates based on the following steps:
    • Quantize the pre-trained model to 4 bits and freeze it.
    • Attach small, trainable adapter layers (similar to LoRA).
    • Finetune only the adapter layers while using the frozen quantized model for context.
  • Their best model family, which they name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes.
  • They use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models).
  • Their results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. They provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, they find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT.
  • The figure below from the paper shows different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.

  • To learn more about QLoRA and how it works, the Hugging Face blog post is highly recommended.
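  • The three steps listed above (quantize and freeze, attach adapters, train only the adapters) can be sketched in plain Python. This is a toy under loud assumptions: a uniform 16-level grid stands in for the NF4 data type, each weight row is treated as one quantization block, and double quantization and paged optimizers are not shown. All function names are illustrative.

```python
def quantize_block(weights, levels=16):
    """Absmax-quantize one block to integer codes in [0, levels - 1].

    Uniform grid used here as a simple stand-in for NF4.
    """
    scale = max(abs(w) for w in weights) or 1.0
    half = (levels - 1) / 2
    codes = [round((w / scale + 1.0) * half) for w in weights]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Map integer codes back to approximate float weights."""
    half = (levels - 1) / 2
    return [(c / half - 1.0) * scale for c in codes]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def qlora_forward(x, quantized_rows, lora_a, lora_b, scaling=1.0):
    """y = dequant(W_4bit) x + scaling * B (A x); only A and B would be trained."""
    frozen = [dequantize_block(codes, scale) for codes, scale in quantized_rows]
    base = matvec(frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * u for b, u in zip(base, update)]


W = [[0.5, -1.0, 0.25], [0.1, 0.2, -0.3]]   # frozen pretrained weights (2 x 3)
quantized_rows = [quantize_block(row) for row in W]
lora_a = [[0.1, 0.2, 0.3]]                  # rank-1 adapter, shape (1 x 3)
lora_b = [[0.0], [0.0]]                     # zero-initialized, shape (2 x 1)

x = [1.0, 2.0, 3.0]
y = qlora_forward(x, quantized_rows, lora_a, lora_b)
print(y)
```

  • Because `lora_b` starts at zero, the adapter initially leaves the quantized model’s output unchanged; gradients would flow through the dequantized frozen weights into `lora_a` and `lora_b` only.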
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • Large-scale unsupervised language models (LMs) acquire extensive world knowledge and reasoning skills, but precisely controlling their behavior is challenging due to their unsupervised training nature. Traditionally, methods like Reinforcement Learning from Human Feedback (RLHF), discussed earlier in this article, are used to steer these models, involving two stages: training a reward model based on human preference labels and then fine-tuning the LM to align with these preferences using reinforcement learning (RL). However, RLHF presents complexities and instability issues, necessitating fitting a reward model and then training a policy to optimize this reward, which is prone to stability concerns.
  • This paper by Rafailov et al. from Stanford in 2023 introduces Direct Preference Optimization (DPO), a novel approach that simplifies and enhances this process. DPO leverages a mathematical relationship between optimal policies and reward functions, demonstrating that the constrained reward maximization problem in RLHF can be optimized more effectively with a single stage of policy training. DPO redefines the RLHF objective by showing that the reward can be rewritten purely as a function of policy probabilities, allowing the LM to implicitly define both the policy and the reward function. This innovation eliminates the need for a separate reward model and the complexities of RL.
  • This paper introduces a novel algorithm that gets rid of the two stages of RLHF, namely fitting a reward model and training a policy to optimize the reward via sampling. The second stage is particularly hard to get right due to stability concerns, which DPO sidesteps. The way it works is, given a dataset of the form <prompt, worse completion, better completion>, you train your LLM using a new loss function which essentially encourages it to increase the likelihood of the better completion and decrease the likelihood of the worse completion, weighted by how much higher the implicit reward model rates the worse completion (i.e., by how incorrectly it orders the pair). This method obviates the need for an explicit reward model, as the LLM itself acts as a reward model. The key advantage is that it’s a straightforward loss function optimized using backpropagation.
  • The stability, performance, and computational efficiency of DPO are significant improvements over traditional methods. It eliminates the need for sampling from the LM during fine-tuning, fitting a separate reward model, or extensive hyperparameter tuning.
  • The figure below from the paper illustrates that DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, without an explicit reward function or RL.

  • Experiments demonstrate that DPO can fine-tune LMs to align with human preferences as effectively, if not more so, than traditional RLHF methods. It notably surpasses RLHF in controlling the sentiment of generations and enhances response quality in tasks like summarization and single-turn dialogue. Its implementation and training processes are substantially simpler.
  • In essence, DPO represents a groundbreaking shift in training language models to align with human preferences. It consolidates the two-stage process of RLHF into a single, efficient end-to-end policy learning approach. By reparameterizing the reward function and unifying policy learning and reward modeling into one streamlined optimization process, DPO offers a more efficient and lightweight method for training language models to match human preferences.
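  • The DPO objective for a single preference pair can be written in a few lines of Python. The sketch below assumes the summed log-probabilities of each completion under the policy and under the frozen reference model are already available; the argument names and the default `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(policy_logp_better, policy_logp_worse,
             ref_logp_better, ref_logp_worse, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one pair.

    Each argument is the log-probability of a whole completion given the prompt.
    """
    margin = beta * ((policy_logp_better - ref_logp_better)
                     - (policy_logp_worse - ref_logp_worse))
    # Stable form of log(1 + exp(-margin)), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the implicit rewards tie.
tied = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raises the better completion and lowers the worse one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(tied, improved)
```

  • When the policy equals the reference, the margin is zero and the loss is \(\log 2\); moving probability mass toward the preferred completion lowers it, with no reward model or sampling in the loop.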
Deduplicating Training Data Makes Language Models Better
  • This paper by Lee et al. from Google Brain in ACL 2022 finds that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
  • They develop two tools that allow them to deduplicate training datasets – for example removing from C4 a single 61-word English sentence that is repeated over 60,000 times.
  • Deduplication allows them to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. They can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation.
  • Code.
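  • As a rough illustration of near-duplicate removal, the Python sketch below drops a document when its word n-gram overlap with an already-kept document is too high. It is a simplification and not the paper’s tooling: it computes exact Jaccard similarity pairwise rather than using an approximate matching scheme, it does not cover exact-substring removal, and the `threshold=0.8` and `n=3` values are illustrative assumptions.

```python
def shingles(text, n=3):
    """Set of word n-grams for a document (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Greedily keep documents whose similarity to every kept one is below threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today !",
    "an entirely different sentence about language models",
]
unique_docs = drop_near_duplicates(docs)
print(unique_docs)
```

  • The pairwise comparison here is quadratic in the number of documents, so it only conveys the idea; web-scale corpora need sub-quadratic methods.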
Llama 2: Open Foundation and Fine-Tuned Chat Models
  • Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) from Meta AI ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Their models outperform open-source chat models on most benchmarks they tested, and based on their human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. They provide a detailed description of their approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on their work and contribute to the responsible development of LLMs.
  • Llama 2 is powered by Ghost Attention (GAtt), introduced in the paper, which improves multi-turn memory. From section 3.3 in the technical report:
    • “In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly, or to “act as” some public figure. When we provided such instructions to Llama 2-Chat, the subsequent response should always respect the constraint. However, our initial RLHF models tended to forget the initial instruction after a few turns of dialogue, as illustrated in the figure below (left), which shows that issues with multi-turn memory (left) can be improved with GAtt (right).

    • To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation (Bai et al., 2022) that hacks the fine-tuning data to help the attention focus in a multi-stage process. GAtt enables dialogue control over multiple turns, as illustrated in the figure above (right).
    • GAtt Method: Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages \(\left[u_1, a_1, \ldots, u_n, a_n\right]\), where \(u_n\) and \(a_n\) correspond to the user and assistant messages for turn \(n\), respectively. Then, we define an instruction, inst, that should be respected throughout the dialogue. For example, inst could be “act as.” We can then synthetically concatenate this instruction to all the user messages of the conversation.
    • Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample. To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages.
    • For the training instructions, we created a few synthetic constraints to sample from: Hobbies (“You enjoy e.g. Tennis”), Language (“Speak in e.g. French”), or Public Figure (“Act as e.g. Napoleon”). To obtain the lists of hobbies and public figures, we asked Llama 2-Chat to generate it, avoiding a mismatch between the instruction and model knowledge (e.g., asking the model to act as someone it had not encountered during training). To make the instructions more complex and diverse, we construct the final instruction by randomly combining the above constraints. When constructing the final system message for the training data, we also modify the original instruction half of the time to be less verbose, e.g., “Always act as Napoleon from now”-> “Figure: Napoleon.” These steps produce an SFT dataset, on which we can fine-tune Llama 2-Chat.
    • GAtt Evaluation: We applied GAtt after RLHF V3. We report a quantitative analysis indicating that GAtt is consistent up to 20+ turns, until the maximum context length is reached (see Appendix A.3.5 in the paper). We tried to set constraints not present in the training of GAtt at inference time, for instance “Always answer with Haiku,” for which the model was found to remain consistent.
    • To illustrate how GAtt helped reshape attention during fine-tuning, we display the maximum attention activations of the model in Figure 10. The left-hand side of each figure corresponds to the system message (“Act as Oscar Wilde”). From the figure above, we can see that the GAtt-equipped model (right) maintains large attention activations with respect to the system message for a larger portion of the dialogue, as compared to the model without GAtt (left).
    • Despite its utility, the current implementation of GAtt is vanilla, and more development and iteration on this technique could likely further benefit the model. For instance, we could teach the model to change the system message during the conversation by integrating such data during fine-tuning.”
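  • The data construction described in the quoted passage can be sketched in Python as follows. The sketch assumes the assistant replies were already sampled from a model that saw the instruction on every user turn (that sampling step is not shown), and it masks the loss per message rather than per token for brevity. The function names and message format are illustrative, not Meta’s code.

```python
def add_instruction_to_all_turns(user_messages, inst):
    """Step 1: synthetically concatenate the instruction to every user message."""
    return [f"{inst} {message}" for message in user_messages]


def gatt_training_example(turns, inst):
    """Step 2: keep the instruction on the first turn only and mask earlier turns.

    `turns` is a list of (user_message, sampled_assistant_reply) pairs.
    Returns (messages, loss_mask), where loss_mask[i] == 1 only for the final
    assistant reply, i.e. the loss is zero for everything from previous turns.
    """
    messages, loss_mask = [], []
    last = len(turns) - 1
    for index, (user, assistant) in enumerate(turns):
        user_text = f"{inst} {user}" if index == 0 else user
        messages.append(("user", user_text))
        loss_mask.append(0)
        messages.append(("assistant", assistant))
        loss_mask.append(1 if index == last else 0)
    return messages, loss_mask


inst = "Speak in French."
turns = [("Hi!", "Bonjour !"), ("How are you?", "Très bien, merci.")]

sampling_prompts = add_instruction_to_all_turns([u for u, _ in turns], inst)
messages, loss_mask = gatt_training_example(turns, inst)
print(sampling_prompts)
print(messages)
print(loss_mask)
```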
  • Another important aspect that is highlighted in the report is the effect of RLHF on Llama 2, and this graph from Meta’s paper shows how high-quality human preferences data (obtained from Surge AI) keeps on improving Llama 2 – without saturation.

  • They also call out the importance of supervised fine-tuning (SFT) data quality (in the “quality is all you need” section) – it’s not about volume, but diversity and quality.
  • From Linxi Fan’s notes:
    • Llama-2 likely cost $20M+ to train. Meta has done an incredible service to the community by releasing the model with a commercially-friendly license. AI researchers from big companies were wary of Llama-1 due to licensing issues, but now many of them will jump on the ship and contribute their firepower.
    • Meta’s team did a human study on 4K prompts to evaluate Llama-2’s helpfulness. They use “win rate” as a metric to compare models, in similar spirit as the Vicuna benchmark. 70B model roughly ties with GPT-3.5-0301, and performs noticeably stronger than Falcon, MPT, and Vicuna. These real human ratings should be trusted more than academic benchmarks, because they typically capture the “in-the-wild vibe” better.
    • Llama-2 is not yet at GPT-3.5 level, mainly because of its weak coding abilities. On “HumanEval” (standard coding benchmark), it isn’t nearly as good as StarCoder or many other models specifically designed for coding. That being said, I have little doubt that Llama-2 will improve significantly thanks to its open weights.
    • Meta’s team goes above and beyond on AI safety issues. In fact, almost half of the paper is talking about safety guardrails, red-teaming, and evaluations. A round of applause for such responsible efforts!
    • In prior works, there’s a thorny trade-off between helpfulness and safety. Meta mitigates this by training 2 separate reward models. They aren’t open-source yet, but would be extremely valuable to the community.
    • Llama-2 will dramatically boost multimodal AI and robotics research. These fields need more than just blackbox access to an API.
    • So far, we have to convert the complex sensory signals (video, audio, 3D perception) to text description and then feed to an LLM, which is awkward and leads to huge information loss. It’d be much more effective to graft sensory modules directly on a strong LLM backbone.
    • The whitepaper itself is a masterpiece. Unlike GPT-4’s paper that shared very little info, Llama-2 spelled out the entire recipe, including model details, training stages, hardware, data pipeline, and annotation process. For example, there’s a systematic analysis on the effect of RLHF with nice visualizations. Quote sec 5.1: “We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.”
  • The following figure from the paper shows the training of Llama 2-Chat: This process begins with the pretraining of Llama 2 using publicly available online sources. Following this, they create an initial version of Llama 2-Chat through the application of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.

  • Summary:
    • Llama 2 is available for free (including commercial license).
    • Llama 2 can be accessed via managed services in Azure and AWS.
    • Llama 2 is trained on 2T tokens, with 4 variants trained, ranging from 7B to 70B parameters (the 34B variant was not released).
    • Llama 2 is intended to be used in English, with almost 90% of the pre-training data being in English.
    • The commercial license specifies a number of harmful use cases that violate the license, including spam!
    • Llama 2 is very comparable to ChatGPT 3.5 in most benchmarks other than coding (in particular, it beats ChatGPT in human evaluation on helpfulness: Win 36%; Tie 32%; Loss 32%). Looking at the data mix, the share of coding data is still quite small (classified under the “unknown” language category).
    • Llama 2 outperforms all other open-source models including Falcon and MPT, and has three variants including 7B, 13B, and 70B; the 70B variant achieves top performance across the board.
    • Benchmarks were done both on standardized ones (like MMLU) and head to head competition against other models, including PaLM-2 Bison and ChatGPT 3.5.
    • A large portion of the paper focuses on RLHF improvements and objectives which is super neat.
    • Model toxicity and evaluation is another large focus, including evaluations like red-teaming, similar to those found in the Claude 2 model card. Generally, Llama 2 performed very well, with fewer safety violations than ChatGPT in human evaluations.
    • The tokenizer is the same as Llama 1 which is interesting, but the context length is now 4k, double the original 2k!
    • There’s both a regular and chat variation, as has been the trend in recent papers.
    • Fine-tuned Llama 2 offers better domain-specificity at lower cost, and better guardrails.
    • Llama 2 is trained on 40% more data than Llama 1 and performs well against benchmarks.
    • In short: companies can create their own enterprise “ChatGPT” (without sharing any data with OpenAI).
  • Quantized Llama 2 weights are available for local inference here.

  • The following diagram summarizes the key graphs/tables of the Llama 2 paper:

  • The following infographic (source) presents an overview of Llama 2:

Retentive Network: A Successor to Transformer for Large Language Models
  • This paper by Sun et al. from Microsoft Research and Tsinghua University proposes a foundation architecture called Retentive Network (RetNet) to replace the transformer as the default backbone for language modelling, simultaneously achieving training parallelism, low-cost inference, and good performance.
  • One of the main reasons why NLP research couldn’t progress beyond a particular level with RNNs and LSTMs was that they weren’t parallelizable, which hindered people from developing reasonably large models that could learn long-range dependencies with them. Transformers enabled parallelism, but suffer from quadratic computational complexity. RetNet is a smart “mathematical makeover” for RNNs which makes them parallelizable, thus circumventing their biggest limitation while still enabling linear time complexity.
  • The idea behind RetNet is to combine recurrence and parallelism in a way that is flexible and combines the best of both worlds. To achieve this, the researchers introduced a mechanism called “retention” that can be formulated in both a recurrent and a parallel way (or even both at the same time). They theoretically derive the connection between recurrence and attention. Then they propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent.
  • As a rule of thumb, the parallel representation allows for training parallelism. The recurrent representation enables low-cost \(O(1)\) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. A hybrid form deals with a few exceptions like long sequences.
  • The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks.
  • RetNet shows favorable scaling laws compared to the transformer; they observe better perplexity for sizes north of 2B parameters!
  • They trained RetNet on 512 AMD MI200 GPUs.
  • The following figure from the paper illustrates the fact that RetNet makes the “impossible triangle” possible, which achieves training parallelism, good performance, and low inference cost simultaneously.

  • The following figure from the paper presents the dual form of RetNet. “GN” is short for GroupNorm.

  • The following figure from the paper presents RetNet which achieves low-cost inference (i.e., GPU memory, throughput, and latency), training parallelism, and favorable scaling curves compared with Transformer. Results of inference cost are reported with 8k as input length. Figure 6 shows more results on different sequence lengths. Put simply, RetNet additionally makes inference much more efficient (smaller memory footprint) and lower cost, while keeping the desirable properties of transformers (training parallelism) by offering \(O(1)\) inference, in contrast with \(O(n)\) for transformers!

  • Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. With alluring benefits on basically all fronts – less memory, higher throughput, better scaling, and faster inference – these intriguing properties make RetNet a strong successor to Transformer for large language models.
  • The following table from the paper shows the model comparison from various perspectives. RetNet achieves training parallelization, constant inference cost, linear long-sequence memory complexity, and good performance.
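  • The equivalence between the parallel and recurrent forms of retention can be checked numerically with a small Python sketch. It is a simplified single-head version under stated assumptions: a scalar decay `gamma`, no position-dependent rotation of queries and keys, no scaling, no GroupNorm, and no gating. The “parallel” function is written with loops for readability but computes the decay-masked \((QK^\top \odot D)V\) form, where \(D_{nm} = \gamma^{n-m}\) for \(n \geq m\) and 0 otherwise.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retention_parallel(q, k, v, gamma):
    """O_n = sum over m <= n of gamma^(n - m) * (q_n . k_m) * v_m."""
    outputs = []
    for n in range(len(q)):
        acc = [0.0] * len(v[0])
        for m in range(n + 1):
            weight = gamma ** (n - m) * dot(q[n], k[m])
            acc = [a + weight * x for a, x in zip(acc, v[m])]
        outputs.append(acc)
    return outputs


def retention_recurrent(q, k, v, gamma):
    """S_n = gamma * S_{n-1} + k_n^T v_n;  O_n = q_n S_n (fixed-size state)."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = [[gamma * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
                 for i in range(d_k)]
        outputs.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.7]]
k = [[0.5, 1.0], [-0.4, 0.1], [1.0, 1.0]]
v = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5], [2.0, 0.0, 1.0]]

parallel_out = retention_parallel(q, k, v, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, gamma=0.9)
print(parallel_out)
print(recurrent_out)
```

  • The recurrent form only carries a `d_k × d_v` state from step to step, so its per-token cost does not grow with sequence length, which is the source of the \(O(1)\) inference claim.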

The case for 4-bit precision: k-bit Inference Scaling Laws
  • Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. Put simply, this paper seeks to answer the question: “what’s the optimal number of bits for quantizing transformer weights if you wish to maximize the zero-shot accuracy given a particular budget of total model weight bits?” For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies, with the 4-bit model outperforming the 8-bit model.
  • In this work, they study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance.
  • They run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. They find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size – splitting the parameters into small independently quantized blocks – and the quantization data type being used (e.g., Int vs Float).
  • Overall, their findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
  • There are practical limitations to this method though. Low-bit models with 16-bit inputs might be less latency efficient if such a model is deployed to be used by many users (i.e. bigger batch sizes). Something to keep in mind.
  • The following table from the paper illustrates bit-level scaling laws for mean zero-shot performance across four datasets for 125M to 176B parameter OPT models. Zero-shot performance increases steadily for fixed model bits as they reduce the quantization precision from 16 to 4 bits. At 3-bits, this relationship reverses, making 4-bit precision optimal.
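  • The bit-budget arithmetic and the block-size effect described above can be sketched in a few lines. This is a minimal illustration, assuming symmetric absmax integer quantization; the function names and the toy weight vector are hypothetical, and the per-block scale's own storage cost is ignored.

```python
def total_model_bits(num_params, bits_per_param):
    """Total weight bits: the budget that the scaling laws hold fixed."""
    return num_params * bits_per_param


def quantize_blockwise(weights, bits, block_size):
    """Quantize to signed integers with one absmax scale per block, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        for w in block:
            q = max(-qmax, min(qmax, round(w / scale)))  # integer code
            restored.append(q * scale)
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# One outlier (8.0) dominates the scale when all weights share a single block.
weights = [0.1, -0.2, 0.05, 8.0]
err_one_block = mean_abs_error(weights, quantize_blockwise(weights, bits=4, block_size=4))
err_small_blocks = mean_abs_error(weights, quantize_blockwise(weights, bits=4, block_size=2))
```

  • Here `err_small_blocks` comes out lower than `err_one_block`, because the outlier only degrades the block it sits in. Likewise, `total_model_bits(30 * 10**9, 8)` equals `total_model_bits(60 * 10**9, 4)`, which is the comparison the paper's question is built on.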

DeBERTa: Decoding-enhanced BERT with Disentangled Attention
  • Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks.
  • This paper by He et al. from Microsoft in ICLR 2021 proposes a new model architecture Decoding-enhanced BERT with disentangled attention (DeBERTa) that improves the BERT and RoBERTa models using two novel techniques.
  • The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively.
  • Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training.
  • In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization.
  • They show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
  • Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, they scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
  • The following infographic (source) shows that, interestingly, DeBERTa-1.5B (an encoder-only model) beats Llama 2 on BoolQ, which is a nice example of encoders still outperforming large decoders on classification tasks. For fairness: the DeBERTa-1.5B model was likely finetuned on the training data, whereas Llama 2 was used via few-shot prompting. In that case, it highlights once more that finetuning custom LLMs remains worthwhile.
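  • As a rough illustration of the disentangled attention mechanism described above, the sketch below computes one unnormalized attention score as the sum of content-to-content, content-to-position and position-to-content terms, with relative distances clipped into 2k buckets. It follows the formulation as we recall it from the paper; the function names and toy matrices are hypothetical, and the softmax and value projection are left out.

```python
import math


def rel_bucket(i, j, k):
    """Clipped relative distance from token i to token j, mapped into [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def disentangled_score(i, j, Qc, Kc, Qr, Kr, k):
    """Score for query token i and key token j before the softmax.

    Qc/Kc hold projected content vectors per token; Qr/Kr hold projected
    relative-position vectors per distance bucket (2k rows each).
    """
    d = len(Qc[0])
    content_to_content = dot(Qc[i], Kc[j])
    content_to_position = dot(Qc[i], Kr[rel_bucket(i, j, k)])
    position_to_content = dot(Kc[j], Qr[rel_bucket(j, i, k)])
    return (content_to_content + content_to_position + position_to_content) / math.sqrt(3 * d)


# Toy example: 2 tokens, dimension 2, k = 2 (so 4 relative-position buckets).
Qc = [[1.0, 0.0], [0.0, 1.0]]
Kc = [[1.0, 0.0], [0.0, 1.0]]
Kr = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # only bucket 2 (distance 0) is non-zero
Qr = [[0.0, 0.0] for _ in range(4)]
score_00 = disentangled_score(0, 0, Qc, Kc, Qr, Kr, k=2)
```

  • Note that the position-to-content term uses the bucket for (j, i) rather than (i, j), since it describes where the query sits relative to the key.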

UL2: Unifying Language Learning Paradigms
  • Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups.
  • This paper by Tay et al. from Google Brain begins by disentangling architectural archetypes from pre-training objectives, two concepts that are commonly conflated. Next, they present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. They then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. They furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.
  • They conduct extensive ablative experiments to compare multiple pre-training objectives and find that their method pushes the Pareto frontier by outperforming T5 and GPT-like models across multiple diverse setups. By scaling their model up to 20B parameters, they achieve SOTA performance on 50 well-established supervised finetuning based NLP tasks. Their model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models.
  • UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, they apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B.
  • They release Flax-based T5X checkpoints for the UL2 20B & Flan-UL2 20B.
  • The following table from the paper illustrates an overview of UL2 pretraining paradigm. UL2 proposes a new pretraining objective that works well on a diverse suite of downstream tasks.

  • The following table from the paper illustrates the mixture of denoisers for training UL2. Greyed out rectangles are masked tokens that are shifted to ‘targets’ for prediction.

Graph of Thoughts: Solving Elaborate Problems with Large Language Models
  • Similar to Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models, this paper by Besta et al. from ETH Zurich, Cledar, and Warsaw University of Technology introduces Graph of Thoughts (GoT), a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought (CoT) or Tree of Thoughts (ToT).
  • The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information (“LLM thoughts”) are vertices, and edges correspond to dependencies between these vertices.
  • This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops.
  • They illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%.
  • They ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
  • The following table from the paper shows a comparison of Graph of Thoughts (GoT) with other prompting strategies.

Accelerating Large Language Model Decoding with Speculative Sampling
  • This paper by Chen et al. from Google DeepMind presents speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call.
  • Their algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics.
  • They benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
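  • The modified rejection sampling step can be sketched for a single drafted token. This is a minimal illustration, assuming the draft and target distributions over the vocabulary are already available as plain lists; the function name and the toy distributions are hypothetical. The full algorithm drafts several tokens, scores them in one target-model call, and applies this test to each in order until the first rejection.

```python
import random


def speculative_step(draft_probs, target_probs, rng):
    """Propose a token from the draft model, then accept it or resample.

    Returns (token, accepted). The returned token is distributed according
    to target_probs, whatever draft_probs is.
    """
    vocab = range(len(draft_probs))
    token = rng.choices(vocab, weights=draft_probs)[0]
    accept_prob = min(1.0, target_probs[token] / draft_probs[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the normalized positive part of (target - draft).
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    return rng.choices(vocab, weights=residual)[0], False


# Empirical check on a toy 3-token vocabulary.
rng = random.Random(0)
draft = [0.7, 0.2, 0.1]
target = [0.3, 0.4, 0.3]
num_samples = 20000
counts = [0, 0, 0]
for _ in range(num_samples):
    token, _accepted = speculative_step(draft, target, rng)
    counts[token] += 1
freqs = [c / num_samples for c in counts]
```

  • The empirical frequencies in `freqs` land close to `target` even though every proposal comes from `draft`, which is the property that lets the method speed up decoding without changing the output distribution.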
Pretraining Language Models with Human Preferences
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Korbak et al. from University of Sussex, New York University, FAR AI, Northeastern University and Anthropic in ICML 2023 explores objective functions that incorporate human preferences when pre-training language models.
  • Assuming access to a reward function that assigns scores to document segments, on top of maximum likelihood estimation, the authors explore the following PHF (pre-training with human preferences) objective functions: maximum likelihood estimation with filtering (Solaiman & Dennison, 2021, Wang et al., 2022); conditional training (Ficler & Goldberg, 2017, Fan et al., 2018, Keskar et al., 2019); unlikelihood (Welleck et al., 2020); reward-weighted regression (Peters & Schaal, 2007); advantage-weighted regression (Peng et al., 2019).
  • The authors studied the chosen objective functions through three perspectives: (i) avoiding toxic content, (ii) avoiding leaking personally identifiable information, and (iii) generating code that aligns with user intent.
  • The objective functions in question were evaluated from the alignment and capability perspectives through misalignment scores and KL divergence from the GPT-3 model (Brown et al., 2020), respectively. It was observed that among the PHF objective functions investigated, conditional training achieved the best balance between alignment and utility.
  • The authors also evaluated robustness to adversarial prompts for the models pre-trained with the objective functions in question. It was observed from the misalignment scores that conditional training and filtering are most robust to adversarial prompts overall.
  • The following table from the paper illustrates the toxicity score (lower is better) of LMs pretrained with the standard objective (solid blue), using conditional training (solid orange) and LMs finetuned using conditional training for 1.6B (orange dashed) and 330M tokens (orange dotted). Pretraining with Human Feedback (PHF) reduces the amount of offensive content much more effectively than finetuning with human feedback.
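  • Conditional training, the objective that came out ahead above, can be sketched as a data-preparation step: each segment is prefixed with a control token chosen by its reward, and the model is then trained with the usual language-modeling loss. The token strings, the toy reward function and the threshold below are assumptions made for this sketch, not details taken from the summary above.

```python
GOOD, BAD = "<|good|>", "<|bad|>"  # control tokens (names assumed for this sketch)


def toy_reward(segment):
    """Hypothetical reward: minus one per blocklisted word (a real setup would use a classifier)."""
    blocklist = {"idiot", "stupid"}
    return -sum(word.strip(".,!?").lower() in blocklist for word in segment.split())


def tag_for_conditional_training(segments, reward_fn, threshold):
    """Prepend a control token to every segment according to its reward."""
    return [(GOOD if reward_fn(s) >= threshold else BAD) + s for s in segments]


corpus = ["Thanks for the helpful answer.", "You are an idiot!"]
tagged = tag_for_conditional_training(corpus, toy_reward, threshold=0)
```

  • At inference time the prompt would start with the `GOOD` token, so the model is conditioned on the high-reward part of what it learned while still having seen the whole corpus.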

Large Language Models as Optimizers
  • Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications.
  • This paper by Yang et al. from Google DeepMind proposes Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step.
  • They first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy.
  • The following table from the paper illustrates an overview of the OPRO framework. Given the meta-prompt as the input, the LLM generates new solutions to the objective function, then the new solutions and their scores are added into the meta-prompt for the next optimization step. The meta-prompt contains the solution-score pairs obtained throughout the optimization process, as well as a natural language description of the task and (in prompt optimization) a few exemplars from the task. See the below figure from the paper for a sample meta-prompt for prompt optimization.

  • The following table from the paper illustrates an example of the meta-prompt for prompt optimization with instruction-tuned PaLM 2-L (PaLM 2-L-IT) on GSM8K, where the generated instruction will be prepended to the beginning of “A:” in the scorer LLM output (A_begin in Section 4.1). <INS> denotes the position where the generated instruction will be added. The blue text contains solution-score pairs; the purple text describes the optimization task and output format; the orange text are meta-instructions.

  • With a variety of LLMs, they demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks.
  • In terms of limitations, OPRO is designed neither to outperform state-of-the-art gradient-based optimization algorithms for continuous mathematical optimization, nor to surpass specialized solvers for classical combinatorial optimization problems such as TSP. Instead, the goal is to demonstrate that LLMs are able to optimize different kinds of objective functions simply through prompting, and reach the global optimum for some small-scale problems. Their evaluation reveals several limitations of OPRO for mathematical optimization. Specifically, the length limit of the LLM context window makes it hard to fit large-scale optimization problem descriptions in the prompt, e.g., linear regression with high-dimensional data, and traveling salesman problems with a large set of nodes to visit. In addition, the optimization landscape of some objective functions is too bumpy for the LLM to propose a correct descending direction, causing the optimization to get stuck halfway.
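  • The optimization loop described above can be sketched with a stand-in for the LLM. This is a minimal sketch under stated assumptions: `make_toy_proposer` is a hypothetical stub that perturbs the best solution listed in the meta-prompt, where a real system would send the meta-prompt to a model; the prompt format, `top_k`, and the 1-D objective are also made up for illustration.

```python
import random


def build_meta_prompt(task_description, scored, top_k=5):
    """Task description followed by the best solution-score pairs, best one last."""
    best = sorted(scored, key=lambda pair: pair[1])[-top_k:]
    lines = [task_description] + [f"solution: {s}  score: {v:.3f}" for s, v in best]
    return "\n".join(lines)


def opro(objective, propose, task_description, init, steps=20, per_step=4):
    """Alternate between proposing solutions from the meta-prompt and scoring them."""
    scored = [(s, objective(s)) for s in init]
    for _ in range(steps):
        prompt = build_meta_prompt(task_description, scored)
        for candidate in propose(prompt, per_step):
            scored.append((candidate, objective(candidate)))
    return max(scored, key=lambda pair: pair[1])


def make_toy_proposer(rng):
    """Stand-in for the LLM call: reads the last (best) solution and perturbs it."""
    def propose(prompt, n):
        best = float(prompt.splitlines()[-1].split()[1])
        return [round(best + rng.uniform(-1, 1), 3) for _ in range(n)]
    return propose


def objective(x):
    return -(x - 3.0) ** 2  # toy objective, maximized at x = 3


best_x, best_score = opro(objective, make_toy_proposer(random.Random(0)),
                          "Maximize the score by choosing x.", init=[0.0])
```

  • Swapping the stub for a real model call leaves the loop unchanged, which is the appeal of the approach: the optimizer only ever sees text.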
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  • The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators.
  • This paper by Liu et al. presents G-Eval, a framework that uses large language models with chain-of-thought (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
  • The following table from the paper illustrates the overall framework of G-Eval. They first input the Task Introduction and Evaluation Criteria to the LLM, and ask it to generate a CoT of detailed Evaluation Steps. Then they use the prompt along with the generated CoT to evaluate the NLG outputs in a form-filling paradigm. Finally, they use the probability-weighted summation of the output scores as the final score.

  • They experiment with two generation tasks, text summarization and dialogue generation. They show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
  • They also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts.
  • Code.
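  • The probability-weighted summation used for the final G-Eval score can be written out directly. This is a minimal sketch, assuming the probabilities of each candidate rating have already been obtained from the LLM; the function name and the example numbers are hypothetical, and the renormalization is an addition that guards against probabilities not summing to one.

```python
def weighted_score(score_probs):
    """Expected rating: sum of rating * probability over all candidate ratings."""
    total = sum(score_probs.values())
    return sum(rating * p for rating, p in score_probs.items()) / total


# Hypothetical probabilities for ratings 1 to 5 on one evaluation criterion.
probs = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.2}
final_score = weighted_score(probs)
```

  • Compared with taking the single most likely rating (4 in this example), the weighted sum yields a finer-grained, continuous score, which helps when many outputs would otherwise tie on the same integer.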
Chain-of-Verification Reduces Hallucination in Large Language Models
  • Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models.
  • This paper by Dhuliawala et al. from Meta AI and ETH Zurich studies the ability of language models to deliberate on the responses they give in order to correct their mistakes.
  • They develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response.
  • The following table from the paper illustrates the Chain-of-Verification (CoVe) method. Given a user query, a large language model generates a baseline response that may contain inaccuracies, e.g. factual hallucinations. The query shown is one that failed for ChatGPT (see Section 9 of the paper for more details). To improve this, CoVe first generates a plan of a set of verification questions to ask, and then executes that plan by answering them and hence checking for agreement. They find that individual verification questions are typically answered with higher accuracy than the original accuracy of the facts in the original longform generation. Finally, the revised response takes into account the verifications. The factored version of CoVe answers verification questions such that they cannot condition on the original response, avoiding repetition and improving performance.

  • Via experiments, they show that CoVe decreases hallucinations across a variety of tasks, including list-based questions from Wikidata, closed-book MultiSpanQA, and longform text generation.
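  • The four CoVe steps, in the factored variant where verification questions are answered without seeing the draft, can be sketched as a small pipeline. The prompts, the function name and `fake_llm` are all hypothetical; `fake_llm` is a canned stand-in that only records the prompts it receives, where a real setup would call a model.

```python
def chain_of_verification(query, llm):
    """Run the four CoVe steps with `llm`, any callable mapping a prompt string to a reply string."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQ: {query}")
    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(f"Write verification questions, one per line, for this answer.\nQ: {query}\nA: {draft}")
    questions = [line.strip() for line in plan.splitlines() if line.strip()]
    # (iii) Answer each question in a fresh prompt that does not include the draft.
    answers = [llm(question) for question in questions]
    # (iv) Produce the final response, conditioned on the verification results.
    facts = "\n".join(f"{q} {a}" for q, a in zip(questions, answers))
    return llm(f"Revise the draft using the verified facts.\nQ: {query}\nDraft: {draft}\nFacts:\n{facts}")


calls = []


def fake_llm(prompt):
    """Stand-in for a real model call; it records prompts and returns canned text."""
    calls.append(prompt)
    if prompt.startswith("Answer"):
        return "DRAFT"
    if prompt.startswith("Write verification"):
        return "Question one?\nQuestion two?"
    if prompt.startswith("Revise"):
        return "FINAL"
    return "a short answer"


result = chain_of_verification("Name some politicians born in New York.", fake_llm)
```

  • The recorded prompts in `calls` show that the two verification questions are sent on their own, without the draft text, which is what keeps their answers from repeating the draft's mistakes.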
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
  • This paper by Chen et al. from CUHK and MIT presents LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) during fine-tuning, with limited computation cost.
  • Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 requires 16x the computational cost in self-attention layers compared with a context length of 2048.
  • LongLoRA speeds up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention (\(S^2\)-attention) effectively enables context extension, leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference.
  • \(S^2\)-attention splits the context into groups and only attends within each group. Tokens are shifted between groups in different heads to enable information flow. This approximates full attention but is much more efficient.
  • On the other hand, they revisit the parameter-efficient fine-tuning regime for context expansion. Notably, they find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on Llama 2 models from 7B/13B to 70B.
  • LongLoRA extends Llama 2 7B from 4k context to 100k, or Llama 2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models’ context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2.
  • In addition, to make LongLoRA practical, they collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.
  • The following table from the paper illustrates an overview of LongLoRA designs. LongLoRA introduces shift short attention during finetuning. The trained model can retain its original standard self-attention during inference. In addition to plain LoRA weights, LongLoRA additionally makes embedding and normalization layers trainable, which is essential to long context learning, but takes up only a small proportion of parameters.

  • The following table from the paper shows a performance and efficiency comparison between full fine-tuning, plain LoRA, and LongLoRA. They fine-tune Llama 2 7B on various context lengths, with FlashAttention-2 and DeepSpeed stage 2. Perplexity is evaluated on the Proof-pile test set. The plain LoRA baseline uses limited GPU memory, but its perplexity gets worse as the context length increases. LongLoRA achieves comparable performance to full fine-tuning at a much lower computational cost.
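  • The grouping and shifting in shift short attention can be illustrated on token indices alone. This is a minimal sketch, assuming the sequence length is a multiple of the group size and that half of the heads use the shifted pattern (a shift of half a group); the function names are hypothetical and no actual attention is computed.

```python
def s2_groups(seq_len, group_size, shifted):
    """Token-index groups that one attention head attends within."""
    positions = list(range(seq_len))
    if shifted:
        half = group_size // 2
        positions = positions[half:] + positions[:half]  # roll tokens by half a group
    return [positions[i:i + group_size] for i in range(0, seq_len, group_size)]


def attention_pairs(seq_len, group_size=None):
    """Query-key pairs scored per head: quadratic for full attention, linear in seq_len for grouped."""
    if group_size is None:
        return seq_len * seq_len
    return seq_len * group_size  # each token attends only within its own group


plain = s2_groups(8, 4, shifted=False)   # pattern for the unshifted half of the heads
shift = s2_groups(8, 4, shifted=True)    # pattern for the shifted half of the heads
cost_ratio = attention_pairs(8192) / attention_pairs(2048)
```

  • With 8 tokens and groups of 4, the unshifted heads see `[0, 1, 2, 3]` and `[4, 5, 6, 7]`, while the shifted heads see `[2, 3, 4, 5]` and `[6, 7, 0, 1]`, so information can cross group boundaries once the heads are combined. `cost_ratio` reproduces the 16x figure quoted above for full attention.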

Mass-Editing Memory in a Transformer
  • Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations.
  • This paper, published at ICLR 2023, develops MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of associations for GPT-J (6B) and GPT-NeoX (20B), exceeding prior work by orders of magnitude.
  • The following table from the paper illustrates the fact that MEMIT modifies transformer parameters on the critical path of MLP-mediated factual recall. We edit stored associations based on observed patterns of causal mediation: (a) first, the early-layer attention modules gather subject names into vector representations at the last subject token \(S\). (b) Then MLPs at layers \(l \in \mathcal{R}\) read these encodings and add memories to the residual stream. (c) Those hidden states are read by attention to produce the output. (d) MEMIT edits memories by storing vector associations in the critical MLPs.

  • The following table from the paper shows the MEMIT update process. They first (i) replace \(h_i^l\) with the vector \(z_i\) and optimize Eqn. 16 in the paper so that it conveys the new memory. Then, after all \(z_i\) are calculated, they (ii) iteratively insert a fraction of the residuals for all \(z_i\) over the range of critical MLP modules, executing each layer’s update by applying Eqn. 14 in the paper. Because changing one layer will affect activations of downstream modules, they recollect activations after each iteration.

MTEB: Massive Text Embedding Benchmark
  • Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation.
  • To solve this problem, Muennighoff et al. from Hugging Face introduce the Massive Text Embedding Benchmark (MTEB) Leaderboard. MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
  • Through the benchmarking of 33 models on MTEB, they establish the most comprehensive benchmark of text embeddings to date. The following figure from the paper shows an overview of tasks and datasets in MTEB. Multilingual datasets are marked with a purple shade.

  • They find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.
Language Modeling Is Compression
  • It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors.
  • This paper by Delétang et al. from DeepMind, Meta AI, and Inria advocates for viewing the prediction problem through the lens of compression and evaluates the compression capabilities of large (foundation) models.
  • They empirically investigate the lossless compression capabilities of foundation models. To that end, they review how to compress with predictive models via arithmetic coding and call attention to the connection between current language modeling research and compression.
  • The following figure from the paper shows arithmetic encoding of the sequence ‘AIXI’ with a probabilistic (language) model \(P\) (both in blue) resulting in the binary code ‘0101001’ (in green). Arithmetic coding compresses data by assigning unique intervals to symbols based on the probabilities assigned by \(P\). It progressively refines these intervals to output compressed bits, which represent the original message. To decode, arithmetic coding initializes an interval based on the received compressed bits. It iteratively matches intervals with symbols using the probabilities given by \(P\) to reconstruct the original message.

  • They show that foundation models, trained primarily on text, are general-purpose compressors due to their in-context learning abilities. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. In other words, large language models are powerful general-purpose predictors, and the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. Specifically, they provide a novel view on scaling laws, showing that the dataset size provides a hard limit on model size in terms of compression performance and that scaling is not a silver bullet. They also demonstrate that tokenization, which can be viewed as a pre-compression, does not, in general, improve compression performance, but allows models to increase the information content in their context and is thus generally employed to improve prediction performance.
  • They leverage the compression-prediction equivalence to employ compressors as generative models and visually illustrate the performance of the underlying compressor.
  • Finally, they show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
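  • The link between prediction and compression can be made concrete by computing the ideal code length of a sequence under a predictive model, i.e. the sum of \(-\log_2 P(x_t \mid x_{<t})\), which arithmetic coding achieves to within a couple of bits. The sketch below uses a deliberately simple stand-in for the language model: an adaptive byte-frequency predictor with add-one smoothing. The function name and the choice of predictor are assumptions for illustration only.

```python
import math
from collections import Counter


def adaptive_code_length_bits(data: bytes) -> float:
    """Ideal code length, in bits, of `data` under an adaptive order-0 byte model."""
    counts = Counter()
    seen = 0
    bits = 0.0
    for byte in data:
        p = (counts[byte] + 1) / (seen + 256)  # add-one smoothed prediction for this byte
        bits += -math.log2(p)                  # better predictions cost fewer bits
        counts[byte] += 1
        seen += 1
    return bits


predictable = adaptive_code_length_bits(b"a" * 100)
unpredictable = adaptive_code_length_bits(bytes(range(100)))
```

  • The repetitive input needs far fewer than the 800 bits of its raw encoding, while the input the model cannot predict needs more. A stronger predictor, such as a large language model, shortens the code in exactly the same way.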
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
  • Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules.
  • This paper by Manakul et al. from Cambridge in EMNLP 2023 proposes “SelfCheckGPT”, a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database.
  • SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses (i.e., token sampling methods such as top-p/top-k sampling or beam search, adjusting the softmax temperature, etc.) are likely to diverge and contradict one another.
  • The following figure from the paper illustrates SelfCheckGPT with Prompt. Each LLM-generated sentence is compared against stochastically generated responses with no external database. A comparison method can be, for example, through LLM prompting as shown above.

  • They investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages.
  • They demonstrate that SelfCheckGPT can: (i) detect non-factual and factual sentences; and (ii) rank passages in terms of factuality.
  • They compare SelfCheckGPT to several baselines and show that their approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.
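  • The core consistency check can be sketched as follows: every sentence of the main response is compared against several stochastically sampled responses, and the fraction of samples that fail to support it serves as a hallucination score. The `supports` function below is a crude word-overlap stand-in chosen for this sketch; the paper's comparison methods (for example LLM prompting) would take its place, and the example sentences are invented.

```python
def word_overlap_supports(sentence, sample):
    """Hypothetical stand-in check: every word of the sentence appears in the sample."""
    return set(sentence.lower().split()) <= set(sample.lower().split())


def selfcheck_scores(sentences, samples, supports):
    """One inconsistency score per sentence in [0, 1]; higher means less supported by the samples."""
    return [
        sum(0.0 if supports(sentence, sample) else 1.0 for sample in samples) / len(samples)
        for sentence in sentences
    ]


sentences = ["alice was born in 1950", "alice won a nobel prize"]
samples = [
    "alice was born in 1950 in leeds",
    "alice was born in 1950 and studied physics",
    "alice was born in 1950 and taught chemistry",
]
scores = selfcheck_scores(sentences, samples, word_overlap_supports)
```

  • The first sentence is backed by every sample and scores 0.0, while the second appears in none and scores 1.0. Averaging the sentence scores would give a passage-level score that can be used for ranking.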
Zephyr: Direct Distillation of LM Alignment
  • This paper by Tunstall et al. from Hugging Face introduces a technique termed distilled direct preference optimization (dDPO), designed to align a small language model (LM) to user intent via distillation, eliminating the need for human feedback. Furthermore, the study presents a 7B parameter language model named Zephyr, which is specifically tailored to align with user intent. Their approach has three main steps:
    1. Distilled Supervised Fine-Tuning (dSFT): They first fine-tune the base 7B Mistral model using the UltraChat dataset, which contains 1.4M dialogues generated by having a large proprietary teacher model like GPT-3.5 Turbo converse with itself. This provides a strong initialization for the student model.
    2. AI Feedback (AIF) Collection: An ensemble of diverse chat models (e.g., Claude, Falcon) is used to generate responses to prompts from the UltraFeedback dataset. These responses are then scored by a powerful teacher model like GPT-4. The top-scoring response is taken as the “chosen” response and one random lower-scoring response as the “rejected” response. This provides training pairs of good vs. bad responses.
    3. Distilled Direct Preference Optimization (dDPO): The dSFT model is further optimized by training it to rank the “chosen” responses higher than “rejected” responses from the AIF collection step. This is done by directly optimizing a preference likelihood objective on the static AIF data without needing to sample from the model during training.
  • They apply this approach to train Zephyr-7B, starting from Mistral-7B: first dSFT using UltraChat (1.4M examples from GPT-3.5), then AIF from UltraFeedback (64K prompts ranked by GPT-4), and finally dDPO.
  • Results:
    • Zephyr-7B sets a new SOTA for alignment and conversational ability compared to other 7B models on MT-Bench (7.34 score) and AlpacaEval (90.6% win rate), surpassing prior best dSFT and PPO distillation methods.
    • It matches (and in some cases, even outperforms) the performance of 70B RLHF models like LLaMA2 on MT-Bench.
    • Ablations show dSFT is necessary before dDPO, and overfitting dDPO can still improve performance.
  • The key technical innovation is direct distillation of preferences without human involvement, through dSFT then dDPO, achieving strong alignment for small 7B models.
  • Key advantages are that it requires no human labeling or feedback, scales easily to larger models, and can be trained in just a few hours on commercially available hardware. Limitations are potential biases inherited from the teacher models and lack of safety considerations. Overall, it demonstrates the surprising efficacy of distillation and preference learning for aligning smaller open models.
  • The image below (source) gives a graphical sense of Zephyr’s performance on tasks as compared with other prevalent LLMs.

  1. Start with the strongest pretrained model you can find: Mistral 7B is by far the strongest 7B pretrained model.
  2. Scale human-preference annotations: Several studies have shown that for many tasks GPT-4 is on par with average human annotators while making scalable annotation as easy as an API call. The Hugging Face H4 team started from the largest and most diverse public GPT-4 preference annotation dataset: UltraFeedback.
  3. Drop Reinforcement Learning in favor of DPO (Direct Preference Optimization): While using RL with LLMs is definitely much easier than getting deep RL to work from scratch, DPO removes RL from preference training entirely and directly optimizes the model on preference data, which gave a much more stable training procedure in the H4 team’s experiments.
  4. Don’t be scared of overfitting on the preference dataset: This is perhaps the most counter-intuitive result of the work. While the train/test loss of DPO training shows signs of overfitting on the feedback dataset after just one epoch, training further still shows significant improvements on downstream tasks, even up to 3 epochs, without signs of performance regression.
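  • The pair construction from the AIF step and the preference objective from the dDPO step can be sketched together. This is a minimal sketch using the standard DPO loss on summed log-probabilities, with the dSFT model assumed to act as the frozen reference; the function names, `beta=0.1`, and the example numbers are assumptions for illustration, not values from the paper.

```python
import math
import random


def make_preference_pair(scored_responses, rng):
    """Pick the top-scored response as chosen and one random other response as rejected."""
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    chosen = ranked[0][0]
    rejected = rng.choice(ranked[1:])[0]
    return chosen, rejected


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio minus rejected log-ratio)) for one pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Hypothetical teacher scores for four responses to one prompt.
responses = [("response A", 9.0), ("response B", 6.5), ("response C", 4.0), ("response D", 7.0)]
chosen, rejected = make_preference_pair(responses, random.Random(0))

loss_at_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy identical to the reference
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy now prefers the chosen response
```

  • The loss starts at log 2 when the policy equals the reference and falls as the policy raises the chosen response's likelihood relative to the rejected one. Because it needs only log-probabilities on a fixed dataset, no sampling from the model is required during training.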
Hugging Face’s Alignment Handbook
  • The Alignment Handbook contains robust recipes to align language models with human and AI preferences. It also contains code to train your very own Zephyr models:
    • Full fine-tuning with Microsoft’s DeepSpeed ZeRO-3 on A100s
    • LoRA or QLoRA fine-tuning on consumer GPUs

  • No Robots is a dataset from Hugging Face of 10k instructions and demonstrations for training instruct models. It is based on the SFT dataset from OpenAI’s InstructGPT paper and is 100% organic, written entirely by skilled human annotators.
Evaluating Large Language Models: A Comprehensive Survey
  • This paper by Guo et al. from Tianjin University offers a comprehensive survey providing an in-depth analysis of evaluating large language models (LLMs).
  • The paper categorizes LLM evaluation into three key domains: knowledge and capability evaluation, alignment evaluation, and safety evaluation, addressing the need for rigorous assessment across various tasks and applications.
  • The following figure from the paper illustrates the proposed taxonomy of major categories and sub-categories of LLM evaluation.

  • In-depth exploration of knowledge and capability evaluation includes question answering, knowledge completion, reasoning, and tool learning, highlighting LLMs’ growing sophistication in handling diverse information processing tasks.
  • Alignment evaluation focuses on ethics, bias, toxicity, and truthfulness, critical for ensuring LLM outputs align with societal values and user expectations.
  • Safety evaluation examines robustness and risks associated with LLM deployment, emphasizing the importance of secure and reliable model performance in real-world applications.
  • The survey also covers specialized evaluations in fields like biology, medicine, education, legislation, computer science, and finance, demonstrating the broad applicability and impact of LLMs.
  • Future directions suggest enhanced evaluation methods, including dynamic, agent-oriented, and risk-focused assessments, to guide responsible LLM development and maximize societal benefits.
Tamil-LLaMA: A New Tamil Language Model Based on LLaMA 2
  • This paper by Abhinand Balachandran introduces Tamil LLaMA, an enhancement of the open-source LLaMA model, tailored for Tamil language processing by incorporating 16,000 additional Tamil tokens.
  • The model, trained on an expansive Tamil corpus using the LoRA methodology, shows marked improvement in Tamil text generation and comprehension, addressing the under-representation of Tamil in large language models.
  • Key contributions include the expansion of LLaMA’s vocabulary with 16,000 Tamil tokens, training on a comprehensive Tamil dataset, and presenting Tamil-translated versions of Alpaca and OpenOrca datasets for instruction fine-tuning.
  • Tamil LLaMA outperforms its predecessors and other open-source models in tasks specific to the Tamil language, demonstrating significant advancements in performance.
  • Performance comparison on the IndicSentiment-7B dataset (left) and the IndicGLUE Text Classification (right).

  • The paper emphasizes the importance of language diversity in LLMs and contributes to advancing language models for Indian languages, with public access to models, datasets, and code to foster further research.
  • The table below shows a list of available models:

Think before you speak: Training Language Models With Pause Tokens
  • This paper by Goyal et al. from Carnegie Mellon University and Google Research introduces a novel training method for language models using pause tokens.
  • The concept involves appending learnable pause tokens to the input during both pretraining and downstream finetuning, allowing the model additional computation time before generating responses.
  • The following figure from the paper illustrates standard vs. pause-inference (and finetuning). We consider a downstream task where, given a prefix, the decoder-only model (bidirectionally) attends to all of the prefix to generate its target answer. The rounded squares denote one Transformer operation (a self-attention and MLP) in a 2-layer Transformer. Any “Ignore Output” denotes that during inference, the corresponding output token is not extracted and thus, not fed back autoregressively; during finetuning, this output is not backpropagated through. The connecting lines denote some (not all) of the “computational pathways” within the model. Specifically, we visualize only those pathways that begin at a specific token in the prefix (here arbitrarily chosen to be “4 is”) and end at an output token (here arbitrarily chosen to be “25+”). All differences between the two settings are highlighted in color. (a) In standard inference (finetuning), the model’s output is extracted immediately upon seeing the last prefix token. (b) In pause-inference (and pause-finetuning), this is initiated only after appending a manually specified number of <pause> tokens. This introduces new computational pathways (the colored lines) between the prefix token and the output token of interest.

  • The following figure from the paper illustrates standard vs. pause-pretraining. We consider pretraining based on causal language modeling, where each token is predicted given all preceding tokens in the sequence, using unidirectional self-attention. Here, we visualize the computational pathways beginning from the token “is” on the input side of the decoder-only model, to a subsequent token “soccer” on the output side. Please see the above figure for a guide on how to follow this visualization. (a) In standard pretraining, we compute the model’s loss at each output token, and backpropagate through it. (b) In pause-pretraining, we insert multiple copies of <pause> tokens at uniformly random locations in the input. However, we do not apply a loss on the model to predict these tokens, as indicated by the corresponding “Ignore Output” flags. This introduces new computational pathways connecting the input token and the output token of interest.

  • This method demonstrates significant improvements in various tasks, notably an 18% increase in Exact Match score on the SQuAD question-answering task and 8% on CommonSenseQA.
  • The paper reveals that the gains are most pronounced when pause tokens are used during both pretraining and finetuning, with lesser improvements observed when used only during finetuning.
  • The approach alters the traditional immediate next-token prediction in language models, introducing a new paradigm – delayed next-token prediction – that offers enhanced performance on complex language tasks.
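  • The pause-finetuning setup described above can be sketched as a small data-preparation step. This is an illustrative sketch only: the helper names, the `"<pause>"` string, and the 0/1 loss mask convention are assumptions, not the authors’ code.

```python
PAUSE = "<pause>"  # stand-in for the learnable pause token

def pause_inference_input(prefix_tokens, num_pauses):
    """Append pause tokens to the prefix; the answer is only read out
    after the last pause token has been processed."""
    return list(prefix_tokens) + [PAUSE] * num_pauses

def pause_finetuning_example(prefix_tokens, target_tokens, num_pauses):
    """Return (tokens, loss_mask) for one finetuning example.

    loss_mask[i] == 1 means token i counts as a prediction target.
    Prefix and pause positions are masked out, so the model is never
    trained to predict a pause token.
    """
    tokens = pause_inference_input(prefix_tokens, num_pauses) + list(target_tokens)
    loss_mask = [0] * (len(prefix_tokens) + num_pauses) + [1] * len(target_tokens)
    return tokens, loss_mask
```

  • The number of appended pause tokens is a manually chosen hyperparameter, as noted in the figure description above.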
YaRN: Efficient Context Window Extension of Large Language Models
  • This paper by Peng et al. from Nous Research, EleutherAI, and the University of Geneva, proposes Yet Another RoPE extensioN method (YaRN) to efficiently extend the context window of transformer-based language models using Rotary Position Embeddings (RoPE).
  • The authors address the limitation of transformer-based language models, specifically their inability to generalize beyond the sequence length they were trained on. YaRN demonstrates a compute-efficient way to extend the context window of such models, requiring significantly fewer tokens and training steps compared to previous methods.
  • YaRN enables LLaMA models to effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow. This method surpasses previous state-of-the-art approaches in context window extension.
  • The paper details various technical aspects of YaRN, including its capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been reproduced online, supporting context lengths up to 128k.
  • YaRN builds on “NTK-aware” interpolation (named after Neural Tangent Kernel theory) and its inference-time variant, “Dynamic NTK” interpolation. Rather than changing the attention computation itself, these methods rescale the rotation frequencies of the Rotary Position Embeddings (RoPE): high-frequency dimensions are left almost untouched while low-frequency dimensions are stretched, so that positions beyond the original training length map back into a range the model has already seen.
  • With dynamic scaling, the scale factor is recomputed from the current sequence length on each forward pass. The model therefore behaves as originally trained inside its original context window and degrades gracefully, rather than abruptly, as the sequence grows beyond it, all without extensive retraining.
  • Additionally, YaRN incorporates a temperature parameter that affects the perplexity across different data samples and token positions within the extended context window. Adjusting this temperature parameter modifies the attention mechanism, enhancing the model’s ability to handle extended context lengths efficiently.
  • Extensive experiments demonstrate YaRN’s efficacy. For instance, it achieves context window extension of language models with RoPE as the position embedding, using only about 0.1% of the original pre-training corpus, a significant reduction in computational resources.
  • Evaluations focus on several aspects, such as perplexity scores of fine-tuned models with extended context windows, the passkey retrieval task, and performance on standardized LLM benchmarks. YaRN models show strong performance across all contexts, effectively extending the context window of LLaMA 2 models to 128k. The following figure from the paper illustrates the sliding window perplexity (S = 256) of ten 128k Proof-pile documents truncated to evaluation context window size.

  • The paper concludes that YaRN improves upon all existing RoPE interpolation methods and acts as a highly efficient drop-in replacement. It preserves the original abilities of fine-tuned models while attending to very large context sizes and allows for efficient extrapolation and transfer learning under compute-constrained scenarios.
  • The research illustrates YaRN as a significant advancement in extending the context window of large language models, offering a more compute-efficient approach with broad implications for model training and performance.
  • Code.
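  • To make the RoPE scaling ideas above concrete, here is a minimal sketch of dynamic scaling, “NTK-aware” base adjustment, and the attention temperature factor, using the formulas as I understand them from the paper. It is not the full YaRN method (the per-dimension “NTK-by-parts” ramp is omitted), and the function names are hypothetical.

```python
import math

def dynamic_scale(current_len, original_len):
    """Dynamic scaling: s = max(1, l' / L), recomputed per forward pass."""
    return max(1.0, current_len / original_len)

def ntk_aware_base(base, scale, head_dim):
    """'NTK-aware' interpolation: enlarge the RoPE base instead of
    dividing positions, b' = b * s ** (|D| / (|D| - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

def rope_frequencies(base, head_dim):
    """RoPE rotation frequencies theta_d = base ** (-2d / |D|)."""
    return [base ** (-2 * d / head_dim) for d in range(head_dim // 2)]

def yarn_attention_factor(scale):
    """Attention temperature recommended in the paper for LLaMA models:
    sqrt(1 / t) = 0.1 * ln(s) + 1."""
    return 0.1 * math.log(scale) + 1.0
```

  • With this base adjustment the highest frequency is unchanged and the lowest frequency is divided by exactly s, which is the “stretch low frequencies, keep high frequencies” behaviour described above.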
StarCoder: May the Source Be with You!
  • The BigCode community, an open-scientific collaboration, introduces StarCoder and StarCoderBase: Large Language Models (LLMs) for code, each with 15.5 billion parameters, 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention.
  • StarCoderBase was trained on 1 trillion tokens from The Stack, a large collection of permissively licensed GitHub repositories, covering over 80 programming languages, GitHub issues, Git commits, and Jupyter notebooks. StarCoder was fine-tuned on an additional 35 billion Python tokens.
  • These models exhibit novel architectural features like an 8K context length, Fill-in-the-Middle (FIM) capabilities, and Multi-Query-Attention (MQA). An extensive evaluation of these models was conducted, showcasing their superiority over other open Code LLMs in handling multiple programming languages and even matching or surpassing the OpenAI code-cushman-001 model.
  • StarCoder, when fine-tuned on Python, significantly outperforms other Python-tuned LLMs and, with its 8K token context, can function as a virtual technical assistant without requiring instruction-tuning or Reinforcement Learning from Human Feedback (RLHF).
  • Significant steps were taken towards ensuring a safe open model release. StarCoder is released under the OpenRAIL-M license, promoting transparency and commercial viability. The release includes an integrated attribution tool in the VSCode demo for detecting and locating model generations potentially copied from the training set. Additionally, a robust Personally Identifiable Information (PII) detection model, StarEncoder, was developed to enhance privacy protection, utilizing a dataset containing 12,000 files with 22,950 annotated entities.
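  • The Fill-in-the-Middle capability mentioned above is usually exposed through sentinel tokens arranged in prefix-suffix-middle order. The sketch below shows that arrangement; the sentinel strings are my understanding of the released tokenizer’s special tokens and should be treated as an assumption, and the helper names are hypothetical.

```python
# Sentinel strings assumed for illustration; check the tokenizer you use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(prefix, suffix):
    """Prompt asking the model to generate the code between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def make_fim_training_text(document, start, end):
    """Turn a document into one infilling example by cutting out
    document[start:end] and moving it to the end."""
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    return build_fim_prompt(prefix, suffix) + middle
```

  • At inference time only `build_fim_prompt` is needed; the model’s continuation is the proposed middle span.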
Let’s Verify Step by Step
  • This paper by Lightman et al. from OpenAI presents a detailed investigation into the effectiveness of process supervision compared to outcome supervision in training language models for complex multi-step reasoning. Here’s a summary of their findings:
  • The authors explore the concepts of outcome and process supervision. Outcome-supervised reward models (ORMs) focus on the final result of a model’s reasoning chain, while process-supervised reward models (PRMs) receive feedback at each step in the reasoning chain.
  • To collect process supervision data, they present human data-labelers with step-by-step solutions to MATH problems sampled by the large-scale generator. The labelers’ task is to assign each step in the solution a label of positive, negative, or neutral. A positive label indicates that the step is correct and reasonable. A negative label indicates that the step is either incorrect or unreasonable. A neutral label indicates ambiguity. In practice, a step may be labelled neutral if it is subtly misleading, or if it is a poor suggestion that is technically still valid. Neutral labels allow them to defer the decision about how to handle ambiguity: at test time, neutral labels can be treated as either positive or negative. The following figure from the paper shows a screenshot of the interface used to collect feedback for each step in a solution.

  • The following figure from the paper shows two solutions to the same problem, graded by the PRM. The solution on the left is correct while the solution on the right is incorrect. A green background indicates a high PRM score, and a red background indicates a low score. The PRM correctly identifies the mistake in the incorrect solution.

  • For their experiments, they used large-scale models fine-tuned from GPT-4 and smaller models for detailed comparisons. These models were trained on the MATH dataset, which includes complex mathematical problems.
  • The paper introduces a new dataset, PRM800K, comprising 800,000 step-level human feedback labels, which was instrumental in training their PRM models.
  • The key findings show that process supervision significantly outperforms outcome supervision in training models to solve complex problems. Specifically, their PRM model solved 78.2% of problems from a representative subset of the MATH test set.
  • The researchers also demonstrate that active learning significantly improves the efficiency of process supervision, leading to better data utilization.
  • They conducted out-of-distribution generalization tests using recent STEM tests like AP Physics and Calculus exams, where the PRM continued to outperform other methods.
  • The paper discusses the implications of their findings for AI alignment, highlighting the advantages of process supervision in producing more interpretable and aligned models.
  • They acknowledge potential limitations related to test set contamination but argue that the relative comparisons made in their work are robust against such issues.
  • This research contributes to the field by showing the effectiveness of process supervision and active learning in improving the reasoning capabilities of language models, especially in complex domains like mathematics.
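  • A process-supervised reward model can be used to rank candidate solutions. In the paper, a solution’s PRM score is the probability that every step is correct, implemented as the product of the per-step correctness probabilities. The sketch below assumes those per-step probabilities are already available; the function names and input format are hypothetical.

```python
import math

def prm_solution_score(step_probs):
    """Probability that every step is correct: product of per-step scores."""
    return math.prod(step_probs)

def best_of_n(candidates):
    """Pick the solution with the highest PRM score.

    candidates: list of (solution_text, step_probs) pairs, where
    step_probs are the PRM's correctness probabilities for each step.
    """
    best_text, _ = max(candidates, key=lambda c: prm_solution_score(c[1]))
    return best_text
```

  • A single low-scoring step pulls the whole product down, which is how the PRM penalizes a solution containing one clear mistake even when its other steps look fine.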



Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
  • Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited.
  • This paper by Graves et al. from Schmidhuber’s lab presents a novel method for temporal classification with RNNs to label unsegmented sequences directly, thereby solving both aforementioned problems. Their method fits naturally into the existing framework of neural network classifiers, and is derived from the same probabilistic principles. It obviates the need for pre-segmented data, and allows the network to be trained directly for sequence labelling.
  • An experiment on a real-world temporal classification problem with the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN without requiring any task-specific knowledge.
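  • CTC lets the network emit one symbol (or a blank) per frame and then maps that frame-level path to a label sequence by merging repeats and dropping blanks. The sketch below shows this collapsing step with greedy best-path decoding; it does not cover the forward-backward training objective, and the `"-"` blank symbol and function names are choices made for illustration.

```python
from itertools import groupby

BLANK = "-"  # illustrative blank symbol

def ctc_collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]

def best_path_decode(frame_probs, alphabet):
    """Greedy decoding: take the most likely symbol per frame, then collapse.

    frame_probs: one probability list per frame, aligned with alphabet.
    """
    path = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(path)
```

  • The blank is what allows genuinely repeated labels: the path "l l - l" collapses to two l’s, whereas "l l l" collapses to one.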


Front-end factor analysis for speaker verification
  • This paper by Dehak et al. from JHU in IEEE/ACM Transactions on Audio, Speech, and Language Processing 2010 proposes a non-deep learning method that uses Joint Factor Analysis (JFA) as a feature extractor to learn a low-dimensional speaker representation for speaker verification, which is also used to model session and channel effects/variabilities.
  • In this new space, a given speech utterance is represented by a new vector named total factors (called the identity-vector or the “i-vector”). The i-vector is thus a feature that represents the characteristics of the frame-level features’ distributive pattern. i-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It’s extracted in a similar manner to the eigenvoice adaptation scheme or the JFA technique, but is extracted per sentence (or input speech sample).
  • Two speaker verification systems are proposed which use this new representation. The first system is a Support-Vector-Machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. In this scoring, they removed the SVM from the decision process. One important characteristic of this approach is that there is no speaker enrollment, unlike in other approaches like SVM and JFA, which makes the decision process faster and less complex.
  • They achieved an EER of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. They also obtained 4% absolute EER improvement for both-gender trials on the 10sec-10sec condition compared to the classical joint factor analysis scoring.
  • Up until d-vectors, the state-of-the-art speaker verification systems were based on the concept of i-vectors (which use Probabilistic Linear Discriminant Analysis (PLDA) as a classifier to make the final decision).


Sequence Transduction with Recurrent Neural Networks
  • Many machine learning tasks can be expressed as the transformation or transduction of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating.
  • Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging.
  • This paper by Graves in the 2012 ICML Workshop on Representation Learning introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence.
  • Experimental results for phoneme recognition are provided on the TIMIT speech corpus.
  • Slides.


Hybrid speech recognition with Deep Bidirectional LSTM
  • Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems.
  • This paper by Graves et al. from UofT in the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. They find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However, the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.
  • They conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.


Towards End-To-End Speech Recognition with Recurrent Neural Networks
  • This paper by Graves and Jaitly in PMLR in 2014 presents a character-level speech recognition system that directly transcribes audio data with text using a recurrent neural network with minimal preprocessing, without requiring an intermediate phonetic representation.
  • The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and a modified Connectionist Temporal Classification (CTC) objective function that allows a direct optimization of the word error rate, even in the absence of a lexicon or language model. Further, they show how to integrate the network outputs with a language model during decoding.
  • The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7% and achieves state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition.
Deep neural networks for small footprint text-dependent speaker verification
  • This paper by Variani et al. from JHU, Google, and Biometric Recognition Group in 2014 investigates the use of deep neural networks (DNNs) to train speaker embeddings for a small footprint text-dependent speaker verification task. The DNN architecture is shown in the figure below.
  • During model training, the DNN takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and generates the one-hot speaker label (or the speaker probability) to classify speakers at the frame-level.
  • During speaker enrollment, the trained DNN is used to extract speaker-specific features/embeddings by averaging the activations from the last hidden layer (called deep-vectors or “d-vectors” for short), which is taken as the speaker model.
  • During speaker evaluation, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision by calculating the cosine distance between the test d-vector and the claimed speaker’s d-vector, similar to the i-vector framework. A verification decision is made by comparing the distance to a threshold.
  • Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the d-vectors are more robust to additive noise and outperform i-vectors at low False Rejection operating points. The combined (d+i)-vector system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.
  • Note that unlike the i-vector framework, this doesn’t have any assumptions about the feature’s distribution (the i-vector framework assumes that the i-vector has a Gaussian distribution).
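  • The enrollment and evaluation steps above reduce to averaging vectors and comparing them with a cosine score. The sketch below assumes the last-hidden-layer activations have already been extracted; the function names and threshold values are hypothetical, and it scores with cosine similarity (accepting when the score is high), which is equivalent to thresholding cosine distance from the other side.

```python
import math

def average_vectors(vectors):
    """Element-wise mean, used both to pool frame activations into an
    utterance d-vector and to pool utterance d-vectors into a speaker model."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrollment_dvectors, test_dvector, threshold):
    """Accept the claimed identity if the test d-vector is close enough
    to the averaged enrollment d-vectors."""
    speaker_model = average_vectors(enrollment_dvectors)
    return cosine_similarity(speaker_model, test_dvector) >= threshold
```

  • In practice the threshold would be tuned on held-out trials to reach a desired trade-off between false acceptances and false rejections.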


Listen, Attend and Spell
  • This paper by Chan et al. from CMU and Google in 2015 presents Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.
  • LAS is based on the sequence-to-sequence framework, is trained end-to-end and has two main components: a listener (encoder) and a speller (decoder). The listener is a pyramidal RNN encoder that accepts filter bank spectra as inputs, transforms the input sequence into a high level feature representation and reduces the number of timesteps that the decoder has to attend to. The speller is an attention-based RNN decoder that attends to the high level features and spells out the transcript one character at a time.
  • The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. They bypass the conditional independence assumptions of CTC, and show how they can learn an implicit language model that can generate multiple spelling variants given the same acoustics. In other words, producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models.
  • To further improve the results, they used samples from the softmax classifier in the decoder as inputs to the next step prediction during training. Finally, they show how a language model trained on additional text can be used to rerank their top hypotheses.
  • On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
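  • The time reduction performed by the pyramidal listener can be illustrated by concatenating each pair of consecutive frames, which halves the sequence length per layer. This sketch leaves out the BLSTM that sits between reductions in the actual model; the function names and the default of three layers are assumptions for illustration.

```python
def pyramid_reduce(frames):
    """Concatenate consecutive frame pairs: T frames of size d become
    T // 2 frames of size 2d (a trailing odd frame is dropped)."""
    usable = len(frames) - len(frames) % 2
    return [frames[i] + frames[i + 1] for i in range(0, usable, 2)]

def listener_time_reduction(frames, num_layers=3):
    """Apply the pairwise reduction num_layers times (8x shorter for 3)."""
    for _ in range(num_layers):
        frames = pyramid_reduce(frames)
    return frames
```

  • Shortening the encoder output this way gives the attention-based speller far fewer timesteps to attend over, which is the motivation stated above.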


CNN Architectures for Large-Scale Audio Classification
  • This paper by Hershey et al. from Google in ICASSP 2017 presents VGGish. The authors apply various state-of-the-art image CNN architectures to audio and show that they achieve excellent results on audio classification compared to a simple fully connected network or earlier image classification architectures.
  • They examine a fully connected deep neural network baseline alongside CNN architectures adapted from image classification: AlexNet, VGG, Inception, and ResNet. The input audio is divided into non-overlapping 960 ms frames, which are decomposed with a short-time Fourier transform into a spectrogram. The spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed. This gives log-mel spectrogram patches that are passed as input to all classifiers. They explore the effects of training with different sized subsets of the 70M training videos (5.24 million hours) with 30,871 labels.
  • While their dataset contains video-level labels, they are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet. They find that a model for AED with embeddings learned from these classifiers does much better than raw features on the AudioSet AED classification task.
  • They find that derivatives of image classification networks do well on the audio classification task, that increasing the number of labels they train on provides some improved performance over subsets of labels, that performance of models improves as they increase training set size, and that a model using embeddings learned from the video-level task does much better than a baseline on the AudioSet classification task.
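  • The framing step described above can be sketched as cutting a log-mel spectrogram into fixed-size patches. The sketch assumes the spectrogram has already been computed with a 10 ms hop, so that one 960 ms frame corresponds to 96 spectrogram columns of 64 mel bins; the hop size and the function name are assumptions for illustration.

```python
def split_into_patches(log_mel_frames, frames_per_patch=96):
    """Cut a [time][mel] log-mel spectrogram into non-overlapping patches.

    With a 10 ms hop (assumed), 96 frames cover 960 ms. Leftover frames
    that do not fill a whole patch are discarded.
    """
    num_patches = len(log_mel_frames) // frames_per_patch
    return [
        log_mel_frames[i * frames_per_patch:(i + 1) * frames_per_patch]
        for i in range(num_patches)
    ]
```

  • Each resulting patch is a 96 x 64 array that can be fed to an image-style CNN in the same way as a single-channel image.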


X-Vectors: Robust DNN Embeddings for Speaker Recognition
  • This paper by Snyder et al. from JHU in ICASSP 2018 uses data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition.
  • The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings called x-vectors.
  • While prior studies have found that embeddings leverage large-scale training datasets better than i-vectors, it can be challenging to collect substantial quantities of labeled data for training. They use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness.
  • Their data augmentation strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. They use the simulated RIRs described by Ko et al., and the reverberation itself is performed with the multicondition training tools in the Kaldi ASpIRE recipe. For additive noise, they use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres, and 60 hours of speech from twelve languages.
  • A PLDA classifier is used in the x-vector framework to make the final decision, similar to i-vector systems.
  • The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese where they achieve superior performance on the evaluation datasets.
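  • The additive-noise part of the augmentation amounts to mixing a noise clip into the speech at a chosen signal-to-noise ratio. The sketch below is a generic illustration of that operation, not the Kaldi recipe; it assumes equal-length, non-silent sample lists, and the function names are hypothetical.

```python
import math

def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the result has the requested SNR in dB.

    Assumes speech and noise have the same length and nonzero power.
    """
    # Choose scale so that power(speech) / power(scale * noise) = 10^(snr_db / 10).
    target_ratio = 10 ** (snr_db / 10)
    scale = math.sqrt(power(speech) / (power(noise) * target_ratio))
    return [s + scale * n for s, n in zip(speech, noise)]
```

  • Sampling the SNR and the noise clip at random for each utterance yields many distinct noisy copies of the same labeled recording.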
WaveGlow: A Flow-based Generative Network for Speech Synthesis
  • This paper by Prenger et al. from NVIDIA in 2018 proposes WaveGlow, a flow-based network capable of generating high quality speech from mel-spectrograms.
  • WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
  • Their PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
  • This paper by Shen et al. from Google in 2018 describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
  • Their model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
  • To validate their design choices, they present ablation studies of key components of their system and evaluate the impact of using mel-spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features.
  • They further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
  • PyTorch hub


wav2vec: Unsupervised Pre-training for Speech Recognition
  • Reducing the need for manually annotated data is important for developing systems that understand non-English languages, particularly those with limited existing training sets of transcribed speech.
  • This paper by Schneider et al. from Facebook AI in 2019 introduces wav2vec, the first application of unsupervised pre-training to speech recognition using a fully convolutional model that learns representations of raw, unlabeled audio.
  • Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. They pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.
  • Wav2vec trains models to learn the difference between original speech examples and modified versions, often repeating this task hundreds of times for each second of audio, and predicting the correct audio milliseconds into the future.
  • This self-supervised approach beats traditional ASR systems that rely solely on transcribed audio. Their experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Their approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2 (Amodei et al., 2016), the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
  • They show that more data for pre-training improves performance and that this approach not only improves resource-poor setups, but also settings where all WSJ training data is used.
  • Facebook AI article.
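  • Since the results above are reported as word error rate (WER) and relative WER reduction, a minimal sketch of both quantities may help. WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function names are illustrative, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative improvement, e.g. 0.36 means a 36% relative WER reduction."""
    return (baseline_wer - new_wer) / baseline_wer
```

  • For example, going from a hypothetical baseline WER of 10.0% to 6.4% is a 36% relative reduction, even though the absolute difference is only 3.6 points.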
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
  • This paper by Park et al. from Google in 2019 presents SpecAugment, a simple data augmentation method for speech recognition.
  • SpecAugment greatly improves the performance of ASR networks. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. They apply SpecAugment on Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.
  • They achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple handcrafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and they are able to gain performance by using bigger networks and training longer. On LibriSpeech, they achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
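  • The masking part of the augmentation policy can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it applies one frequency mask and one time mask, fills masked regions with zeros (an assumption), and leaves out time warping. The parameter names are hypothetical.

```python
import random


def spec_augment(spec, max_freq_mask, max_time_mask, rng):
    """Return a masked copy of spec, a list of frequency rows (each a list over time).

    One block of consecutive frequency channels and one block of consecutive
    time steps are set to zero. Requires max_freq_mask <= number of channels
    and max_time_mask <= number of time steps.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency masking: zero f consecutive channels starting at f0.
    f = rng.randint(0, max_freq_mask)
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time masking: zero t consecutive time steps starting at t0.
    t = rng.randint(0, max_time_mask)
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0
    return out
```

  • Because the masks are drawn fresh for every call, each epoch sees a different corrupted view of the same utterance, which is what counters over-fitting.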
Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition
  • Recently, speaker embeddings extracted from a speaker discriminative deep neural network (DNN) yield better performance than the conventional methods such as i-vector. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks.
  • This paper by Xiang et al. from Shanghai Jiao Tong and AISpeech in Interspeech 2019 addresses this issue by introducing three margin-based losses to deep speaker embedding learning. These losses not only separate classes but also demand a fixed margin between them:
    • Angular softmax loss (denoted by A-Softmax loss),
    • Additive margin softmax loss (denoted by AM-Softmax loss), and
    • Additive angular margin loss (denoted by AAM-Softmax loss).
  • They find that the margin plays a vital role in learning discriminative embeddings and leads to a significant performance boost.
  • Experiments are conducted on two public text-independent tasks: VoxCeleb1 and Speakers in the Wild (SITW).
  • The proposed approach can achieve the state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks when compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on VoxCeleb1 test set and 2.761% EER on SITW core-core test set, respectively.
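  • To illustrate how a margin enters the loss, here is a minimal sketch of the additive variants for a single embedding. AM-Softmax subtracts the margin from the target cosine, while AAM-Softmax adds it to the target angle. The `scale` and `margin` values are illustrative assumptions, not the paper's settings. A-Softmax (multiplicative angular margin) and the edge case where the angle plus margin exceeds pi are left out.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_softmax_loss(embedding, class_weights, target,
                        scale=30.0, margin=0.2, kind="aam"):
    """Cross entropy over scaled cosine logits with a margin on the target class.

    kind="am"   uses cos(theta) - margin      (AM-Softmax)
    kind="aam"  uses cos(theta + margin)      (AAM-Softmax)
    any other value applies no margin (plain normalized softmax).
    """
    logits = []
    for k, w in enumerate(class_weights):
        cos_t = max(-1.0, min(1.0, cosine(embedding, w)))
        if k == target:
            if kind == "am":
                cos_t = cos_t - margin
            elif kind == "aam":
                cos_t = math.cos(math.acos(cos_t) + margin)
        logits.append(scale * cos_t)
    # Numerically stable log-sum-exp for the softmax denominator.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]
```

  • With the margin in place, an embedding must sit closer to its own class direction than plain softmax would require before the loss becomes small, which is what encourages intra-class compactness.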


Conformer: Convolution-augmented Transformer for Speech Recognition
  • Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.
  • This paper by Gulati et al. from Google in Interspeech 2020 achieves the best of both worlds by integrating components from both CNNs and Transformers for end-to-end speech recognition to model both local and global dependencies of an audio sequence in a parameter-efficient way.
  • They studied the importance of each component, and demonstrated that the inclusion of convolution modules is critical to the performance of the Conformer model.
  • To this end, they propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, the Conformer model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. They also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
  • The following figure from the paper shows the Conformer encoder model architecture. Conformer comprises two macaron-like feed-forward layers with half-step residual connections sandwiching the multi-headed self-attention and convolution modules. This is followed by a post-layernorm.
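  • The block structure described above can be sketched as a sequence of residual updates. This is only a structural illustration: the four sub-modules are passed in as hypothetical callables on a flat feature vector, and the layer norm has no learnable scale or bias.

```python
import math


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(x, y, weight=1.0):
    """Residual connection: x + weight * y, element-wise."""
    return [a + weight * b for a, b in zip(x, y)]


def conformer_block(x, ffn1, self_attention, conv, ffn2):
    x = add(x, ffn1(x), weight=0.5)   # first feed-forward, half-step residual
    x = add(x, self_attention(x))     # multi-headed self-attention module
    x = add(x, conv(x))               # convolution module
    x = add(x, ffn2(x), weight=0.5)   # second feed-forward, half-step residual
    return layer_norm(x)              # post-layernorm
```

  • The two 0.5 weights are the "half-step" part of the macaron design: each feed-forward layer contributes half of a regular residual update on either side of the attention and convolution modules.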

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
  • This paper by Baevski et al. from Facebook AI in NeurIPS 2020 shows for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
  • Wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
  • Compared to wav2vec, wav2vec 2.0 learns basic speech units used to tackle a self-supervised task. The model is trained to predict the correct speech unit for masked parts of the audio, while at the same time learning what the speech units should be.
  • Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. With just 10 minutes of transcribed speech and 53K hours of unlabeled speech, wav2vec 2.0 enables speech recognition models at a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
  • This opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy.
  • They have also developed a cross-lingual approach, dubbed XLSR, that can learn speech units common to several languages. This approach helps when they have even small amounts of unlabeled speech, since languages for which they have little data can benefit from languages for which more data is available.
  • Code; Facebook AI article.
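  • The contrastive task can be sketched for a single masked time step as follows. The use of cosine similarity with a temperature follows the formulation in the wav2vec 2.0 paper rather than this summary, the temperature value here is illustrative, and the codebook diversity term is omitted.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, true_q, distractors, temperature=0.1):
    """-log softmax probability of the true quantized vector among all candidates."""
    candidates = [true_q] + distractors
    logits = [cosine(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over the true target and the distractors.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]
```

  • The loss is close to zero when the context vector points at the true quantized unit and grows when a distractor is a better match, so the model is pushed to infer the masked speech unit from its surrounding context.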
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models.
  • This paper by Kong et al. from Kakao Enterprise in NeurIPS 2020 proposes HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, they demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality.
  • HiFi-GAN outperforms the best performing publicly available models in terms of synthesis quality, even comparable to human level. Moreover, it shows a significant improvement in terms of synthesis speed. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that their proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.
  • They took inspiration from the characteristic of speech audio that consists of patterns with various periods and applied it to neural networks, and verified that the existence of the proposed discriminator greatly influences the quality of speech synthesis through the ablation study.
  • HiFi-GAN shows the ability to generalize to the mel-spectrogram inversion of unseen speakers and to synthesize speech audio comparable to human quality from noisy inputs in an end-to-end setting. In addition, their small-footprint model demonstrates comparable sample quality with the best publicly available autoregressive counterpart, while generating samples an order of magnitude faster than real-time on CPU. This shows progress towards on-device natural speech synthesis, which requires low latency and memory footprint.
  • Finally, their experiments show that the generators of various configurations can be trained with the same discriminators and learning mechanism, which indicates the possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators.
  • Code.
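  • One way to see how a discriminator can focus on periodic patterns: in the paper, the waveform is viewed as a 2D grid whose width equals a chosen period, so that samples one period apart line up in the same column. A minimal sketch of that reshaping step follows. The zero padding is a simplifying assumption, and the function name is illustrative.

```python
def reshape_by_period(audio, period):
    """Reshape a 1D signal into rows of length `period` (zero-padded at the end).

    Column c of the result contains samples c, c + period, c + 2 * period, ...
    """
    pad = (-len(audio)) % period
    padded = list(audio) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]
```

  • A discriminator that convolves along the columns of this grid only compares samples spaced exactly one period apart, and using several different periods covers different periodic structures of the same audio.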
GAN-based Data Generation for Speech Emotion Recognition
  • This paper by Eskimez et al. from Microsoft in Interspeech 2020 proposes a GAN-based method to generate synthetic data in the form of speech emotion spectrograms, which can be used for training speech emotion recognition networks. Specifically, they investigate the usage of GANs for capturing the data manifold when the data is eyes-off, i.e., where they can train networks using the data but cannot copy it from the clients.
  • They propose a CNN-based GAN with spectral normalization on both the generator and discriminator, both of which are pre-trained on large unlabeled speech corpora. They show that their method provides better speech emotion recognition performance than a strong baseline.
  • They proposed to use GANs for modeling imbalanced and highly skewed data among clients for future use, even after the original data is removed.
  • Furthermore, they show that even after the data on the client is lost, their model can generate similar data that can be used for model bootstrapping in the future. Although they evaluated their method for speech emotion recognition, it can be applied to other tasks.
Generalized end-to-end loss for speaker verification
  • This paper by Wan et al. from Google in 2020 proposes a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient (especially compared to their previous tuple-based end-to-end (TE2E) loss function).
  • Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.
  • Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, their model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60% at the same time.
  • Both theoretical and experimental results verified the advantage of this novel loss function.
  • They also introduce the MultiReader technique, which allows them to do domain adaptation — training a more accurate model that supports multiple keywords (i.e., “OK Google” and “Hey Google”) as well as multiple languages/dialects. By combining these two techniques, they produced more accurate speaker verification models.
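  • The centroid-based behavior described above can be sketched as follows. This illustrates the contrast variant of the loss, which pulls each embedding towards its own speaker centroid and pushes it away from the most similar other centroid. The scaled cosine similarity with fixed `w` and `b` (learned in practice) and the exclusion of an utterance from its own centroid follow the paper rather than this summary, and the numeric values are illustrative.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ge2e_contrast_loss(batch, w=10.0, b=-5.0):
    """batch[j][i] is the embedding of utterance i from speaker j.

    Needs at least two speakers and at least two utterances per speaker.
    """
    centroids = [mean(utts) for utts in batch]
    total = 0.0
    for j, utts in enumerate(batch):
        for i, e in enumerate(utts):
            # The own-speaker centroid leaves out the utterance being scored.
            own = mean(utts[:i] + utts[i + 1:])
            pos = sigmoid(w * cosine(e, own) + b)
            # Hardest negative: the most similar centroid of a different speaker.
            neg = max(sigmoid(w * cosine(e, c) + b)
                      for k, c in enumerate(centroids) if k != j)
            total += 1.0 - pos + neg
    return total
```

  • Because every utterance in the batch is compared against every speaker centroid at once, no separate stage of selecting hard example tuples is required.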


Generative Spoken Language Modeling from Raw Audio
  • This paper by Lakhotia et al. from Facebook AI in 2021 introduces Generative Spoken Language Modeling, which learns speech representations from CPC, wav2vec 2.0, and HuBERT for synthesizing speech.
  • Generative Spoken Language Modeling is the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels). The paper also introduces a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation.
  • They set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. The following figure from the paper shows the setup of the baseline model architecture, tasks and metrics.

  • Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), they find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
  • Facebook AI post.
  • Code.
Text-Free Prosody-Aware Generative Spoken Language Modeling
  • Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored.
  • This paper by Kharitonov et al. from Facebook AI in 2021 builds upon Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021), which addresses the generative aspects of speech pre-training by replacing text with discovered phone-like units for language modeling, and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need for text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech.
  • In this work, they present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
  • They devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
  • Facebook AI post.
  • Code
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  • This paper by Polyak et al. from Facebook AI in Interspeech 2021 proposes using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representations, they separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner.
  • They analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, they evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation.
  • Lastly, they demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, they can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
  • Facebook AI post.
  • Code
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
  • Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging.
  • This paper by Chen et al. from Furu Wei’s group at Microsoft Research in JSTSP 2021 proposes WavLM, a new large-scale pre-trained model trained on 94k hours of audio, to solve full stack downstream speech processing tasks.
  • WavLM extends the HuBERT framework to masked speech prediction and denoising modeling, enabling the pre-trained models to perform well on both ASR and non-ASR tasks.
  • WavLM jointly learns masked speech prediction and denoising in pre-training. In this way, WavLM not only retains its speech content modeling capability through masked speech prediction, but also improves its potential on non-ASR tasks through speech denoising.
  • In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. They also scale up the training dataset from 60k hours to 94k hours.
  • WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks such as speaker verification, speech separation, and speaker diarization.
  • In contrast to previous SSL models, WavLM is not only effective for the ASR task but also has the potential to become the next-generation backbone network for speaker-related tasks.
  • Code and pre-trained models.
Recent Advances in End-to-End Automatic Speech Recognition
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Li from Microsoft in APSIPA Transactions on Signal and Information Processing in 2021 reviewed the influential frameworks in end-to-end automatic speech recognition systems, the major challenges as well as the solutions and advances in this field.
  • The author firstly reviewed three popular methods in this domain, including CTC (Connectionist Temporal Classification) by Graves et al., AED (Attention-based Encoder-Decoder) by Cho et al., Bahdanau et al. as well as RNN-T (RNN Transducer) by Graves.
  • The author then analyzed two major encoder architectures - LSTMs by Hochreiter and Schmidhuber and Transformers by Vaswani et al., along with their limitations and variations.
  • The author also mentioned other training criteria including knowledge distillation by Hinton et al. and minimum word error rate.
  • It is easier to build a multilingual model with end-to-end systems compared to hybrid systems.
  • The paper covered several major challenges for end-to-end models:
    • It is difficult to adapt the model to the test speaker because of the small amount of adaptation data. Approaches to solve this issue include utilizing regularization techniques, multi-task learning as well as multi-speaker text-to-speech.
    • The performance would be worse when adapting the end-to-end model to a different content domain due to the lack of the speech-text data pairs in the new domain. Approaches to overcome this problem include:
      • Fusing the end-to-end model with an extra language model where the language model was trained on the text data of the new domain.
      • Training the end-to-end model on the new domain by synthesizing speech from the text of the new domain utilizing TTS (text-to-speech) technologies.
      • Adopting the spliced data method by Zhao et al.
    • Improving the capability of making use of the context is challenging for end-to-end models and the author mentioned a few existing solutions that address this issue including adding a context encoder.
w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
  • This paper by Chung et al. from MIT CSAIL and Google Brain proposes w2v-BERT, which combines the core methodologies of self-supervised pre-training of speech embodied in the wav2vec 2.0 model and the self-supervised pre-training of language embodied in BERT.
  • The following figure from the paper shows an illustration of the w2v-BERT pre-training framework. w2v-BERT is composed of a feature encoder, a contrastive module, and a masked language modeling (MLM) module, where the latter two are both a stack of conformer blocks. \(N\) and \(M\) denote the number of conformer blocks in the two modules, respectively.

  • The idea of w2v-BERT is to learn contextualized speech representations by using the contrastive task defined earlier in wav2vec 2.0 to obtain an inventory of a finite set of discretized speech units, and then use them as tokens in a masked prediction task similar to the masked language modeling (MLM) proposed in BERT.
  • From the figure above, we can see that w2v-BERT consists of three main components:
    • Feature Encoder: The feature encoder acts as a convolutional sub-sampling block that consists of two 2D-convolution layers, both with strides \(\left( 2,2 \right)\), resulting in a 4x reduction in the acoustic input’s sequence length. Given, for example, a log-mel spectrogram as input, the feature encoder extracts latent speech representations that will be taken as input by the subsequent contrastive module.
    • Contrastive Module: The goal of the contrastive module is to discretize the feature encoder output into a finite set of representative speech units; that’s why the output of the feature encoder follows two different paths:
      • First path: It is masked, then fed into the linear projection layer followed by the stack of Conformer blocks to produce context vectors.
      • Second Path: It is passed to the quantization mechanism without masking to yield quantized vectors and their assigned token IDs.
      • The quantized vectors are used in conjunction with the context vectors that correspond to the masked positions to solve the contrastive task defined in wav2vec 2.0; the assigned token IDs will be later used by the subsequent masked prediction module as prediction target.
    • Masked Prediction Module: The masked prediction module is a stack of Conformer blocks (identical to the one used with the contrastive module) which directly takes in the context vectors produced by the contrastive module and extracts high-level contextualized speech representations.

  • Pre-training & Fine-tuning: During pre-training, only unlabeled speech data is used to train w2v-BERT to solve two self-supervised tasks at the same time, weighted by two hyper-parameters \(\beta\) and \(\gamma\), which were both set to 1 in the paper:
\[\mathcal{L} = \beta \cdot \mathcal{L}_{c} + \gamma \cdot \mathcal{L}_{m}\]
  • Contrastive Loss \(\mathcal{L}_{c}\): For a context vector \(c_t\) corresponding to a masked time step \(t\), the model is asked to identify its true quantized vector \(q_t\) from a set of \(K\) distractors \(\left\{ \widetilde{q}_1, \widetilde{q}_2, \ldots, \widetilde{q}_K \right\}\) that are also quantized vectors uniformly sampled from other masked time steps of the same utterance. This loss is denoted as \(\mathcal{L}_w\), and it is further augmented with a codebook diversity loss \(\mathcal{L}_d\), weighted by a hyper-parameter \(\alpha\), to encourage uniform usage of the codes. Therefore, the final contrastive loss is defined as:
\[\mathcal{L}_{c} = \mathcal{L}_{w} + \alpha \mathcal{L}_{d}\]
  • Mask Prediction Loss \(\mathcal{L}_{m}\): This is the cross-entropy loss for predicting, at each masked position, the token ID assigned by the quantizer. They randomly sample the starting positions to be masked with a probability of 0.065 and mask the subsequent 10 time steps, so the masked spans may overlap.
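  • The span masking and the weighted objective above can be sketched as follows. The sketch assumes each time step is chosen as a span start independently with the stated probability and that spans are truncated at the end of the sequence. Both are assumptions of this illustration, and the function names are hypothetical.

```python
import random


def sample_span_mask(num_steps, start_prob=0.065, span_length=10, rng=None):
    """Return a list of booleans marking masked time steps.

    Every step is a span start with probability start_prob, and each start
    masks itself plus the following steps up to span_length in total.
    Overlapping spans simply merge.
    """
    if rng is None:
        rng = random.Random()
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_steps)):
                mask[t] = True
    return mask


def total_loss(contrastive, masked_prediction, beta=1.0, gamma=1.0):
    """Weighted sum of the contrastive and masked prediction losses."""
    return beta * contrastive + gamma * masked_prediction
```

  • Under this independence assumption, a step far from the sequence edges stays unmasked only if none of the 10 possible starts covering it fires, so roughly \(1 - 0.935^{10} \approx 49\%\) of the time steps end up masked.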

  • During fine-tuning, labeled data is used to train an RNN-T model in which the encoder is a pre-trained w2v-BERT model, the decoder is a two-layer LSTM with a hidden dimension of 640, and the joint network is a linear layer with Swish activation and batch normalization.

SUPERB: Speech processing Universal PERformance Benchmark
  • Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm.
  • This paper by Yang et al. from Facebook AI in Interspeech 2021 seeks to bridge this gap and introduces the Speech processing Universal PERformance Benchmark (SUPERB).
  • SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, they especially focus on extracting the representation learned from SSL due to its preferable re-usability.
  • They present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
  • Their results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks.
  • They release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel research in representation learning and general speech processing.


Direct speech-to-speech translation with discrete units
  • This paper by Lee et al. from Facebook AI in 2022 presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
  • They tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech.
  • When target text transcripts are available, they design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
  • Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, S2ST’s performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of their system for translation between unwritten languages.
  • Audio samples
Textless Speech Emotion Conversion using Discrete and Decomposed Representations
  • Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity.
  • This paper by Kreuk et al. from Facebook AI in 2021 casts the problem of emotion conversion as a spoken language translation task. They use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
  • First, they modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units.
  • Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows them to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc.
  • They demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. They rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
  • Facebook AI post
  • Code
Generative Spoken Dialogue Language Modeling
  • This paper by Nguyen et al. from Facebook AI in 2022 introduces dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels.
  • It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking.
  • Facebook AI post
  • Code
textless-lib: a Library for Textless Spoken Language Processing
  • Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources.
  • This paper by Kharitonov et al. from Facebook AI in 2022 introduces textless-lib, a PyTorch-based library aimed at facilitating research in this area. They describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation.
  • They believe that textless-lib substantially simplifies research in the textless setting and will be handy not only for speech researchers but also for the NLP community at large.
  • Facebook AI post
  • Code
Self-Supervised Speech Representation Learning: A Review
  • Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available.
  • Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech.
  • This paper by Mohamed et al. from Facebook AI in 2022 reviews the current approaches in the field for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, they review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Masked Autoencoders that Listen
  • This paper by Huang et al. from Facebook AI and CMU in 2022 introduces Audio-MAE, a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Audio-MAE learns to reconstruct masked spectrogram patches from audio recordings and achieves state-of-the-art performance on six audio and speech classification tasks.
  • Following the Transformer encoder-decoder design in MAE, Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
  • The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. They find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands.
  • They then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
  • They draw four interesting observations:
    • A simple MAE approach works surprisingly well for audio spectrograms.
    • It is possible to learn stronger representations with local self-attention in the decoder.
    • They show that masking can be applied to both pre-training and fine-tuning, improving accuracy and reducing training computation. The optimal strategy depends on the nature of the data (audio, image, etc.) and the learning type (self-/supervised).
    • The best performance can be achieved by pre-training and fine-tuning under the same modality, without reliance on cross-modality transfer learning.
  • Code and models.
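  • The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patch names, the 0.8 masking ratio, the seed, and the helper names `random_mask` and `restore_order` are all assumptions made for illustration, and the encoder itself is replaced by an identity mapping.

```python
import random

def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible (kept) and masked sets."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    keep = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return keep, masked

def restore_order(encoded_visible, keep, num_patches, mask_token):
    """Build the decoder input: visible tokens in place, mask tokens elsewhere."""
    full = [mask_token] * num_patches
    for token, idx in zip(encoded_visible, keep):
        full[idx] = token
    return full

# Hypothetical spectrogram patches; a real model would use patch embeddings.
patches = [f"p{i}" for i in range(10)]
keep, masked = random_mask(len(patches), mask_ratio=0.8)
visible = [patches[i] for i in keep]  # only these would go through the encoder
decoder_input = restore_order(visible, keep, len(patches), "[MASK]")
```

  • Because the encoder only sees the visible patches, its cost scales with the kept fraction rather than with the full spectrogram length.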
Robust Speech Recognition via Large-Scale Weak Supervision
  • This paper by Radford et al. from OpenAI in 2022 proposes Whisper, a model trained to predict large amounts of transcripts of audio on the internet and studies its capabilities.
  • Whisper suggests that scaling weakly supervised pretraining has been underappreciated so far in speech recognition research. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results, but in a zero-shot transfer setting without the need for any fine-tuning.
  • When compared to humans, the models approach their accuracy and robustness.
  • What is important to note is that Whisper achieves stellar results without the need for self-supervision and self-training techniques that have been a mainstay of recent large-scale speech recognition work and demonstrates how training on a large and diverse supervised dataset and focusing on zero-shot transfer can significantly improve the robustness of a speech recognition system.
  • Project page.
AudioGen: Textually Guided Audio Generation
  • This paper by Kreuk et al. from FAIR and the Hebrew University of Jerusalem in 2022 proposes AudioGen, which tackles the problem of generating audio samples conditioned on descriptive text captions.
  • AudioGen is an auto-regressive generative model that operates on a learnt discrete audio representation and generates audio samples conditioned on text inputs.
  • The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating ‘objects’ can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models.
  • Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, they propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. They curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points.
  • For faster inference, they explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. They apply classifier-free guidance to improve adherence to text.
  • Compared to the evaluated baselines, AudioGen performs better on both objective and subjective metrics. Finally, they explore the ability of the proposed method to generate audio continuation conditionally and unconditionally.
  • Audio samples.
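  • Classifier-free guidance, mentioned above, combines a text-conditioned and an unconditioned prediction at sampling time. The sketch below shows the usual linear combination on next-token logits; the function names, the toy logits, and the guidance scale are assumptions for illustration and are not taken from the AudioGen code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(cond, uncond, scale):
    """Move from the unconditional logits towards the conditional ones.

    scale = 0 ignores the text, scale = 1 is plain conditional sampling,
    scale > 1 exaggerates what the text condition contributes.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Toy two-token vocabulary: the text condition favours token 0.
cond = [2.0, 0.0]
uncond = [1.0, 1.0]
plain = softmax(cond)
guided = softmax(guided_logits(cond, uncond, scale=3.0))
```

  • With these toy numbers, the guided distribution puts more mass on token 0 than plain conditional sampling does, which is the intended effect of improved adherence to the text.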
AudioLM: a Language Modeling Approach to Audio Generation
  • The following summary has been contributed by Zhibo Zhang.
  • This paper by Borsos et al. from Google Research in 2022 proposes a generative language modeling approach to synthesize audio that is consistent and of high quality.
  • The method proposed contains the following stages:
    • The tokenization stage that maps the single-channel audio sequence into acoustic tokens and semantic tokens. Specifically, the SoundStream codec (Zeghidour et al., 2021) is adopted to produce acoustic tokens and the w2v-BERT model (Chung et al., 2021) is used to produce semantic tokens using intermediate layer representations. The acoustic tokens and the semantic tokens ensure high quality and long-term consistency of the generated audio, respectively.
    • The hierarchical modeling stage that is composed of the following three steps, as indicated in the illustration figure by Borsos et al.:
      • Autoregressive modeling on the semantic tokens. This step is for learning long-term temporal structure.
      • Coarse acoustic modeling conditioned on the acoustic tokens from the previous time steps that are produced by the first \(Q'\) SoundStream quantizers. This step is for capturing high-level acoustic properties.
      • Fine acoustic modeling conditioned on both the coarse tokens of all time steps and the fine tokens (from the last \(Q - Q'\) quantizers) of the previous time steps. This step is for better capturing fine acoustic details.
  • At inference time, AudioLM can be used to:
    • Generate audio with diverse content, various speakers and acoustic conditions when there are no conditional restrictions.
    • Generate audio with the same content but various speaker identities when conditioned on given semantic tokens.
    • Generate continuations of the audio given an acoustic prompt.
  • Empirically, the authors trained the AudioLM components on the unlab-60k train split of the Libri-Light dataset. To validate the roles of the semantic tokens and the acoustic tokens:
    • The authors conducted acoustic generation experiments conditioned on semantic tokens. Automatic speech recognition was performed on the generated audio, and the low Word Error Rate shows that the linguistic content is captured mostly by the semantic tokens.
    • The authors also conducted speaker classification on the generated audio. A low classification accuracy suggests that the semantic tokens lack information about speaker identities.
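  • The split between coarse and fine acoustic tokens follows from residual quantization: each quantizer encodes what the previous ones left over, so the first \(Q'\) tokens carry most of the signal. The sketch below is a deliberately simplified scalar version; the codebooks, the input value, and the function names are made up for illustration and do not come from SoundStream.

```python
def residual_quantize(x, codebooks):
    """Quantize a scalar with a cascade of codebooks.

    Each stage picks the entry closest to the residual left by earlier stages.
    """
    tokens, residual = [], x
    for codebook in codebooks:
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        tokens.append(idx)
        residual -= codebook[idx]
    return tokens

def reconstruct(tokens, codebooks):
    """Sum the selected entries; passing fewer tokens uses only the first stages."""
    return sum(cb[t] for t, cb in zip(tokens, codebooks))

# Hypothetical codebooks with progressively finer resolution (Q = 3).
codebooks = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]
x = 0.72
tokens = residual_quantize(x, codebooks)
coarse_error = abs(x - reconstruct(tokens[:1], codebooks))  # first Q' = 1 stage only
full_error = abs(x - reconstruct(tokens, codebooks))        # all Q stages
```

  • In this toy case the coarse token alone gives a rough reconstruction and the remaining tokens refine it, mirroring the coarse-then-fine modeling order.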

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
  • This paper by Furu Wei’s group from Microsoft in ACL 2022 builds upon T5 (Text-To-Text Transfer Transformer) by Raffel et al. (2020) and proposes a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning.
  • The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, they pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, they propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder.
  • Extensive evaluations show the superiority and versatility of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition (ASR), speech synthesis (TTS), voice conversion (VC), speech translation (ST), speech enhancement (SE), and speaker identification (SID).
  • Huggingface spaces demos:
Scaling Speech Technology to 1,000+ Languages
  • Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world.
  • This paper by Pratap et al. from Meta AI in 2023 introduces the Massively Multilingual Speech (MMS) project, which increases the number of supported languages by 10-40x, depending on the task.
  • The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. They built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages.
  • Forced alignment determines which parts of the audio correspond to which parts of the text. They employ a Scalable Forced Alignment step, using the following tweaks:
    1. Generating Posterior Probabilities: forced alignment requires posterior probabilities from an acoustic model which they use for alignment. This acoustic model is a Transformer which requires substantial amounts of memory to store activations which makes it infeasible to use for long audio files. As a workaround, they chunk the audio files into 15 second segments, generate posterior probabilities for each audio frame using the alignment model, and then concatenate these posterior probabilities into a single matrix again. The acoustic model is trained with Connectionist Temporal Classification (CTC).
    2. Efficient Forced Alignment on GPUs: they implemented a GPU version that computes the Viterbi path in a memory efficient way. Storing all \(O(T \times L)\) forward values for the Viterbi algorithm is infeasible on GPUs due to memory constraints. They therefore only store forward values for the current and the previous time-step and regularly transfer the computed backtracking matrices to CPU memory. This reduces the required GPU memory to \(O(L)\) compared to \(O(T \times L)\) and enables forced alignment for very long audio.
    3. Robust Alignment for Noisy Transcripts: they introduce a star token ⟨∗⟩ to which audio segments can be mapped if there is no good alternative in the text.
  • They also create a labeled dataset of speech audio paired with corresponding transcriptions in 1,107 languages by aligning New Testament texts obtained from online sources, using the following steps:
    1. Download and preprocess both the speech audio and the text data.
    2. Apply a scalable alignment algorithm which can force align very long audio files with text and do this for data in 1000+ languages in the following steps.
    3. Initial Data Alignment: they train an initial alignment model using existing multilingual speech datasets covering 8K hours of data in 127 languages and use this model to align data for all languages.
    4. Improved Data Alignment: they train a second alignment model on the newly aligned data for which the original alignment model has high confidence and generate the alignments again. The new alignment model supports 1,130 languages and 31K hours of data including the data used in step 3.
    5. Final data filtering: they filter the low-quality samples of each language based on a cross-validation procedure. For each language, they train a monolingual ASR model on half of the aligned data to transcribe the other half of the data. They retain only samples for which the transcriptions are of acceptable quality.
  • Experiments show that their multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
  • The following figure from the paper shows (top) MMS-lab (paired data): amount of speech data across languages – they show the size of the training data sets and name some of the 1,107 languages; (bottom) MMS-unlab (unpaired data): amount of speech data across languages – they show the size of the training data sets and name a few of the 3,809 languages.
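  • The memory-saving idea behind the forced alignment step (keep forward scores only for the previous and current time-step, and set the backtracking pointers aside) can be sketched as follows. This is a simplified monotonic alignment without CTC blank or star tokens, and it is not the MMS implementation; `forced_align`, the toy posteriors, and the plain Python list standing in for CPU memory are assumptions for illustration.

```python
import math

def forced_align(log_probs, labels):
    """Align T frames to L labels (T >= L), moving monotonically through labels.

    log_probs[t][c] is the log posterior of class c at frame t.
    Returns, for every frame, the index of the label position it is assigned to.
    """
    T, L = len(log_probs), len(labels)
    NEG = float("-inf")
    prev = [NEG] * L
    prev[0] = log_probs[0][labels[0]]
    backpointers = []  # stands in for the matrices transferred to CPU memory
    for t in range(1, T):
        cur = [NEG] * L
        back = [0] * L
        for s in range(L):
            stay = prev[s]
            advance = prev[s - 1] if s > 0 else NEG
            if advance > stay:
                best, back[s] = advance, s - 1
            else:
                best, back[s] = stay, s
            if best > NEG:
                cur[s] = best + log_probs[t][labels[s]]
        backpointers.append(back)
        prev = cur  # only two rows of forward scores are alive at any time
    # Backtrack from the last label position to recover the path.
    path = [L - 1]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path

# Toy posteriors over two classes for four frames.
frames = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in frames]
alignment = forced_align(log_probs, labels=[0, 1])
```

  • The forward scores need \(O(L)\) memory here, while the backpointers still grow with \(T\), which is why moving them off the accelerator matters for very long recordings.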

Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
  • This paper authored by Gandhi et al. from Hugging Face introduces Distil-Whisper, a distilled variant of the Whisper automatic speech recognition model.
  • Distil-Whisper is significantly smaller and faster (5.8 times faster, 51% fewer parameters) than the original Whisper model, maintaining similar performance (within 1% WER) on out-of-distribution test data in a zero-shot setting.
  • The authors use a large-scale pseudo-labelling approach to assemble an open-source dataset for training, selecting high-quality pseudo-labels based on a word error rate (WER) heuristic.
  • The motivation was the fact that OpenAI’s Whisper yields astonishing accuracy for most audio, but it’s too slow and expensive for most production use cases. In addition, it has a tendency to hallucinate.
  • Encoding takes \(O(1)\) forward passes per audio segment, while autoregressive decoding takes \(O(N)\) passes for \(N\) output tokens. This implies that reducing decoder layers is \(N\) times more effective. They kept the whole encoder, but utilized only two decoder layers.
  • The encoder is frozen during distillation to ensure Whisper’s robustness to noise is kept.
  • The model demonstrates improved robustness against hallucination errors in long-form audio, and its design allows it to be paired with Whisper for speculative decoding, doubling the inference speed while maintaining output accuracy.
  • The paper highlights the utility of large-scale pseudo-labelling in speech recognition and the effectiveness of the WER threshold filter in distillation. The training and inference code, along with the models, are made publicly available by the authors.
  • To make sure Distil-Whisper does not inherit hallucinations, they filtered out all data samples whose pseudo-labels exceeded a certain WER threshold. By doing so, they were able to reduce hallucinations and actually beat the teacher on long-form audio evaluation.
  • Code; Hugging Face Page
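  • A WER-based filter of the kind described above can be sketched in a few lines. This is an illustration under assumptions, not the released training code: it assumes each sample carries a reference transcript next to the pseudo-label, it compares raw whitespace-split words, and the sample data, field names, and 0.1 threshold are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

def filter_pseudo_labels(samples, threshold):
    """Keep samples whose pseudo-label stays within the WER threshold."""
    return [s for s in samples
            if word_error_rate(s["reference"], s["pseudo_label"]) <= threshold]

# Hypothetical samples: the second pseudo-label contains a substitution.
samples = [
    {"reference": "the cat sat", "pseudo_label": "the cat sat"},
    {"reference": "the cat sat", "pseudo_label": "the dog sat"},
]
kept = filter_pseudo_labels(samples, threshold=0.1)
```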



“Why Should I Trust You?” Explaining the Predictions of Any Classifier
  • Trust is crucial for effective human interaction with machine learning systems, and explaining individual predictions is important in assessing trust.
  • This paper by Ribeiro et al. from Guestrin’s lab at UW in 2016 proposes LIME, a novel model-agnostic, modular, and extensible explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. They further introduce SP-LIME, a method to explain models by selecting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem and providing a global view of the model to users.
  • They demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects.
  • Their explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, getting insights into predictions, and detecting why a classifier should not be trusted.
  • LIME - Local Interpretable Model-Agnostic Explanations blog post.
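  • The core idea of learning an interpretable model locally around a prediction can be sketched as a weighted linear fit over on/off perturbations of the instance's features. The sketch is not the LIME library: it enumerates all perturbations instead of sampling, omits the sparsity step, and the names `lime_explain` and `solve`, the exponential kernel on the number of switched-off features, and the toy black-box function are assumptions for illustration.

```python
import math
from itertools import product

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lime_explain(predict, num_features, kernel_width=2.0):
    """Fit a locally weighted linear surrogate around the all-features-on instance."""
    rows, targets, weights = [], [], []
    for z in product([0, 1], repeat=num_features):  # every on/off perturbation
        distance = num_features - sum(z)            # number of features switched off
        rows.append(list(z) + [1.0])                # last column is the intercept
        targets.append(predict(z))
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    d = num_features + 1
    # Weighted normal equations: (X^T W X) beta = X^T W y
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d)]
         for i in range(d)]
    b = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets)) for i in range(d)]
    coefs = solve(A, b)
    return coefs[:-1], coefs[-1]

# Hypothetical black box: feature 0 helps, feature 1 hurts, feature 2 is ignored.
black_box = lambda z: 2.0 * z[0] - 1.0 * z[1] + 0.5
importances, intercept = lime_explain(black_box, num_features=3)
```

  • For a black box that is itself linear, the surrogate recovers the true coefficients; for a non-linear one, the kernel weights make the fit most faithful near the explained instance.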
SPICE: Semantic Propositional Image Caption Evaluation
  • There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment.
  • This paper by Anderson et al. from Australian National University and Macquarie University in ECCV 2016 hypothesizes that semantic propositional content is an important component of human caption evaluation, and proposes a new automated caption evaluation metric defined over scene graphs, coined SPICE.
  • The following figure from the paper illustrates SPICE’s main principle which uses semantic propositional content to assess the quality of image captions. Reference and candidate captions are mapped through dependency parse trees (top) to semantic scene graphs (right)— encoding the objects (red), attributes (green), and relations (blue) present. Caption quality is determined using an F-score calculated over tuples in the candidate and reference scene graphs.

  • Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR).
  • Furthermore, SPICE can answer questions such as ‘which caption-generator best understands colors?’ and ‘can caption-generators count?’
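  • The F-score over scene-graph tuples can be sketched as follows, assuming the captions have already been parsed into object, attribute, and relation tuples. The tuples below are invented, and the sketch uses exact matching only, which is a simplification of how the metric matches tuples.

```python
def spice_f_score(candidate_tuples, reference_tuples):
    """F1 between the sets of tuples from candidate and reference scene graphs."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical tuples: objects are 1-tuples, attributes 2-tuples, relations 3-tuples.
reference = [("girl",), ("girl", "young"), ("girl", "standing"),
             ("court",), ("court", "tennis"), ("girl", "on", "court")]
candidate = [("girl",), ("court",), ("girl", "on", "court"), ("court", "grass")]
score = spice_f_score(candidate, reference)
```

  • Restricting the tuples to one type, for example only colour attributes, gives the kind of targeted question ("which caption-generator best understands colors?") mentioned above.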


A Unified Approach to Interpreting Model Predictions
  • While various methods have recently been proposed to help users interpret the predictions of complex models, it is often unclear how these methods are related and when one method is preferable over another.
  • This paper by Lundberg and Lee from UW in NeurIPS 2017 seeks to address this problem and presents a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations).
  • SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties.
  • The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, they present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
  • Github repo.
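  • The classic Shapley values that SHAP builds on can be computed exactly for a handful of features by enumerating coalitions. The sketch below is not the SHAP library and is exponential in the number of features; the value function, which scores a coalition of "present" features for a toy model, is an assumption for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, num_features):
    """Exact Shapley values: weighted average marginal contribution of each feature."""
    n = num_features
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical model output when only the features in `present` are known:
# feature 0 contributes 3 on its own; features 1 and 2 contribute 2 only together.
def value_fn(present):
    return 3.0 * (0 in present) + 2.0 * (1 in present and 2 in present)

phi = shapley_values(value_fn, num_features=3)
```

  • The attributions sum to the difference between the full prediction and the empty-coalition baseline, and the interaction between features 1 and 2 is split evenly between them.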
mixup: Beyond Empirical Risk Minimization
  • Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples.
  • This paper by Zhang et al. from MIT and FAIR in ICLR 2018 proposes mixup, a regularizer, which trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples.
  • Their experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands, and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures.
  • They also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.
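  • The convex combination at the heart of mixup takes only a few lines. The sketch below mixes a single pair of examples with one-hot labels; the mixing coefficient is drawn from a Beta(α, α) distribution, and the value α = 0.2, the toy inputs, and the function name are illustrative assumptions.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Return a convex combination of two examples and of their labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy examples with one-hot labels for classes 0 and 1.
rng = random.Random(0)
x_mixed, y_mixed, lam = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                              [3.0, 4.0, 0.0], [0.0, 1.0], rng=rng)
```

  • The mixed label is a soft target whose entries still sum to one, so it can be used directly with a cross-entropy loss.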
Multimodal Machine Learning: A Survey and Taxonomy
  • Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together.
  • Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential.
  • This paper by Baltrusaitis et al. from Microsoft and Louis-Philippe Morency’s lab at CMU surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy instead of focusing on specific multimodal applications.
  • This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
  • Their taxonomy goes beyond the typical early and late fusion split and identifies broader challenges that are faced by multimodal machine learning, namely:
    1. Representation: A first fundamental challenge is learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations. For example, language is often symbolic while audio and visual modalities will be represented as signals.
    2. Translation: A second challenge addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there exist a number of correct ways to describe an image, and one perfect translation may not exist.
    3. Alignment: A third challenge is to identify the direct relations between (sub)elements from two or more different modalities. For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.
    4. Fusion: A fourth challenge is to join information from two or more modalities to perform a prediction. For example, for audio-visual speech recognition, the visual description of the lip motion is fused with the speech signal to predict spoken words. The information coming from different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities.
    5. Co-learning: A fifth challenge is to transfer knowledge between modalities, their representation, and their predictive models. This is exemplified by algorithms of co-training, conceptual grounding, and zero-shot learning. Co-learning explores how knowledge learned from one modality can help a computational model trained on a different modality. This challenge is particularly relevant when one of the modalities has limited resources (e.g., annotated data).
  • The table below from the paper offers a summary of applications enabled by multimodal machine learning. For each application area they identify the core technical challenges that need to be addressed in order to tackle it.


Representation Learning with Contrastive Predictive Coding
  • While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence.
  • This paper by Oord et al. from Google in 2019 proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which they call Contrastive Predictive Coding (CPC), a framework for extracting compact latent representations to encode predictions over future observations.
  • The key insight of CPC is to learn such representations by predicting the future in latent space by using powerful autoregressive models.
  • CPC uses a probabilistic contrastive loss based on NCE, which they call InfoNCE; both the encoder and the autoregressive model are trained jointly to optimize it. InfoNCE induces the latent space to capture information that is maximally useful to predict future samples.
  • CPC combines autoregressive modeling and noise-contrastive estimation with intuitions from predictive coding to learn abstract representations in an unsupervised fashion.
  • It also makes the model tractable by using negative sampling. While most prior work has focused on evaluating representations for a particular modality, they demonstrate that CPC is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
  • The figure below from the paper offers an overview of Contrastive Predictive Coding, the proposed representation learning approach. Although this figure shows audio as input, they use the same setup for images, text, and reinforcement learning.

  • They test these representations in a wide variety of domains (audio, images, natural language, and reinforcement learning) and achieve strong or state-of-the-art performance when used as stand-alone features.
  • The simplicity and low computational requirements to train the model, together with the encouraging results in challenging reinforcement learning domains when used in conjunction with the main loss are exciting developments towards useful unsupervised learning that applies universally to many more data modalities.
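  • The InfoNCE loss can be sketched as a softmax cross-entropy in which the true future sample must be picked out from a set of negatives. For brevity the sketch scores candidates with a plain dot product against the context vector rather than a learned projection; the vectors and the function name are assumptions for illustration.

```python
import math

def info_nce(context, positive, negatives):
    """Negative log-probability of the positive among all candidates."""
    def score(z):
        return sum(c * v for c, v in zip(context, z))
    scores = [score(positive)] + [score(n) for n in negatives]
    m = max(scores)  # subtract the max for a stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[0]

# Toy 2-d latents: the context vector points along the first axis.
context = [1.0, 0.0]
aligned = info_nce(context, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(context, positive=[0.0, 1.0], negatives=[[1.0, 0.0], [-1.0, 0.0]])
chance = info_nce(context, positive=[1.0, 0.0], negatives=[[1.0, 0.0]] * 3)
```

  • The loss is low when the context predicts the positive well, and equals the log of the number of candidates when the positive is indistinguishable from the negatives.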


Modality Dropout for Improved Performance-driven Talking Faces
  • This paper by Abdelaziz et al. from Apple in 2020 introduces the idea of Modality Dropout (MDO). They begin by describing a novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information.
  • To ensure that the proposed model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training.
  • Their trained model runs in real-time on resource-limited hardware (e.g., a smartphone), is user agnostic, and does not depend on a potentially error-prone transcription of the speech.
  • They use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.
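  • Modality dropout can be sketched as a per-example choice between audio-only, video-only, and audiovisual input. Representing a dropped modality by zeroed features, the drop probabilities, and the function name are assumptions for illustration, not details taken from the paper.

```python
import random

def modality_dropout(audio, video, p_drop_audio, p_drop_video, rng=random):
    """Zero out at most one modality so the model cannot rely on a single stream."""
    r = rng.random()
    if r < p_drop_audio:
        audio = [0.0] * len(audio)   # video-only example
    elif r < p_drop_audio + p_drop_video:
        video = [0.0] * len(video)   # audio-only example
    return audio, video              # otherwise audiovisual

# Toy feature vectors for one training example.
rng = random.Random(0)
audio_feat, video_feat = [1.0, 2.0], [3.0, 4.0, 5.0]
batch = [modality_dropout(audio_feat, video_feat, 0.3, 0.3, rng=rng)
         for _ in range(200)]
```

  • Raising one of the drop probabilities shifts how much the model has to rely on the other modality during training.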
Augmentation adversarial training for self-supervised speaker recognition
  • This paper by Huh et al. from Oxford, Naver Corporation, and Shinji Watanabe’s lab at JHU in the Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS 2020 seeks to train robust speaker recognition models without speaker labels.
  • Recent works on unsupervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information.
  • They propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied.
  • Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of their self-supervised models far exceeds that of humans.
  • The following figure from the paper illustrates an overview of the training strategy. The index notation for the inputs and the embeddings is consistent with the equations, i.e., \(i, j, k\) refer to the \(j^{th}\) segment of the \(i^{th}\) utterance, with augmentation type \(k\).

BERTScore: Evaluating Text Generation with BERT
  • This paper by Zhang et al. from Cornell Tech, Cornell University, and ASAPP Inc. in ICLR 2020 proposes BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, they compute token similarity using contextual embeddings.
  • They evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics.
  • Finally, they use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
  • The following figure from the paper offers an illustration of the computation of the recall metric \(R_{BERT}\). Given the reference \(x\) and candidate \(\hat{x}\), they compute BERT embeddings and pairwise cosine similarity. They highlight the greedy matching in red, and include the optional idf importance weighting.
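  • The greedy matching behind BERTScore can be sketched with cosine similarities between token embeddings: each reference token is matched to its most similar candidate token for recall, and vice versa for precision. The 2-d vectors below stand in for contextual embeddings, and idf weighting is left out; both are simplifications for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bert_score(reference, candidate):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings: the candidate covers only the first reference token.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
precision, recall, f1 = bert_score(reference, candidate)
```

  • Here every candidate token has a perfect match, so precision is 1, while the unmatched reference token pulls recall down to 0.5.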


Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
  • All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
  • This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 compares two data-centric AI methods to improve performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data.
  • Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.
Learning Transferable Visual Models From Natural Language Supervision
  • This paper by Radford et al. from OpenAI introduces CLIP, a pre-training task which efficiently learns visual concepts from natural language supervision. CLIP jointly trains separate vision and language encoders with a contrastive loss that pulls matching image-text pairs closer while pushing apart mismatched pairs during pretraining.
  • CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
  • CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in their dataset. They then use this behavior to turn CLIP into a zero-shot classifier. They convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
  • It can rival the generalization of ImageNet SoTA models (since it was pre-trained on 400M noisy image-text pairs) and is thus typically used for zero-shot image classification and zero-shot cross-modal searches.
  • OpenAI article.
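  • The contrastive objective and the zero-shot classification step described above can be sketched as follows. This is a minimal NumPy sketch, not OpenAI's implementation: random vectors stand in for encoder outputs, and the function names and the temperature value of 0.07 are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_classify(image_emb, class_text_emb):
    # Pick the class whose caption embedding is most similar to each image.
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))                    # stand-in caption embeddings
aligned = text_emb + 0.01 * rng.normal(size=(4, 8))   # images close to their captions
shuffled = aligned[::-1]                              # deliberately mismatched pairs

loss_aligned = contrastive_loss(aligned, text_emb)
loss_shuffled = contrastive_loss(shuffled, text_emb)
predictions = zero_shot_classify(aligned, text_emb)
```

  • Matched pairs give a lower loss than mismatched ones, and classification reduces to a nearest-caption lookup in the shared embedding space.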
Zero-Shot Text-to-Image Generation
  • Text-to-image generation (i.e., language-guided image generation) has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training.
  • This paper by Ramesh et al. from OpenAI introduces DALL-E which offers a simple approach for text-to-image generation based on an autoregressive transformer which models the text and image tokens as a single stream of data. DALL-E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively.
  • They find that sufficient data and scale can lead to improved generalization, both in terms of zero-shot performance relative to previous domain-specific approaches, and in terms of the range of capabilities that emerge from a single generative model. Their findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task.
  • OpenAI article.
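  • The single-stream layout described above (256 text tokens followed by 1024 image tokens) can be sketched as follows. This is a hedged illustration only: the vocabulary sizes, the id-offset scheme, and the padding id are assumptions made for the example, not details taken from these notes.

```python
import numpy as np

TEXT_LEN, IMAGE_LEN = 256, 1024          # token counts stated above
TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192    # assumed vocabulary sizes

def build_stream(text_tokens, image_tokens, pad_id=0):
    # Pad or truncate the caption to a fixed length (pad_id is a placeholder).
    text = list(text_tokens)[:TEXT_LEN]
    text = text + [pad_id] * (TEXT_LEN - len(text))
    # Shift image-token ids so both modalities share one id space.
    image = [TEXT_VOCAB + t for t in image_tokens]
    assert len(image) == IMAGE_LEN
    return np.array(text + image)

def next_token_pairs(stream):
    # Autoregressive training pairs: predict token t+1 from tokens 0..t.
    return stream[:-1], stream[1:]

rng = np.random.default_rng(0)
caption = rng.integers(1, TEXT_VOCAB, size=12)              # stand-in caption tokens
image_codes = rng.integers(0, IMAGE_VOCAB, size=IMAGE_LEN)  # stand-in image codes
stream = build_stream(caption, image_codes)
inputs, targets = next_token_pairs(stream)
```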
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
  • This paper by Kim et al. from NAVER AI and Kakao in 2021 introduces Vision-and-Language Transformer (ViLT) that seeks to improve performance on various joint vision-and-language downstream tasks using Vision-and-Language Pre-training (VLP).
  • CLIP and Hugging Face’s VisionEncoderDecoder utilize image and language encoders learned/trained in isolation and align/glue them using either (i) cross-entropy loss that utilizes cross-attention (in the case of VisionEncoderDecoder), or (ii) contrastive loss (in the case of CLIP). This is shown in the figure below from Prithvi Da which summarizes the aforementioned approaches.

  • The downsides of the above approach are poor image-text alignment, a huge data appetite, and longer training time. This approach is useful for creating a downstream generative model to tackle applications such as cross-modal retrieval, say OCR or image captioning or content-based image retrieval (CBIR) or even text2image (using DALL-E or CLIPDraw). However, derived/advanced multimodal tasks involving vision and language, such as Natural Language for Visual Reasoning (NLVR), Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Visual Navigation, etc., are much more complicated in nature than the aforementioned tasks. The diagram below from Prithvi Da summarizes the hierarchy of image-based tasks.

  • In order to tackle derived tasks in a similar way, image and language data need to be trained jointly (rather than in isolation) in a “mixed-modal” fashion with a combination of image-level loss, language-level loss, and alignment loss. This is the underlying idea behind VLP. The diagram below from Prithvi Da contrasts the two approaches: aligning/gluing independently trained vision and language encoders (with either cross-entropy loss or contrastive loss) vs. training both encoders jointly.

  • Current approaches to VLP heavily rely on image feature extraction processes using convolutional visual embedding networks (e.g., Faster R-CNN and ResNets), which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). This is problematic in terms of both efficiency/speed, in that extracting input features requires much more computation than the multimodal interaction steps; and expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary.
  • ViLT seeks to remedy the above two issues by presenting a minimal VLP model, which is monolithic in that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed. In other words, the unique selling point of ViLT is that while most VLP models rely on object detectors, CNNs or transformers for feature extraction (e.g., UNiTER, LXMERT and VisualBERT need Faster R-CNN for object detection), ViLT stands out from the crowd by removing the need for object detectors. ViLT avoids heavyweight image encoders by directly embedding low-level pixel data with a single-layer projection, and achieves similar results with reduced complexity, as shown in the diagram below:

  • Self-supervision is accomplished using (i) Image Text Matching (ITM) loss and (ii) Masked Language Modeling (MLM) loss. ITM loss is an alignment loss that encompasses cross-modality interaction between image and text. ITM requires positive and negative pairs. For text, ViLT simply reuses the MLM objective used in BERT.
  • ViLT is pre-trained on four datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions. They evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, they use VQAv2 and NLVR2; for retrieval, they use MSCOCO and Flickr30K (F30K).
  • Finally, they show that ViLT is over 10x faster than previous VLP models, yet with competitive or better downstream task performance.
  • The key takeaway in this paper is that VLP needs to focus more on the multi-modality interactions aspect inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders. ViLT-B/32 is a proof of concept that efficient VLP models free of convolution and region supervision can still be competent.
  • Code and pre-trained weights; Hugging Face docs; ViLT tutorials/notebooks.
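  • The single-layer projection of raw pixel patches described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the patch size of 32 and hidden size of 768 are illustrative, the weights are random rather than trained, and the position and modality-type embeddings a real model would add are omitted.

```python
import numpy as np

def patchify(image, patch):
    # image: (H, W, C) -> (num_patches, patch * patch * C), row-major over the patch grid.
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
patch, hidden = 32, 768                      # illustrative sizes (assumption)
image = rng.random((224, 224, 3))
W = rng.normal(scale=0.02, size=(patch * patch * 3, hidden))  # the single projection layer
b = np.zeros(hidden)

visual_tokens = patchify(image, patch) @ W + b                # (49, 768), no CNN or detector
text_tokens = rng.normal(size=(10, hidden))                   # stand-in for word embeddings
sequence = np.concatenate([text_tokens, visual_tokens], axis=0)  # one stream for the transformer
```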
MLIM: Vision-and-language Model Pre-training With Masked Language and Image Modeling
  • Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked ImageRegion Modeling (MIRM). Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models.
  • This paper by Arici et al. from Amazon in 2021 presents Masked Language and Image Modeling (MLIM) for VLP. MLIM is pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: Masked Language Modeling (MLM) loss (as in BERT) for text, and image reconstruction (RECON) loss for image, coupled with Modality Aware Masking (MAM). MAM determines the masking probability and applies masking to both word and image embeddings. The MLM task, as in BERT, predicts the masked words from the available words and image regions, using a two-layer MLP MLM head that outputs logits over the vocabulary. The MLM loss is the negative log-likelihood of the masked words. The RECON loss is an average of the pixel-wise sum of squared errors (SSE). Both image and word masking are realized by replacing an embedding with the embedding of [MASK]. This way, the transformer layers recognize [MASK]’s embedding as a special embedding that needs to be “filled in”, independent of the modality, by attending to other vectors in the layer inputs.
  • Note that unlike other architectures (LXMERT, UNiTER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector. Instead, MLIM uses a shallow CNN as an image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking friendly. MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
  • MLIM uses no specific alignment loss, but instead proposes Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, they present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
  • Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-modality by employing Modality Dropout (MDO). MDO improves fine-tuning by randomly dropping one of the modalities. Similar to MAM, MDO operates in one of three modes on a micro-batch: text-only, image-only, and image-text mode.
  • The authors also tried the ITM loss used in ViLT. However, using RECON instead of the ITM loss yields better PR AUC, and adding the ITM loss on top of MLM and RECON does not change the performance.
  • The key takeaways from this paper are that MLIM is a simplified VLP method using MLM and RECON losses and MAM. They simplify loss function design, propose a shallow CNN-based image embedder to avoid heavyweight object detectors, and present an image decoder to enable the RECON loss. They believe VLP datasets (e.g., e-commerce datasets) are large enough to enable learning built-in image embedders during pre-training. While alignment-based loss functions are promising and help in learning contrastive features, finding good image-text pairs (especially negative pairs) becomes an issue and makes pre-training rely on pairing techniques. On the other hand, finer-grained objectives such as alignment and MIRM do not have ground truth. Masked Image-Region Modeling (MIRM) relies on RoI features and classes predicted by the object detector. Furthermore, MIRM tasks aim to “fill in” masked regions, whereas the proposed RECON task aims to reconstruct the whole image and is designed to get the best cross-modality interaction inside the transformer.
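  • The [MASK]-replacement step and the RECON loss described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function names, shapes, and random masks are illustrative assumptions, and the loss is computed here over the whole image (restricting it to masked areas would only change which pixels are averaged).

```python
import numpy as np

def apply_mask(embeddings, mask, mask_embedding):
    # Replace masked rows with the shared [MASK] embedding, for words or image regions alike.
    out = embeddings.copy()
    out[mask] = mask_embedding
    return out

def recon_loss(image, reconstruction):
    # Per-pixel sum of squared errors over channels, averaged over pixels.
    sse = ((image - reconstruction) ** 2).sum(axis=-1)
    return float(sse.mean())

rng = np.random.default_rng(0)
hidden = 16
mask_embedding = rng.normal(size=hidden)     # one [MASK] vector shared by both modalities
word_emb = rng.normal(size=(6, hidden))
image_emb = rng.normal(size=(9, hidden))

word_mask = np.array([0, 1, 0, 0, 1, 0], dtype=bool)
image_mask = rng.random(9) < 0.5

masked_words = apply_mask(word_emb, word_mask, mask_embedding)
masked_image = apply_mask(image_emb, image_mask, mask_embedding)

image = rng.random((8, 8, 3))
perfect = recon_loss(image, image)           # identical reconstruction
noisy = recon_loss(image, image + 0.1)       # constant error of 0.1 on 3 channels
```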

MURAL: Multimodal, Multi-task Retrieval Across Languages
  • This paper by Jain et al. from Google Research in EMNLP 2021 describes MURAL, a representation model for image–text matching that uses multitask learning applied to image–text pairs in combination with translation pairs covering 100+ languages.
  • While models like CLIP and ALIGN embed images and text in the same vector space, such solutions do not scale to languages other than English due to the lack of training data.
  • MURAL shows that training jointly using translation pairs helps overcome the scarcity of image–text pairs for many under-resourced languages and improves cross-modal performance.
  • MURAL consistently outperforms prior state-of-the-art models in multilingual image-to-text and text-to-image retrieval.
  • Additionally, when visualizing MURAL’s embeddings alongside LaBSE’s, it is interesting to observe hints of areal linguistics and contact linguistics in the text representations learned by using a multimodal model.
  • The diagram below shows the MURAL architecture (from the paper), which is based on the architecture of ALIGN but employed in a multitask fashion:

  • The MURAL paper shows that (i) training jointly on image–text pairs and translation pairs helps overcome the scarcity of data for low-resource languages, and (ii) training jointly also increases cross-modal performance.
Perceiver: General Perception with Iterative Attention
  • Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities.
  • This paper by Jaegle et al. from DeepMind in ICML 2021 introduces the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
  • The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. They show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio.
  • The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
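  • The asymmetric attention described above can be sketched as follows: a small latent array supplies the queries while the large input array supplies keys and values, so the cost grows linearly with the number of inputs. This is a minimal NumPy sketch with illustrative sizes and random, untrained weights; the latent self-attention blocks of the full model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, Wq, Wk, Wv):
    # Queries come from the small latent array; keys and values from the large input array.
    q, k, v = latents @ Wq, inputs @ Wk, inputs @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # shape (N, M): linear in input size M
    return softmax(scores, axis=-1) @ v       # shape (N, d): the latent bottleneck

rng = np.random.default_rng(0)
M, N, c, d = 5000, 64, 32, 16                 # illustrative sizes with M >> N
inputs = rng.normal(size=(M, c))
latents = rng.normal(size=(N, d))
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(c, d)) / np.sqrt(c)
Wv = rng.normal(size=(c, d)) / np.sqrt(c)

for _ in range(2):                            # iterative attention: re-query the same inputs
    latents = latents + cross_attend(latents, inputs, Wq, Wk, Wv)
```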
Multimodal Few-Shot Learning with Frozen Language Models
  • When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples.
  • This paper by Tsimpoukelli et al. from DeepMind in NeurIPS 2021 presents Frozen – a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language).
  • Using aligned image and caption data, they train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption.
  • The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings.
  • They demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
  • The following figure from the paper shows that gradients through a frozen language model’s self-attention layers are used to train the vision encoder:

On the Opportunities and Risks of Foundation Models
  • AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. The authors call these models foundation models to underscore their critically central yet incomplete character.
  • This report by Bommasani et al. from the Center for Research on Foundation Models (CRFM) at Stanford provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, there is currently no clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, the authors believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
  • The following figure from the paper illustrates the fact that a foundation model can centralize the information from all the data from various modalities. This one model can then be adapted to a wide range of downstream tasks.

CLIPScore: A Reference-free Evaluation Metric for Image Captioning
  • This paper by Hessel et al. from Allen AI and UW in EMNLP 2021 proposes CLIPScore, a new automatic evaluation metric for image captioning that uses CLIP to assess the compatibility between images and candidate captions without requiring reference captions.
  • A reference-augmented version, RefCLIPScore, achieves even higher correlation by combining CLIPScore with maximal reference cosine similarity. Analyses show CLIPScore captures different aspects of caption quality compared to text-only metrics, with CLIPScore being more focused on image-text compatibility. CLIPScore is sensitive to incorrect “hallucinated” captions where a noun has been swapped. It also correlates well with human ratings of image alt-text quality on Twitter.
  • Experiments on several standard captioning datasets (Flickr8K, Composite, Pascal-50S) show that CLIPScore achieves higher correlation with human judgments than previous reference-based metrics like CIDEr and SPICE.
  • For abstract clipart images and personality-based engaging captions, CLIPScore underperforms compared to reference-based metrics. For news image captions requiring richer context, reference-based metrics do better. Overall, for literal image description tasks, CLIPScore offers strong correlation without needing reference captions, complementing existing reference-based metrics. The authors recommend reporting CLIPScore along with text-only metrics like SPICE.
  • The following figure from the paper illustrates the following: (Top) CLIPScore uses CLIP to assess image-caption compatibility without using references, just like humans. (Bottom) This frees CLIPScore from the well-known shortcomings of n-gram matching metrics, which disfavor good captions with new words (top) and favor any captions with familiar words (bottom).
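  • The two metrics can be sketched as follows. This is a hedged sketch, not the reference implementation: the rescaling weight w = 2.5 and the use of a harmonic mean for RefCLIPScore are stated here from recollection of the paper and should be treated as assumptions, and the toy 2-D vectors stand in for CLIP embeddings.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_score(caption_emb, image_emb, w=2.5):
    # Reference-free score: rescaled cosine similarity, clipped at zero.
    return w * max(cosine(caption_emb, image_emb), 0.0)

def ref_clip_score(caption_emb, image_emb, reference_embs, w=2.5):
    # Harmonic mean of CLIPScore and the best caption-reference similarity (assumption).
    s = clip_score(caption_emb, image_emb, w)
    r = max(max(cosine(caption_emb, ref) for ref in reference_embs), 0.0)
    return 0.0 if s == 0.0 or r == 0.0 else 2.0 * s * r / (s + r)

image_emb = np.array([1.0, 0.0])
good_caption = np.array([0.8, 0.6])    # cosine 0.8 with the image
bad_caption = np.array([-1.0, 0.0])    # points away from the image
references = [np.array([0.8, 0.6])]    # one reference identical to the good caption

good = clip_score(good_caption, image_emb)
bad = clip_score(bad_caption, image_emb)
good_with_refs = ref_clip_score(good_caption, image_emb, references)
```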


DeepNet: Scaling Transformers to 1,000 Layers
  • This paper by Wang et al. from Microsoft Research in 2022 introduces DeepNet, a new method that allows training extremely deep Transformers with 1,000+ layers, an order-of-magnitude improvement over existing efforts, with theoretical justification.
  • DeepNet is fundamental, effective and simple. It can be used in any Transformer architecture (encoder, decoder, encoder-decoder), which covers almost all tasks across AI areas (language, vision, speech, multimodal, and beyond). It is not only for 1,000+ layer Transformers, but also important and effective for training existing large models (e.g., [24, 100] layers). It combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making it a preferred alternative for any Transformer model training.
  • At the core of DeepNet is a newly proposed normalization function (called DeepNorm) which modifies the residual connection in Transformers. DeepNorm has theoretical justification of bounding the model update by a constant, which makes stable training possible in a principled way. Only a few lines of code need to change to make it work in an existing Transformer implementation.
  • DeepNorm modifies the residual connection in the Transformer architecture by up-scaling it before performing layer normalization. It works alongside a dedicated initialization scheme based on Xavier initialization.
  • These two tricks lead to greater stability during the training which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1000 layers.
  • DeepNet’s 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points in a multilingual translation task with 7,482 translation directions.
  • Github repo.
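  • The up-scaled residual connection described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the constants alpha = (2N)^(1/4) and beta = (8N)^(-1/4) are the encoder-only (or decoder-only) values as recalled from the paper, the sublayer is a single random linear map standing in for attention or a feed-forward block, and layer normalization is shown without learned gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deepnorm_constants(num_layers):
    # Encoder-only (or decoder-only) setting, as recalled from the paper (assumption).
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def deepnorm_residual(x, sublayer, alpha):
    # Post-LN with an up-scaled residual: LN(alpha * x + f(x)).
    return layer_norm(alpha * x + sublayer(x))

alpha, beta = deepnorm_constants(1000)
rng = np.random.default_rng(0)
d = 16
# Xavier-style standard deviation, scaled down by beta.
W = rng.normal(size=(d, d)) * beta * np.sqrt(2.0 / (d + d))
x = rng.normal(size=(4, d))
y = deepnorm_residual(x, lambda h: h @ W, alpha)
```

  • With alpha above 1 and beta below 1, the residual path dominates each sublayer's contribution, which is the intuition behind the bounded model update.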
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
  • While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
  • This paper by Baevski et al. from Facebook in 2022 helps get us closer to general self-supervised learning by presenting data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self distillation setup using a standard Transformer architecture.
  • Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
  • Today’s self-supervised learning research almost always focuses on a single modality. As a result, researchers specializing in one modality often adopt a totally different strategy than those specializing in another. For text, researchers train algorithms to fill in blanks in sentences. Speech models, on the other hand, must learn an inventory of essential speech sounds, for example to predict missing sounds. In computer vision, models are frequently taught to assign similar representations to a color image of a cow and the same image flipped upside down, so that they associate the two far more closely than they would with an unrelated image like a duck. data2vec represents a new paradigm of holistic self-supervised learning, in which research advances benefit several modalities rather than just one.
  • For each modality, algorithms predict distinct units: pixels or visual tokens for images, words for text, and learned sound inventories for speech. Because a collection of pixels differs significantly from an audio waveform or a passage of text, algorithm design has been tied to a particular modality, which means that algorithms in each modality continue to work differently. data2vec simplifies this by teaching models to predict their own representations of the input data, regardless of modality. By focusing on these representations (the layers of a neural network) instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This eliminates the learning task’s reliance on modality-specific targets. It also doesn’t use contrastive learning or reconstruction of the input.
  • To directly predict representations, it was necessary to define a robust normalization of the target features that works reliably across modalities. The method starts by computing target representations from an image, a piece of text, or a speech utterance using a teacher network. Next, a portion of the input is masked and passed through a student network, which predicts the teacher’s latent representations. Even though it has only a partial view of the data, the student model must predict representations of the full input. The teacher network is identical to the student network, except with slightly out-of-date weights.
  • The method was tested on the primary ImageNet computer vision benchmark, where it outperformed existing methods for a variety of model sizes. It surpassed wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised speech algorithms. It was also evaluated on the popular GLUE benchmark suite for text, where it came out on par with RoBERTa, a reimplementation of BERT.
  • Key takeaways:
    • data2vec is a self-supervised algorithm that works for multiple modalities, outperforming the previous best single-purpose algorithms for computer vision and speech and achieving competitive scores on NLP tasks.
    • The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input.
    • Method:
      • data2vec is trained by predicting the model representations of the full input data given a partial view of the input
      • They first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model, but parameterized as an exponential moving average of the model weights (model in teacher mode)
      • The target representations encode all of the information in the training sample and the learning task is for the student to predict these representations given a partial view of the input.
    • Modality encoding:
      • The model architecture used is the standard Transformer architecture with a modality-specific encoding of the input data borrowed from prior work:
        • For computer vision, they have used the ViT-strategy of encoding an image as a sequence of patches, each spanning 16x16 pixels, input to a linear transformation.
        • Speech data is encoded using a multi-layer 1-D convolutional neural network that maps 16 kHz waveform to 50 Hz representations.
        • Text is pre-processed to obtain sub-word units, which are then embedded in distributional space via learned embedding vectors.
    • Ablations (layer-averaged targets):
      • They have used targets which are based on averaging multiple layers from the teacher network.
  • Facebook AI link; Github; Marktechpost article.
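  • The teacher update, the layer-averaged targets, and the masked prediction loss described above can be sketched as follows. This is a minimal NumPy sketch, not the released code: the function names, the decay value of 0.999, the number of averaged layers, and the plain squared-error loss are illustrative assumptions.

```python
import numpy as np

def ema_update(teacher, student, tau):
    # teacher <- tau * teacher + (1 - tau) * student, applied per parameter.
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}

def layer_averaged_targets(layer_outputs, top_k):
    # Average the outputs of the top_k teacher layers to form regression targets.
    return np.mean(layer_outputs[-top_k:], axis=0)

def masked_regression_loss(student_pred, targets, mask):
    # Squared error on masked positions only (the exact loss is an assumption here).
    diff = (student_pred - targets)[mask]
    return float((diff ** 2).mean())

# Teacher weights trail the student weights.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, tau=0.999)

# Six hypothetical teacher layers over 5 positions with 4 features each.
layer_outputs = [np.full((5, 4), float(i)) for i in range(1, 7)]
targets = layer_averaged_targets(layer_outputs, top_k=2)   # mean of layers 5 and 6

mask = np.array([True, False, True, False, False])         # positions hidden from the student
student_pred = targets + 1.0                                # a deliberately off prediction
loss = masked_regression_loss(student_pred, targets, mask)
```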

Hierarchical Text-Conditional Image Generation with CLIP Latents
  • In January 2021, OpenAI introduced DALL-E. A year later, their newest system, DALL-E 2, generates more realistic and accurate images with 4x greater resolution, better caption matching and photorealism.
  • Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
  • This paper by Ramesh et al. from OpenAI in 2022 proposes DALL-E 2, which leverages these representations for image generation. They propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and an “unCLIP” decoder that generates an image conditioned on the image embedding.
  • They show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
  • Their decoder, which is conditioned on image representations, can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation.
  • They use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
  • OpenAI article.
AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
  • Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast evolving models, considering serving performance, and optimizing for multiple objectives.
  • This paper by Zhang et al. from Google in 2022 addresses these problems by proposing AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. They use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on target hardware. Experiments on TPUv4i identify seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
  • By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) regarding the average GLUE score. By evaluating on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, which reduces parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
A Generalist Agent
  • This paper by Reed et al. from DeepMind in 2022 proposes Gato, a single generalist agent beyond the realm of text outputs, inspired by progress in large-scale language modeling.
  • Gato works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
  • The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data from different tasks and modalities, it is serialized into a flat sequence of tokens. In this representation, Gato can be trained and sampled from akin to a standard large-scale language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.
  • Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8192.
  • Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. The authors envision that in the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.
  • DeepMind page.
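  • The loss masking described above, where only target outputs (text and actions) contribute to the loss, can be sketched as follows. This is a minimal NumPy sketch with a hypothetical flattened episode and vocabulary size; the next-token shift of a real autoregressive setup is omitted for brevity.

```python
import numpy as np

def masked_cross_entropy(logits, targets, loss_mask):
    # logits: (T, V); targets: (T,); loss_mask: (T,) with 1 for text/action tokens.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * loss_mask).sum() / loss_mask.sum())

# Hypothetical flattened episode: observation tokens followed by an action token, repeated.
tokens = np.array([11, 12, 13, 3, 14, 15, 16, 5])
is_action = np.array([0, 0, 0, 1, 0, 0, 0, 1], dtype=float)

vocab = 32
uniform_logits = np.zeros((len(tokens), vocab))
uniform_loss = masked_cross_entropy(uniform_logits, tokens, is_action)

# Confident only at the action positions; observation positions stay uniform.
action_only_logits = np.zeros((len(tokens), vocab))
action_only_logits[3, 3] = 50.0
action_only_logits[7, 5] = 50.0
action_only_loss = masked_cross_entropy(action_only_logits, tokens, is_action)
```

  • The second loss is near zero even though observation tokens are predicted no better than chance, since masked-out positions do not contribute.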
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
  • Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity/quality and text relevancy (i.e., adherence to text of generated images), several pivotal gaps remain unaddressed, limiting applicability and quality.
  • This paper by Gafni et al. from Meta AI in 2022 proposes a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case.
  • While some methods propose image editing techniques, progress is not often directed towards enabling new forms of human creativity and experiences. They attempt to progress text-to-image generation towards a more interactive experience, where people can perceive more control over the generated outputs, thus enabling real-world applications such as storytelling.
  • In addition to improving the general image quality, they focus on improving key image aspects that are significant in human perception, such as faces and salient objects, resulting in higher favorability of their method in human evaluations and objective metrics.
  • Their model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512 × 512 pixels, significantly improving visual quality. Through scene controllability, they introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story they wrote.
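  • Classifier-free guidance, which Make-A-Scene adapts to an autoregressive transformer, can be sketched as a mix of conditional and unconditional next-token logits. This is a generic illustration of the idea; the function name and the guidance scale are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def guided_logits(cond_logits, uncond_logits, scale):
    """Push logits away from the unconditional prediction, toward the text-conditioned one."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy next-token logits over a 3-token vocabulary.
cond = np.array([2.0, 0.5, -1.0])     # conditioned on the text prompt
uncond = np.array([1.0, 1.0, 0.0])    # conditioned on an empty prompt
mixed = guided_logits(cond, uncond, scale=3.0)
```

  • A scale of 1 recovers the conditional logits, a scale of 0 recovers the unconditional ones, and larger scales trade diversity for adherence to the text.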
i-Code: An Integrative and Composable Multimodal Learning Framework
  • Human intelligence is multimodal; humans integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities.
  • This paper by Yang et al. from Microsoft in 2022 presents i-Code, a self-supervised pretraining framework which jointly learns representations for vision, language and speech into a unified, shared and general-purpose vector representation.
  • In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning.
  • They show that pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to. i-Code can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space.
  • Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
  • The figure below from the paper shows the overall model architecture of i-Code. Shown on the right is the attention and feed-forward operation in a fusion network layer with (a) merge-attention layers and (b) co-attention layers. To facilitate more effective cross-modality understanding and design the best fusion architecture, they explore two variations of the traditional attention mechanism: mechanisms that merge and cross the attention scores of different modalities, namely merge-attention (based on self-attention) and co-attention (based on self- and cross-attention) respectively. Note that for simplicity, only the residual connection of the language modality is drawn, but all three modalities use residual connections.

VL-BEIT: Generative Vision-Language Pretraining
  • This paper by Bao et al. from Furu Wei’s research group at Microsoft Research introduces a vision-language foundation model called VL-BEIT, a simple and effective approach to pretraining a bidirectional multimodal Transformer encoder for both vision-language and vision tasks learned by generative pretraining. Their minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer.
  • VL-BEIT solely employs generative pretraining tasks, including masked language modeling on texts, masked image modeling on images, and masked vision-language modeling on image-text pairs. VL-BEIT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training which renders it conceptually simple and empirically effective.
  • Experimental results show that VL-BEIT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, their method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
  • Code.
FLAVA: A Foundational Language And Vision Alignment Model
  • This paper by Singh et al. from Meta AI Research in CVPR 2022 presents FLAVA, a foundational vision and language alignment model that performs well on all three target modalities: 1) vision, 2) language, and 3) vision & language.
  • State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a “foundation”, that targets all modalities at once – a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks.
  • FLAVA was trained on a corpus of publicly available datasets that is several orders of magnitude smaller than similar recent models, but still obtained better or competitive performance. FLAVA paves the way forward towards generalized but open models that perform well on a wide variety of multimodal tasks.
  • FLAVA demonstrates impressive performance on a wide range of 35 tasks spanning these target modalities.
Flamingo: a Visual Language Model for Few-Shot Learning
  • In recent years, large-scale pre-training followed by task-specific fine-tuning has emerged as a standard approach, but the fine-tuning step still requires a lot of samples. In other words, building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
  • This paper by Alayrac et al. from DeepMind in NeurIPS 2022 introduces Flamingo, a family of Visual Language Models (VLM) which seek to train a multi-modal model (i.e., with the ability to understand different types of input – visual, audio, text etc.) in a few-shot learning approach (which refers to the ability to learn a new task with just a few samples for training).
  • Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs.
  • The key ideas behind Flamingo are:
    • Interleave cross-attention layers with language-only self-attention layers (frozen).
    • Perceiver-based architecture that transforms the input sequence data (videos) into a fixed number of visual tokens.
    • Large-scale multi-modal web data, obtained by scraping webpages that contain interleaved text and images.
  • Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities.
  • They perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering.
  • For tasks lying anywhere on this spectrum, they demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.
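  • The Perceiver-based idea listed among Flamingo's key ideas (mapping a variable number of visual features to a fixed number of visual tokens) can be sketched with a single cross-attention step. This is a simplified illustration only: it omits learned projections, feed-forward layers, and stacking, and all sizes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(visual_features, latents):
    """Cross-attend a fixed set of latent queries to a variable number of features.

    visual_features: (N, D), where N varies per input; latents: (K, D), learned.
    Returns (K, D): always K visual tokens, whatever N is.
    """
    d = latents.shape[-1]
    attn = softmax(latents @ visual_features.T / np.sqrt(d))  # (K, N)
    return attn @ visual_features

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32))       # K latent queries (illustrative size)
short_clip = rng.normal(size=(100, 32))   # few visual features
long_clip = rng.normal(size=(900, 32))    # many visual features
tokens_short = resample(short_clip, latents)
tokens_long = resample(long_clip, latents)
```

  • Both inputs yield the same number of visual tokens, which is what lets the language model attend to images and videos of different lengths at a fixed cost.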

Stable and Latent Diffusion Model
  • The following blog post summary has been contributed by Zhibo Zhang.
  • This blog post from Hugging Face describes Stable Diffusion, a latent diffusion model developed by CompVis, Stability AI and LAION.
  • According to the blog, the stable diffusion model takes in a text description as input, where the text encoder from the CLIP model is used to generate a representation for the text input.
  • A latent image representation of size \(64 \times 64\) is sampled from a Gaussian distribution. A UNet (conditioned on the text representation) works together with a scheduler algorithm to denoise the latent representation. Generally, 50 denoising iterations are sufficient to generate images of high quality. After the denoising process, the decoder of a variational autoencoder is responsible for reconstructing the latent representation back into an image of size \(512 \times 512\).
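  • The shapes and control flow of this pipeline can be walked through with a toy sketch. Everything here is a stand-in: `toy_denoiser` and `toy_decoder` are hypothetical placeholders for the UNet and the VAE decoder, the update rule is not a real scheduler, and the 4 latent channels and the text-embedding shape are assumptions not stated above.

```python
import numpy as np

def toy_denoiser(latent, text_embedding, step):
    """Stand-in for the text-conditioned UNet; ignores the text and returns a fake noise estimate."""
    return 0.1 * latent

def toy_decoder(latent):
    """Stand-in for the VAE decoder: keep 3 channels and upsample 8x per side."""
    return latent[:3].repeat(8, axis=1).repeat(8, axis=2)

def generate(text_embedding, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((4, 64, 64))   # Gaussian-initialized latent
    for step in range(num_steps):               # denoising loop, heavily simplified
        noise_estimate = toy_denoiser(latent, text_embedding, step)
        latent = latent - noise_estimate
    return toy_decoder(latent)                  # latent -> image-sized array

image = generate(text_embedding=np.zeros((77, 768)))  # assumed text-embedding shape
```

  • The point of the sketch is only the data flow: a small latent is refined for a fixed number of iterations and then decoded to an array with 512 pixels per side.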
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
  • The following summary has been contributed by Zhibo Zhang.
  • This paper by Ruiz et al. from Google Research and Boston University in 2022 introduces DreamBooth, which generates subjects with diverse contexts through text-to-image diffusion model fine-tuning.
  • Specifically, this work defines a new problem setting: recontextualize the specified subject while ensuring that the key visual features of the original subject are preserved.
  • In order to achieve this goal, the authors adopted the pre-trained Imagen model (Saharia et al.) and fine-tuned it using around 3 to 5 images of a chosen subject as follows:
    • The fine-tuning of the low-resolution part of the model: The image generation process is conditioned on the text which is composed of the class noun and a rare token identifier for the subject. The objective function contains two parts: 1. The reconstruction loss to ensure that the generated images are similar to the input images. 2. The class-specific prior preservation loss to ensure that the generated images have diversity.
    • The fine-tuning of the super-resolution part of the model: Only the reconstruction loss is used. This step is to ensure the preservation of fine-grained details of the subjects in output images.
  • The authors discussed a few application scenarios of the DreamBooth framework including recontextualization, art renditions, expression manipulation, novel view synthesis, accessorization as well as property modification and displayed some example images for each application.
  • The authors also performed ablation studies validating that:
    • It is necessary to use the correct class noun in the input text.
    • The prior preservation encourages diversity in the generated images.
    • Using low-level noise when fine-tuning the super-resolution component improves the quality of the generated images.
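  • The two-part DreamBooth objective for the low-resolution model can be sketched as a weighted sum of a reconstruction term and a prior preservation term. This is a hedged illustration: using mean squared error for both terms, the `prior_weight` parameter, and all argument names are assumptions rather than the authors' code.

```python
import numpy as np

def dreambooth_loss(pred_subject, target_subject, pred_prior, target_prior, prior_weight=1.0):
    """Reconstruction loss on subject images plus weighted class-prior preservation loss.

    pred_subject / target_subject: model output and target for the few subject images.
    pred_prior / target_prior: model output and target for generic images of the class.
    """
    reconstruction = np.mean((pred_subject - target_subject) ** 2)
    prior_preservation = np.mean((pred_prior - target_prior) ** 2)
    return float(reconstruction + prior_weight * prior_preservation)

# Toy arrays standing in for image-shaped tensors.
loss = dreambooth_loss(
    pred_subject=np.zeros((2, 4)), target_subject=np.ones((2, 4)),
    pred_prior=2.0 * np.ones((2, 4)), target_prior=np.zeros((2, 4)),
    prior_weight=0.5,
)
```

  • Setting `prior_weight` to 0 corresponds to fine-tuning with the reconstruction loss alone, as in the super-resolution stage described above.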
UniT: Multimodal Multitask Learning with a Unified Transformer
  • This paper by Hu and Singh from Facebook AI in 2021 proposes UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
  • Based on the transformer encoder-decoder architecture, UniT encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task.
  • Compared to previous efforts on multi-task learning with transformers, they share the same model parameters across all tasks instead of separately finetuning task-specific models and handle a much higher variety of tasks across different domains.
  • In their experiments, they learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters.
  • Code.
Perceiver IO: A General Architecture for Structured Inputs & Outputs
  • A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs.
  • This paper by Jaegle et al. from DeepMind in ICLR 2022 proposes Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs.
  • Perceiver IO augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II.
  • As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
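  • The linear-scaling claim for Perceiver IO can be illustrated with a rough count of attention score computations. This is a back-of-the-envelope sketch under stated assumptions: it counts only query-key pairs, ignores feature dimensions and constants, and the sizes used are arbitrary.

```python
def perceiver_io_attention_cost(num_inputs, num_outputs, num_latents, num_latent_layers):
    """Query-key pairs scored by an encode / process / decode stack."""
    encode = num_inputs * num_latents                # inputs -> latents (cross-attention)
    process = num_latent_layers * num_latents ** 2   # self-attention in latent space
    decode = num_outputs * num_latents               # output queries -> latents
    return encode + process + decode

def full_self_attention_cost(num_inputs, num_layers):
    """Query-key pairs scored when every layer attends over all inputs."""
    return num_layers * num_inputs ** 2

# Arbitrary illustrative sizes.
small = perceiver_io_attention_cost(10_000, 10_000, 512, 8)
large = perceiver_io_attention_cost(20_000, 10_000, 512, 8)
baseline = full_self_attention_cost(10_000, 8)
```

  • Doubling the input length adds a term proportional to the number of inputs, whereas the full self-attention count grows quadratically.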
Foundation Transformers
  • The following summary has been contributed by Zhibo Zhang.
  • Transformers are widely adopted across various input modalities such as speech, text and images. However, Transformers for different input modalities are generally designed to have distinct implementations such that the best performance can be achieved in each domain.
  • Foundation Transformers by Wang et al. from Microsoft in 2022 proposes the MAGNETO architecture, a general purpose transformer that can achieve stable task performance under various input modalities.
  • A key component is the introduction of Sub-LN (Sub-LayerNorm). As shown in the illustration figure by Wang et al., Sub-LN places two layer normalization operations in each of the multi-head attention module and the feed-forward network module. Specifically, for the multi-head attention module, compared to Pre-LN, Sub-LN introduces one more layer normalization operation following the multi-head self-attention component. For the feedforward network module, compared to Pre-LN, Sub-LN introduces one more layer normalization operation following the ReLU activation function.
  • With theoretical support, the authors showed the best initialization and weight scaling approaches for the encoder-only / decoder-only architecture and the encoder-decoder architecture.
  • Empirically, the authors validated the effectiveness of MAGNETO in domains with different input modalities including language, vision, speech and vision-language:
    • For the language domain, the authors conducted experiments on tasks including causal language modeling, masked language modeling as well as neural machine translation. On average, MAGNETO performed better than the comparison methodologies.
    • For the vision domain, the authors compared MAGNETO with the Vision Transformer (Dosovitskiy et al., 2021) with Pre-LN on both ImageNet (and its variants) image classification and ADE20k semantic segmentation tasks. MAGNETO outperformed Pre-LN in terms of top-1 accuracy for classification and in terms of mIoU score for semantic segmentation.
    • For the speech recognition task, MAGNETO achieved lower Word Error Rates compared to pre-LN.
  • In addition, the MAGNETO module also outperformed pre-LN on two vision-language tasks.
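  • The difference between Pre-LN and Sub-LN in the feed-forward module, as described for MAGNETO above, can be sketched in NumPy. This is a simplified illustration: layer normalization has no learnable gain or bias here, the paper's initialization and weight scaling are omitted, and the function names are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn_pre_ln(x, w_in, w_out):
    """Pre-LN: one normalization before the input projection."""
    hidden = np.maximum(layer_norm(x) @ w_in, 0.0)   # ReLU
    return x + hidden @ w_out

def ffn_sub_ln(x, w_in, w_out):
    """Sub-LN: an additional normalization after the ReLU, before the output projection."""
    hidden = np.maximum(layer_norm(x) @ w_in, 0.0)   # ReLU
    return x + layer_norm(hidden) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
w_in = rng.normal(size=(16, 64))
w_out = rng.normal(size=(64, 16))
y_pre, y_sub = ffn_pre_ln(x, w_in, w_out), ffn_sub_ln(x, w_in, w_out)
```

  • One visible consequence in this toy setting: rescaling `w_in` by a positive constant changes the Pre-LN output but leaves the Sub-LN output essentially unchanged, because the second normalization absorbs the scale.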

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
  • Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources.
  • This paper by Baevski et al. from Facebook in 2022 introduces data2vec 2.0, which addresses the computational inefficiency of data2vec 1.0 by increasing the training efficiency of data2vec, a learning objective that generalizes across several modalities.
  • data2vec 2.0 does not encode masked tokens, uses a fast convolutional decoder and amortizes the effort to build teacher representations.
  • data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner.
  • Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time.
  • Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
  • Facebook AI link.
Imagic: Text-Based Real Image Editing with Diffusion Models
  • Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object.
  • This paper by Kawar et al. from Google Research in CVPR 2023 introduces Imagic which, for the very first time, demonstrates the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, Imagic can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Imagic can make a standing dog sit down or jump, cause a bird to spread its wings, etc. — each within its single high-resolution natural image provided by the user.

  • Contrary to previous work, Imagic requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Imagic leverages a pre-trained text-to-image diffusion model for this task.
  • It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. They demonstrate the quality and versatility of Imagic on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.
  • The following diagram shows the method adopted by Imagic. Given a real image and a target text prompt, they encode the target text and get the initial text embedding \(e_{tgt}\), then optimize it to reconstruct the input image, obtaining \(e_{opt}\). They then fine-tune the generative model to improve fidelity to the input image while fixing \(e_{opt}\). Finally, they interpolate \(e_{opt}\) with \(e_{tgt}\) to generate the edit result.
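  • The final interpolation step can be written as a linear blend of the two embeddings. This is a minimal sketch: the parameter name `eta`, the embedding shapes, and the convention that larger values move toward the target text are assumptions for illustration.

```python
import numpy as np

def interpolate_embeddings(e_tgt, e_opt, eta):
    """Blend the optimized embedding e_opt with the target-text embedding e_tgt."""
    return eta * e_tgt + (1.0 - eta) * e_opt

rng = np.random.default_rng(0)
e_tgt = rng.normal(size=(77, 768))   # illustrative embedding shape
e_opt = rng.normal(size=(77, 768))
e_edit = interpolate_embeddings(e_tgt, e_opt, eta=0.7)
```

  • With `eta` at 0 the blend reproduces \(e_{opt}\) (reconstructing the input image), and with `eta` at 1 it reproduces \(e_{tgt}\) (following the text only); intermediate values trade fidelity against edit strength.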

EDICT: Exact Diffusion Inversion via Coupled Transformations
  • Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The state-of-the-art approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning.
  • However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content.
  • This paper by Wallace et al. seeks to alleviate these problems and proposes Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion.
  • Using Stable Diffusion, a state-of-the-art latent diffusion model, they demonstrate that EDICT successfully reconstructs real images with high fidelity.
  • On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits – from local and global semantic edits to image stylization – while maintaining fidelity to the original image structure.
  • EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM.
  • Code.
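  • The coupling idea behind EDICT (two vectors that are updated using each other, so that every step can be undone exactly) can be sketched as follows. This is a simplified illustration in the spirit of affine coupling layers, not the paper's full algorithm: `toy_eps` is a stand-in for the noise-prediction network, the coefficients `a` and `b` are arbitrary, and any additional mixing steps are omitted.

```python
import numpy as np

def toy_eps(z):
    """Stand-in for the denoising network; any deterministic function works here."""
    return np.tanh(z)

def coupled_step(x, y, a, b):
    """Update x using y, then update y using the new x."""
    x_new = a * x + b * toy_eps(y)
    y_new = a * y + b * toy_eps(x_new)
    return x_new, y_new

def coupled_step_inverse(x_new, y_new, a, b):
    """Undo coupled_step exactly by reversing the two updates in opposite order."""
    y = (y_new - b * toy_eps(x_new)) / a
    x = (x_new - b * toy_eps(y)) / a
    return x, y

rng = np.random.default_rng(0)
x0, y0 = rng.normal(size=8), rng.normal(size=8)
x, y = x0, y0
for _ in range(10):
    x, y = coupled_step(x, y, a=0.98, b=0.1)
for _ in range(10):
    x, y = coupled_step_inverse(x, y, a=0.98, b=0.1)
```

  • Because each update only uses quantities that are still available at inversion time, the original pair is recovered up to floating-point error, with no linearization assumption.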
CLAP: Learning Audio Concepts From Natural Language Supervision
  • Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories.
  • This paper by Elizalde et al. from Microsoft in 2022 proposes Contrastive Language-Audio Pretraining (CLAP), which learns audio concepts from natural language supervision. CLAP connects language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space.
  • They trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, Music tasks, and Speech-related tasks. Although CLAP was trained with significantly fewer pairs than similar computer vision models, it establishes SoTA for Zero-Shot performance.
  • Additionally, they evaluated CLAP in a supervised learning setup and achieved SoTA in 5 tasks. Hence, CLAP’s Zero-Shot capability removes the need for training with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.
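  • The contrastive objective that pulls matching audio and text embeddings together in a joint space can be sketched as a symmetric cross-entropy over a similarity matrix. This is a generic illustration of the technique; the temperature value, the function names, and the random embeddings are assumptions, not CLAP's actual settings.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Row i of audio_emb and row i of text_emb are treated as a matching pair."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                    # (B, B) scaled cosine similarities
    idx = np.arange(len(a))
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (audio_to_text + text_to_audio))

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
aligned_loss = symmetric_contrastive_loss(audio, audio.copy())           # perfectly matched pairs
shuffled_loss = symmetric_contrastive_loss(audio, np.roll(audio, 1, 0))  # mismatched pairs
```

  • At inference time, the same similarity matrix supports zero-shot classification: class names are embedded as text and the closest one to the audio embedding is chosen.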

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
  • Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB).
  • This paper by Yang et al. in AAAI 2022 proposes PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3’s power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, they treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, they first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples.
  • They further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. They also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
  • The following figure from the paper shows the inference-time interface of PICa for \(n\)-shot VQA. The input prompt to GPT-3 consists of a prompt head \(\boldsymbol{h}\) (blue box), \(n\) in-context examples \(\left(\left\{\boldsymbol{x}_i, \boldsymbol{y}_i\right\}_{i=1}^n\right)\) (red boxes), and the VQA input \(\boldsymbol{x}\) (green box). The answer \(\boldsymbol{y}\) is produced in an open-ended text generation manner. PICa supports zero-/few-shot VQA by including different numbers of in-context examples in prompt.
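  • The prompt layout described above (a prompt head, \(n\) in-context examples, and the VQA input) can be assembled with plain string formatting. This is a hypothetical sketch: the head text, the field labels, and the separator are illustrative assumptions and may not match the paper's exact template.

```python
def build_prompt(head, examples, caption, question):
    """Concatenate the prompt head, n in-context examples, and the query to be answered."""
    blocks = [head]
    for ex in examples:
        blocks.append(f"Context: {ex['caption']}\nQ: {ex['question']}\nA: {ex['answer']}")
    blocks.append(f"Context: {caption}\nQ: {question}\nA:")   # answer left open for the model
    return "\n===\n".join(blocks)

head = "Please answer the question according to the context."
examples = [
    {"caption": "A man riding a wave on a surfboard.",
     "question": "What sport is this?", "answer": "surfing"},
]
prompt = build_prompt(head, examples,
                      caption="A red double-decker bus on a city street.",
                      question="In which country is this bus common?")
zero_shot_prompt = build_prompt(head, [], "A cat on a sofa.", "What animal is this?")
```

  • Passing an empty example list gives the zero-shot variant; adding more examples gives the few-shot variants.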

OCR-free Document Understanding Transformer
  • Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. The following figure from the paper shows the pipeline overview and benchmarks. The proposed end-to-end model, Donut, outperforms the recent OCR-dependent VDU models in memory, time cost and accuracy. Performances on visual document information extraction are shown in (b).

  • Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process.
  • This paper by Kim et al. from NAVER CLOVA, NAVER Search, NAVER AI Lab, Upstage, Tmax, Google, LBox in ECCV 2022 seeks to address these issues and introduces a novel OCR-free VDU model named Donut, which stands for Document understanding transformer.
  • As the first step in OCR-free VDU research, they propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective.
  • The following figure from the paper shows the pipeline of Donut. The encoder maps a given document image into embeddings. With the encoded embeddings, the decoder generates a sequence of tokens that can be converted into a target type of information in a structured form.

  • Through extensive experiments and analyses, they show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy.
  • In addition, they offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.
  • Code.
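  • The last step of the Donut pipeline, converting a generated token sequence into structured output, can be sketched with a small parser. The `<s_key>…</s_key>` tag format and the field names used here are illustrative assumptions, and the parser does not handle repeated or same-named nested fields.

```python
import re

def tokens_to_dict(sequence):
    """Convert '<s_key>value</s_key>' spans into a (possibly nested) dictionary."""
    result = {}
    pattern = r"<s_(\w+)>(.*?)</s_\1>"
    for key, inner in re.findall(pattern, sequence, flags=re.DOTALL):
        # Recurse when the span itself contains further tagged fields.
        result[key] = tokens_to_dict(inner) if "<s_" in inner else inner.strip()
    return result

# Hypothetical decoder output for a receipt image.
generated = "<s_menu><s_name>Latte</s_name><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>"
parsed = tokens_to_dict(generated)
```

  • Emitting field boundaries as tokens lets a single text decoder produce outputs for different document tasks without a task-specific output layer.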


Pix2Video: Video Editing using Image Diffusion
  • Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications.
  • This paper by Ceylan et al. from Adobe Research and UCL investigates how pre-trained image models could be used for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video.
  • Pix2Video works in two simple steps: first, they use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the next step, they progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. In other words, as shown in the figure below (source), Pix2Video first inverts each frame with DDIM-inversion and considers it as the initial noise for the denoising process. To edit each frame (lower row), they select a reference frame (upper row) and inject its self-attention features into the UNet. At each diffusion step, they also update the latent of the current frame guided by the latent of the reference.

  • Pix2Video then consolidates the changes by adjusting the latent code for the frame before continuing the process.
  • Pix2Video’s approach is training-free and generalizes to a wide range of edits. They demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts. They demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
  • Project page.
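  • One way to picture the self-attention feature injection used by Pix2Video is an attention layer whose queries come from the current frame while its keys and values come from a reference frame. This is a simplified single-head sketch under that assumption; the function names, shapes, and random weights are illustrative and not taken from the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_injection(current, reference, w_q, w_k, w_v):
    """Queries from the current frame; keys and values from the reference frame."""
    q = current @ w_q
    k = reference @ w_k
    v = reference @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N_current, N_reference)
    return weights @ v

rng = np.random.default_rng(0)
current = rng.normal(size=(6, 8))      # features of the frame being edited
reference = rng.normal(size=(10, 8))   # features of the already-edited reference frame
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
edited = attention_with_injection(current, reference, w_q, w_k, w_v)
```

  • Each output row is a weighted average of reference-frame values, which is how the appearance of the edited reference can carry over to later frames.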
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
  • Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT can offer powerful conversation, in-context learning and code generation abilities on a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common sense knowledge they have acquired. However, they still face difficulties with some specialized tasks because they lack enough domain-specific data during pre-training or they often have errors in their neural network computations on those tasks that need accurate executions. On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well. However, due to the different implementation or working mechanisms, they are not easily accessible or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in the outlines to the off-the-shelf models and systems with special functionalities to complete them.
  • This paper by Liang et al. from Microsoft in 2023 introduces TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work that aimed to improve a single AI model, TaskMatrix.AI focuses more on using existing foundation models (as a brain-like central system) and APIs of other AI models and systems (as sub-task solvers) to achieve diversified tasks in both digital and physical domains.
  • In this position paper, they present their vision of how to build such an ecosystem, explain each key component, and use case studies to illustrate both the feasibility of this vision and the main challenges that need to be addressed next.
  • The following figure from the paper presents an overview of TaskMatrix.AI. Given user instruction and the conversational context, the multimodal conversational foundation model (MCFM) first generates a solution outline (step 1), which is a textual description of the steps needed to solve the task. Then, the API selector chooses the most relevant APIs from the API platform according to the solution outline (step 2). Next, MCFM generates action codes using the recommended APIs, which will be further executed by calling APIs. Last, the user feedback on task completion is returned to MCFM and API developers.

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
  • Solving complicated AI tasks with different domains and modalities is a key step toward advanced artificial intelligence. While there are abundant AI models available for different domains and modalities, they cannot handle complicated AI tasks.
  • This paper by Shen et al. from Zhejiang University and Microsoft Research Asia in 2023 advocates that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks and language could be a generic interface to empower this, considering the exceptional ability large language models (LLMs) have exhibited in language understanding, generation, interaction, and reasoning, etc. Based on this philosophy, they present HuggingGPT, a framework that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks.
  • Specifically, they use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT is able to cover numerous sophisticated AI tasks in different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards advanced artificial intelligence.
  • Summary:
    1. HuggingGPT was recently introduced as middleware that bridges Large Language Models (LLMs) and other AI models. The workflow goes as follows.
    2. Users send a request (possibly multimodal), which is processed by an LLM controller. The LLM analyzes the request, infers the user's intent, and decomposes it into solvable sub-tasks.
    3. ChatGPT selects and invokes the corresponding models hosted on Hugging Face to solve each subtask.
    4. Once tasks are executed, the invoked model returns the results to the ChatGPT controller.
    5. Finally, ChatGPT integrates the prediction of all models and generates the response.
    6. Notably, HuggingGPT can show its reasoning and point to its in-context task-model assignment as intermediate steps before generating the output.
  • The following figure shows that language serves as an interface for LLMs (e.g., ChatGPT) to connect numerous AI models (e.g., those in Hugging Face) for solving complicated AI tasks. In this concept, an LLM acts as a controller, managing and organizing the cooperation of expert models. The LLM first plans a list of tasks based on the user request and then assigns expert models to each task. After the experts execute the tasks, the LLM collects the results and responds to the user.
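  • The four stages above (task planning, model selection, task execution, response generation) can be sketched as a small controller loop. This is a minimal illustration rather than the authors' implementation: the keyword-based `plan_tasks`, the `REGISTRY` of toy experts, and every name below are hypothetical stand-ins for what would be LLM prompts and hosted models in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExpertModel:
    name: str
    description: str            # function description used for model selection
    run: Callable[[str], str]   # stand-in for calling a hosted model


# Hypothetical registry standing in for models hosted on a model hub.
REGISTRY: Dict[str, List[ExpertModel]] = {
    "image-captioning": [
        ExpertModel("toy-captioner", "describes an image", lambda x: f"caption({x})")
    ],
    "object-detection": [
        ExpertModel("toy-detector", "finds objects in an image", lambda x: f"boxes({x})")
    ],
}


def plan_tasks(request: str) -> List[str]:
    """Stage 1, task planning. A real controller would prompt an LLM; this stub uses keywords."""
    text = request.lower()
    tasks = []
    if "describe" in text:
        tasks.append("image-captioning")
    if "count" in text or "detect" in text:
        tasks.append("object-detection")
    return tasks


def select_model(task: str) -> ExpertModel:
    """Stage 2, model selection. Stub: pick the first expert registered for the task."""
    return REGISTRY[task][0]


def execute(tasks: List[str], resource: str) -> Dict[str, str]:
    """Stage 3, task execution: run each selected expert on the input resource."""
    return {task: select_model(task).run(resource) for task in tasks}


def respond(request: str, results: Dict[str, str]) -> str:
    """Stage 4, response generation. Stub: join the expert outputs into one string."""
    parts = [f"{task}: {out}" for task, out in results.items()]
    return f"Request: {request} | " + "; ".join(parts)


def controller(request: str, resource: str) -> str:
    tasks = plan_tasks(request)
    results = execute(tasks, resource)
    return respond(request, results)


print(controller("Describe the picture and count the objects", "img.png"))
```

  • Keeping each stage behind its own function mirrors the division of labor described in the paper: only `plan_tasks`, `select_model`, and `respond` would need an LLM, while `execute` simply dispatches to experts.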

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
  • Contrastive learning has shown remarkable success in the field of multimodal representation learning.
  • This paper by Wu et al. from ICASSP 2023 proposes a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions.
  • To accomplish this target, they first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources.
  • Second, they construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders.
  • They incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance.
  • Third, they perform comprehensive experiments to evaluate their model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification.
  • The results demonstrate that their model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and obtains performance comparable to that of models in the non-zero-shot setting.
  • Code.
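  • To make the contrastive objective concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of paired audio and text embeddings. It is an illustration under assumptions, not the paper's training code: the fixed `temperature=0.07` and the function names are hypothetical, and the encoders that would produce the embeddings are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Row i of audio_emb is paired with row i of text_emb; all other rows act as negatives."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature                               # (N, N) similarity matrix
    idx = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()     # audio -> text direction
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()     # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)


rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
text = audio + 0.01 * rng.normal(size=(8, 16))   # toy "matched" pairs
matched = symmetric_contrastive_loss(audio, text)
mismatched = symmetric_contrastive_loss(audio, np.roll(text, 1, axis=0))
print(matched < mismatched)
```

  • The loss is low when each audio clip is most similar to its own caption and high when pairs are shuffled, which is the same property that makes text-to-audio retrieval and zero-shot classification possible at inference time.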

ImageBind: One Embedding Space To Bind Them All
  • This paper by Girdhar et al. from Meta in CVPR 2023 presents ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data.
  • They show that not all combinations of paired data are necessary to train such a joint embedding, and that image-paired data alone is sufficient to bind the modalities together.
  • ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation.
  • The emergent capabilities improve with the strength of the image encoder, and they set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, they show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
  • The figure below from the paper shows ImageBind’s joint embedding space, which enables novel multimodal capabilities. By aligning six modalities’ embeddings into a common space, ImageBind enables: (i) cross-modal retrieval, which shows emergent alignment of modalities such as audio, depth, or text that aren’t observed together; (ii) adding embeddings from different modalities, which naturally composes their semantics; and (iii) audio-to-image generation, by using its audio embeddings with a pre-trained DALLE-2 decoder designed to work with CLIP text embeddings.
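  • Cross-modal retrieval and embedding arithmetic in a shared space reduce to cosine similarity and vector addition. The sketch below uses hand-made three-dimensional toy vectors in place of real encoder outputs; the concept axes and all names are hypothetical and only illustrate the mechanics.

```python
import numpy as np


def normalize(x, eps=1e-12):
    # Unit-normalize along the last axis; works for a single vector or a matrix of rows.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery rows most similar (cosine) to the query."""
    sims = normalize(gallery) @ normalize(query)
    return np.argsort(-sims)[:k]


def compose(emb_a, emb_b):
    """Combine two embeddings (possibly from different modalities) by addition."""
    return normalize(normalize(emb_a) + normalize(emb_b))


# Toy shared space with hypothetical axes: [bird, beach, other].
gallery_names = ["bird", "beach", "bird on a beach"]
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])

bird_image = np.array([1.0, 0.0, 0.0])     # image embedding of a bird
waves_audio = np.array([0.0, 1.0, 0.0])    # audio embedding of waves, aligned with "beach"

audio_only = retrieve(waves_audio, gallery)[0]
combined = retrieve(compose(bird_image, waves_audio), gallery)[0]
print(gallery_names[audio_only], "|", gallery_names[combined])
```

  • In this toy setup the audio query alone lands on the beach image, while the sum of the bird image and the wave audio lands on the image containing both, which is the composition behavior the figure describes.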

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  • The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
  • This paper by Li et al. from Salesforce Research in 2023 proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model.
  • They propose a Querying Transformer (Q-Former) pre-trained with a new two-stage strategy. As shown in the figure below, Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, feeding the most useful visual features to the LLM so that it outputs the desired text.
  • In the first pre-training stage, they perform vision-language representation learning, which forces the Q-Former to learn the visual representation most relevant to the text. In the second pre-training stage, they perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and train the Q-Former such that its output visual representation can be interpreted by the LLM.
  • BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, BLIP-2 outperforms Flamingo-80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. They also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
  • The following figure from the paper shows an overview of BLIP-2’s framework. They pre-train a lightweight Querying Transformer following a two-stage strategy to bridge the modality gap. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen LLM, which enables zero-shot instructed image-to-text generation.

  • The following figure from the paper shows: (Left) Model architecture of Q-Former and BLIP-2’s first-stage vision-language representation learning objectives. They jointly optimize three objectives which enforce the queries (a set of learnable embeddings) to extract visual representation most relevant to the text. (Right) The self-attention masking strategy for each objective to control query-text interaction.

  • The following figure from the paper shows BLIP-2’s second-stage vision-to-language generative pre-training, which bootstraps from frozen large language models (LLMs). (Top) Bootstrapping a decoder-based LLM (e.g. OPT). (Bottom) Bootstrapping an encoder-decoder-based LLM (e.g. FlanT5). The fully-connected layer adapts from the output dimension of the Q-Former to the input dimension of the chosen LLM.
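  • The bottleneck role of the learnable queries can be illustrated with a single cross-attention step followed by the fully-connected projection. This is a heavily simplified sketch, assuming one attention head, random weights, and toy dimensions; the actual Q-Former is a multi-layer transformer, and every name and size below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries read from the frozen image features."""
    q = queries @ w_q                                          # (num_queries, d_model)
    k = image_feats @ w_k                                      # (num_patches, d_model)
    v = image_feats @ w_v                                      # (num_patches, d_model)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)    # (num_queries, num_patches)
    return attn @ v                                            # (num_queries, d_model)


# Toy sizes (hypothetical, far smaller than any real model).
num_queries, d_model, d_image, d_llm, num_patches = 4, 8, 12, 16, 50

queries = rng.normal(size=(num_queries, d_model))   # learnable query embeddings
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_image, d_model))
w_v = rng.normal(size=(d_image, d_model))
w_proj = rng.normal(size=(d_model, d_llm))          # fully-connected layer to the LLM input size

image_feats = rng.normal(size=(num_patches, d_image))   # output of the frozen image encoder
visual_tokens = cross_attend(queries, image_feats, w_q, w_k, w_v) @ w_proj

text_embeds = rng.normal(size=(5, d_llm))                # embedded text prompt (toy)
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(visual_tokens.shape, llm_input.shape)
```

  • However many patches the image encoder emits, the LLM only ever receives `num_queries` visual tokens, which is what makes the Q-Former an information bottleneck.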

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
  • General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored.
  • This paper by Dai et al. from Salesforce Research, HKUST, and NTU Singapore in 2023 conducts a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. They gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, they introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction.
  • The following figure from the paper shows the model architecture of InstructBLIP. The Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder, and feeds the visual features as soft prompt input to the frozen LLM. They instruction-tune the model with the language modeling loss to generate the response.

  • The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo.
  • Their models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, they qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.
  • The figure below from the paper shows a few qualitative examples generated by their InstructBLIP Vicuna model. Here, a range of its diverse capabilities is demonstrated, including complex visual scene understanding and reasoning, knowledge-grounded image description, multi-turn visual conversation, etc.

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
  • Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production.
  • This paper by Deiseroth et al. from Aleph Alpha, TU Darmstadt, and the German Research Center for Artificial Intelligence (DFKI) in 2023 presents AtMan, which provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space.
  • Their exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
  • The following figures from the paper show (top) an illustration of the proposed explainability method: first, they collect the original cross-entropy score of the target tokens (1). Then they iterate and suppress one token at a time, indicated by the red box, and track changes in the cross-entropy score of the target token (2); (bottom) manipulating the attention scores of a single token (highlighted in blue) inside a transformer block to steer the model’s prediction into a different contextual direction (amplifications highlighted in green, suppression in red).
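  • The suppress-and-measure loop can be sketched on a toy one-layer attention readout. This is an illustration under stated assumptions, not AtMan itself: suppression is modeled here by scaling a token's pre-softmax attention score by `(1 - factor)`, the cosine-neighborhood search is omitted, and the tiny "model" and all names are hypothetical. Only forward passes are used, so no gradients need to be stored.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def target_loss(token_embs, out_proj, target_id, suppress=None, factor=0.9):
    """Cross-entropy of the target token under a toy single-head attention readout."""
    query = token_embs[-1]                                   # last position attends over the input
    scores = token_embs @ query / np.sqrt(token_embs.shape[1])
    if suppress is not None:
        scores = scores.copy()
        scores[suppress] *= (1.0 - factor)                   # assumed form of attention suppression
    attn = softmax(scores)
    context = attn @ token_embs                              # attention-weighted mix of token embeddings
    probs = softmax(context @ out_proj)                      # distribution over a toy vocabulary
    return -np.log(probs[target_id])


def relevance_map(token_embs, out_proj, target_id, factor=0.9):
    """Relevance of token i = change in target loss when token i is suppressed."""
    base = target_loss(token_embs, out_proj, target_id)
    return np.array([
        target_loss(token_embs, out_proj, target_id, suppress=i, factor=factor) - base
        for i in range(len(token_embs))
    ])


rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 4))       # 5 input tokens, embedding size 4
out_proj = rng.normal(size=(4, 6))   # projection to a 6-token vocabulary
relevance = relevance_map(embs, out_proj, target_id=2)
print(relevance.round(3))
```

  • Each entry of the relevance map needs one extra forward pass, and the passes are independent of each other, so they can be batched in parallel.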

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  • Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning.
  • This paper by Lu et al. from UCLA and Microsoft Research presents Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks.
  • At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response.
  • They showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%, lifting the state of the art to 98.78%.
  • Their analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner.
  • The following figure from the paper shows two examples from Chameleon with GPT-4 on TabMWP, a mathematical reasoning benchmark with tabular contexts. Chameleon demonstrates flexibility and efficiency in adapting to different queries that require various reasoning abilities.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
  • This paper by Yang et al. proposes MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
  • They define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos.
  • MM-REACT’s prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT’s effectiveness in addressing the specified capabilities of interest and its wide applicability in scenarios that require advanced visual understanding.
  • Furthermore, they discuss and compare MM-REACT’s system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning.
  • The following figure from the paper shows that MM-REACT allocates specialized vision experts with ChatGPT to solve challenging visual understanding tasks through multimodal reasoning and action. For example, the system could associate information from multiple uploaded receipts and calculate the total travel cost (“Multi-Image Reasoning”).

  • The following figure from the paper shows the flowchart of MM-REACT for enhanced visual understanding with ChatGPT. The user input can be in the form of text, images, or videos, with the latter two represented as file path strings. ChatGPT is instructed to say specific watchwords in an action request if a vision expert is required to interpret the visual inputs. Regular expression matching is applied to parse the expert’s name and the file path, which are then used to call the vision expert (action execution). The expert’s output (observation) is serialized as text and combined with the history to further activate ChatGPT. If no extra experts are needed, MM-REACT returns the final response to the user. The right figure shows a single-round vision expert execution, which is the component that constructs the full execution flow.
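  • The parse-and-dispatch step in the flowchart can be sketched with a regular expression. The `ACTION: expert(path)` watchword format, the expert names, and the `OBSERVATION:` prefix below are hypothetical choices made for this example; the paper's actual prompt format may differ.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical watchword format: "ACTION: <expert_name>(<file_path>)".
ACTION_PATTERN = re.compile(r"ACTION:\s*(?P<expert>[\w-]+)\((?P<path>[^)]+)\)")

# Toy vision experts; real ones would call captioning, OCR, or detection models.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "image_captioning": lambda path: f"A caption for {path}",
    "ocr": lambda path: f"Text found in {path}",
}


def parse_action(llm_output: str) -> Optional[Tuple[str, str]]:
    """Extract (expert name, file path) from the LLM output, or None if no action is requested."""
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        return None
    return match.group("expert"), match.group("path").strip()


def step(llm_output: str, history: List[str]) -> Optional[str]:
    """Run one round: call an expert and log its observation, or return the final answer."""
    action = parse_action(llm_output)
    if action is None:
        return llm_output                                   # no expert needed: final response
    expert, path = action
    observation = EXPERTS[expert](path)                     # action execution
    history.append(f"OBSERVATION: {observation}")           # serialized as text for the next LLM turn
    return None


history: List[str] = []
first = step("I should read the receipt. ACTION: ocr(receipts/a.png)", history)
final = step("The total travel cost is listed on the receipt.", history)
print(history, final)
```

  • Because images and videos are passed around as file path strings, the language model never sees pixels; it only decides which expert to call and reads the serialized observation.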

PaLM-E: An Embodied Multimodal Language Model
  • Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding.
  • This paper by Driess et al. from Google, TU Berlin, and Google Research proposes PaLM-E, an embodied language model that directly incorporates real-world continuous sensor modalities into language models and thereby establishes the link between words and percepts. Inputs to their embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings.
  • They train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks, including sequential robotic manipulation planning, visual question answering, and captioning.
  • Their evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains.
  • Their largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
  • The following figure from the paper shows PaLM-E, a single general-purpose multimodal language model for embodied reasoning tasks, visual-language tasks, and language tasks. PaLM-E transfers knowledge from visual-language domains into embodied reasoning – from robot planning in environments with complex dynamics and physical constraints, to answering questions about the observable world. PaLM-E operates on multimodal sentences, i.e. sequences of tokens where inputs from arbitrary modalities (e.g. images, neural 3D representations, or states, in green and blue) are inserted alongside text tokens (in orange) as input to an LLM, trained end-to-end.
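  • A multimodal sentence can be pictured as a token sequence in which placeholder positions are filled with projected sensor embeddings instead of word embeddings. The sketch below is a toy illustration with random weights; the `<img>` placeholder, the tiny vocabulary, and the projection `w_img` are hypothetical and do not reflect the paper's actual encoders or dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_llm, d_img = 8, 6
vocab = {"What": 0, "is": 1, "in": 2, "<img>": 3, "?": 4}
token_table = rng.normal(size=(len(vocab), d_llm))   # toy word-embedding table of the LLM
w_img = rng.normal(size=(d_img, d_llm))              # hypothetical projection into the LLM space


def embed_multimodal_sentence(tokens, image_feats):
    """Interleave text-token embeddings with projected image vectors at each <img> slot."""
    rows = []
    images = iter(image_feats)
    for tok in tokens:
        if tok == "<img>":
            rows.append(next(images) @ w_img)               # (n_vectors, d_llm) for this image
        else:
            rows.append(token_table[vocab[tok]][None, :])   # (1, d_llm) for a text token
    return np.concatenate(rows, axis=0)


tokens = ["What", "is", "in", "<img>", "?"]
image = rng.normal(size=(3, d_img))                  # one image encoded as 3 feature vectors
sequence = embed_multimodal_sentence(tokens, [image])
print(sequence.shape)
```

  • The resulting matrix is what the language model consumes: four text positions plus three image positions, all in the same embedding space.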

MIMIC-IT: Multi-Modal In-Context Instruction Tuning
  • High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative to tune vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs in terms of quantity, diversity, and creativity remains limited, posing challenges to the generalization of interactive VLMs.
  • This paper by Li et al. from NTU Singapore and Microsoft Research presents MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning.
  • The instruction-response collection process, dubbed as Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT’s capabilities. Using the MIMIC-IT dataset, they train a large VLM named Otter based on OpenFlamingo.
  • Based on extensive evaluations conducted on vision-language benchmarks, it has been observed that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals it effectively aligns with the user’s intentions. They release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
  • The following figure from the paper shows an overview of MIMIC-IT. The MIMIC-IT dataset comprises 2.8M multi-modal instruction-response pairs spanning fundamental capabilities: perception, reasoning, and planning. Each instruction is accompanied by multi-modal conversational context, allowing VLMs trained on MIMIC-IT to demonstrate strong proficiency in interactive instruction following with zero-shot generalization.

Visual Instruction Tuning
  • Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
  • This paper by Liu et al. from UW-Madison, Microsoft Research, and Columbia University presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, they introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Their early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
  • When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
  • The following figure from the paper shows the LLaVA network architecture.

Multimodal Chain-of-Thought Reasoning in Language Models
  • Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality.
  • This paper by Zhang et al. from Shanghai Jiao Tong University and Amazon Web Services addresses the limitations of current CoT studies in large language models (LLMs) by incorporating both language (text) and vision (images) modalities.
  • It introduces Multimodal-CoT, a novel two-stage framework that enhances complex reasoning in LLMs. This approach first generates rationales using both text and images, then leverages these enhanced rationales for more accurate answer inference. This method marks a significant departure from existing CoT studies that focus solely on the language modality.
  • The following figure from the paper shows an example of the multimodal CoT task.

  • The following figure from the paper shows an overview of their Multimodal-CoT framework. Multimodal-CoT consists of two stages: (i) rationale generation and (ii) answer inference. Both stages share the same model architecture but differ in the input and output. In the first stage, they feed the model with language and vision inputs to generate rationales. In the second stage, they append the original language input with the rationale generated from the first stage. Then, they feed the updated language input with the original vision input to the model to infer the answer.

  • The authors demonstrate that their model, which has fewer than 1 billion parameters, significantly outperforms the previous state-of-the-art LLM, GPT-3.5, on the ScienceQA benchmark. With an accuracy gain of about 16 percentage points (from 75.17% to 91.68%), Multimodal-CoT not only surpasses GPT-3.5 but also exceeds human performance levels.
  • The paper provides a detailed analysis of the model’s architecture, highlighting the use of fine-tuned language models to effectively fuse vision and language representations. This is a key component in generating more informative rationales for the subsequent inference stage.
  • Empirical evaluations are included to demonstrate the model’s effectiveness in both rationale generation and answer accuracy, showcasing its superiority in scenarios where traditional CoT reasoning may falter.
  • The authors compare Multimodal-CoT with other models and baselines, emphasizing the considerable advancements it brings to multimodal reasoning tasks.
  • The potential applications and future improvements of Multimodal-CoT are also discussed, particularly in enhancing the interaction between language and vision features and incorporating more sophisticated vision extraction techniques.
  • Overall, this paper represents a significant leap in multimodal reasoning for LLMs, showing how integrating language and vision modalities can lead to remarkable improvements in reasoning and understanding.
  • Code.
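  • The two-stage flow described above (rationale generation, then answer inference on the question augmented with that rationale) can be sketched as follows. The two model callables are stubs that return canned strings, and the `Rationale:` separator is a hypothetical formatting choice; in the paper both stages are fine-tuned models that fuse text and vision features.

```python
from typing import Any, Callable, Dict, Tuple

# A "model" here is any callable taking (text, image) and returning generated text.
Model = Callable[[str, Any], str]


def multimodal_cot(question: str, image: Any,
                   rationale_model: Model, answer_model: Model) -> Tuple[str, str]:
    # Stage 1: generate a rationale from the language and vision inputs.
    rationale = rationale_model(question, image)
    # Append the generated rationale to the original language input.
    augmented = f"{question}\nRationale: {rationale}"
    # Stage 2: infer the answer from the augmented text and the original image.
    answer = answer_model(augmented, image)
    return rationale, answer


# Stub models for illustration only.
seen: Dict[str, str] = {}


def stub_rationale_model(text: str, image: Any) -> str:
    return "Opposite poles face each other, so the magnets attract."


def stub_answer_model(text: str, image: Any) -> str:
    seen["text"] = text          # record what stage 2 actually received
    return "(A) attract"


question = "Will these magnets attract or repel? Options: (A) attract (B) repel"
rationale, answer = multimodal_cot(question, "magnets.png",
                                   stub_rationale_model, stub_answer_model)
print(answer)
```

  • Both stages take the same image, so the vision input informs the rationale as well as the final answer, which is the point of contrast with language-only CoT.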
Dreamix: Video Diffusion Models are General Video Editors
  • Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied for image editing, very few works have done so for video editing.
  • This paper by Molad et al. from Google Research and The Hebrew University of Jerusalem presents the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos. Their approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt.
  • The following figure from the paper shows the video editing use-case with Dreamix: Frames from a video conditioned on the text prompt “A bear dancing and jumping to upbeat music, moving his whole body”. Dreamix transforms the eating monkey (top row) into a dancing bear, affecting appearance and motion (bottom row). It maintains fidelity to color, posture, object size and camera pose, resulting in a temporally consistent video.

  • As obtaining high fidelity to the original video requires retaining some of its high-resolution information, they add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity.
  • They propose to improve motion editability by a new, mixed objective that jointly finetunes with full temporal attention and with temporal attention masking.
  • They further introduce a new framework for image animation. They first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use their general video editor to animate it.
  • As a further application, Dreamix can be used for subject-driven video generation. Extensive qualitative and numerical experiments showcase the remarkable editing ability of Dreamix and establish its superior performance compared to baseline methods.
  • The following figure from the paper illustrates the process of inference. Dreamix supports multiple applications by application-dependent pre-processing (left), converting the input content into a uniform video format. For image-to-video, the input image is duplicated and transformed using perspective transformations, synthesizing a coarse video with some camera motion. For subject-driven video generation, the input is omitted - finetuning alone takes care of the fidelity. This coarse video is then edited using their general “Dreamix Video Editor” (right): they first corrupt the video by downsampling followed by adding noise. They then apply the finetuned text-guided VDM, which upscales the video to the final spatio-temporal resolution.
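  • The corruption step (downsampling followed by adding noise) can be sketched in a few lines of NumPy. This is a simplification under assumptions: strided subsampling stands in for proper downsampling, plain additive Gaussian noise stands in for the diffusion model's noise schedule, and the factors and `noise_std` are hypothetical values.

```python
import numpy as np


def corrupt_video(video, spatial_factor=4, temporal_factor=2, noise_std=0.5, seed=0):
    """Downsample a (T, H, W, C) video in time and space, then add Gaussian noise."""
    # Keep every temporal_factor-th frame and every spatial_factor-th pixel.
    low_res = video[::temporal_factor, ::spatial_factor, ::spatial_factor, :]
    rng = np.random.default_rng(seed)
    return low_res + noise_std * rng.normal(size=low_res.shape)


video = np.random.default_rng(1).uniform(size=(16, 64, 64, 3))   # toy input video
corrupted = corrupt_video(video)
print(video.shape, "->", corrupted.shape)
```

  • The corrupted clip keeps only coarse layout and motion from the original; the text-guided video diffusion model is then responsible for synthesizing the fine detail at the final resolution.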

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
  • This paper by Liu et al. from Tsinghua University, International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology, CUHK, MSR, presents an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions.
  • The following figure from the paper illustrates: (a) closed-set object detection requires models to detect objects of pre-defined categories; (b) previous work transfers models zero-shot to novel categories for model generalization – they propose to add referring expression comprehension (REC) as another evaluation of model generalization on novel objects with attributes; (c) they present an image editing application by combining Grounding DINO and Stable Diffusion.

  • The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, they conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
  • The following figure from the paper illustrates the framework of Grounding DINO including the overall framework, a feature enhancer layer, and a decoder layer in block 1, block 2, and block 3, respectively.

  • While previous works mainly evaluate open-set object detection on novel categories, they propose to also perform evaluations on referring expression comprehension for objects specified with attributes.
  • Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP.

  • Code.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
  • This technical report by Awadalla et al. from UW, Stanford, Allen Institute for AI, LAION, UCSB, Hebrew University, Columbia, Google DeepMind introduces OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters.
  • OpenFlamingo is an open-source replication of DeepMind’s Flamingo models, a suite of autoregressive vision-language models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of the corresponding Flamingo performance. The report describes their models, training data, hyperparameters, and evaluation suite.
  • The following figure from the paper illustrates the fact that OpenFlamingo-9B can process interleaved image-and-text sequences. This interface allows OpenFlamingo to learn many vision-language tasks through in-context demonstrations.

Med-Flamingo: a Multimodal Medical Few-shot Learner
  • Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable down-stream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real-time.
  • This paper by Moor et al. from Stanford University, Stanford Medicine, Hospital Israelita Albert Einstein, and Harvard Medical School proposes Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, they continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
  • Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which they evaluate on several datasets including a novel challenging open-ended VQA dataset of visual USMLE-style problems.
  • Furthermore, they conduct the first human evaluation for generative medical VQA, in which physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinicians’ ratings and is the first to enable multimodal medical few-shot adaptations, such as rationale generation.
  • The following figure from the paper shows an overview of the Med-Flamingo model using three steps. First, they pre-train their Med-Flamingo model using paired and interleaved image-text data from the general medical domain (sourced from publications and textbooks). They initialize their model at the OpenFlamingo checkpoint and continue pre-training on medical image-text data. Second, they perform few-shot generative visual question answering (VQA). For this, they leverage two existing medical VQA datasets, and a new one, Visual USMLE. Third, they conduct a human rater study with clinicians to rate generations in the context of a given image, question, and correct answer. The human evaluation was conducted with a dedicated app and results in a clinical evaluation score that serves as their main metric for evaluation.

Towards Generalist Biomedical AI
  • Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery.
  • This paper by Tu et al. from Google Research and Google DeepMind seeks to enable the development of these models by first curating MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. They then introduce Med-PaLM Multimodal (Med-PaLM M), their proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights.
  • Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. They also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning.
  • To further probe the capabilities and limitations of Med-PaLM M, they conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales.
  • In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
  • The following figure from the paper shows an overview of Med-PaLM M. A generalist biomedical AI system should be able to handle a diverse range of biomedical data modalities and tasks. To enable progress towards this overarching goal, they curate MultiMedBench, a benchmark spanning 14 diverse biomedical tasks including question answering, visual question answering, image classification, radiology report generation and summarization, and genomic variant calling. Med-PaLM Multimodal (Med-PaLM M), their proof of concept for such a generalist biomedical AI system (denoted by the shaded blue area), is competitive with or exceeds prior SOTA results from specialist models (denoted by dotted red lines) on all tasks in MultiMedBench. Notably, Med-PaLM M achieves this using a single set of model weights, without any task-specific customization.

PaLI: A Jointly-Scaled Multilingual Language-Image Model
  • Effective scaling and a flexible task interface enable large language models to excel at many tasks.
  • This paper by Chen et al. from Google Research in ICLR 2023 presents PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision.
  • PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
  • To train PaLI, they make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows them to capitalize on their existing capabilities and leverage the substantial cost of training them. They find that joint scaling of the vision and language components is important.
  • Since existing Transformers for language are much larger than their vision counterparts, they train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models.
  • To train PaLI, they create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
  • The PaLI main architecture is simple and scalable. It uses an encoder-decoder Transformer model, with a large-capacity ViT component for image processing.

Nougat: Neural Optical Understanding for Academic Documents
  • Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. Information extraction from PDFs, and especially from scientific papers, is one of the first milestones to conquer if we want to revolutionize science in the coming decades.
  • This paper by Blecher et al. from Meta proposes Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrates the effectiveness of the model on a new dataset of scientific documents.
  • Nougat offers a way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready.
  • At the time of writing, no paired dataset of PDF pages and corresponding source code existed, so the authors created their own from the open-access articles on arXiv. For layout diversity, they also include a subset of the PubMed Central (PMC) open-access non-commercial dataset. During pretraining, a portion of the Industry Documents Library (IDL) was also included. From arXiv, they collected the source code and compiled PDFs from 1,748,201 articles. To ensure consistent formatting, they first process the source files using LaTeXML and convert them into HTML5 files. This step was important as it standardized and removed ambiguity from the LaTeX source code, especially in mathematical expressions. The conversion process included replacing user-defined macros, standardizing whitespace, adding optional brackets, normalizing tables, and replacing references and citations with their correct numbers.
  • The following figure from the paper shows the data processing aspect of Nougat. The source file is converted into HTML, which is then converted to Markdown. a) The LaTeX source provided by the authors. b) The HTML file computed from the LaTeX source using LaTeXML. c) The Markdown file parsed from the HTML file. d) The PDF file provided by the authors.

  • The following figure from the paper illustrates Nougat’s simple end-to-end architecture (which resembles that of Donut). The Swin Transformer encoder takes a document image and converts it into latent embeddings, which are subsequently converted to a sequence of tokens in an autoregressive manner.

  • Nougat offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text.
  • The following figure from the paper offers an example of Nougat’s OCR capabilities on an old calculus textbook.
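  • Two of the normalization steps listed above for the LaTeX sources (replacing user-defined macros and standardizing whitespace) can be illustrated with a toy example. In the paper this work is done by LaTeXML. The regex-based helper below, `expand_simple_macros`, is a hypothetical stand-in that only handles argument-free `\newcommand` definitions and is not the authors' pipeline.

```python
import re

# Matches \newcommand{\name}{replacement} with at most one level of nested braces.
NEWCOMMAND = re.compile(
    r"\\newcommand\{(\\[A-Za-z]+)\}\{((?:[^{}]|\{[^{}]*\})*)\}"
)


def expand_simple_macros(source):
    """Inline argument-free macros, drop their definitions, collapse whitespace."""
    macros = dict(NEWCOMMAND.findall(source))
    body = NEWCOMMAND.sub("", source)
    for name, replacement in macros.items():
        # The lookahead keeps \R from matching inside longer names like \Rightarrow.
        pattern = re.escape(name) + r"(?![A-Za-z])"
        body = re.sub(pattern, lambda _m, r=replacement: r, body)
    return re.sub(r"[ \t]+", " ", body).strip()


source = r"\newcommand{\R}{\mathbb{R}}  Let $x \in \R$ and $f: \R   \Rightarrow \R$."
cleaned = expand_simple_macros(source)
print(cleaned)
```

  • After this kind of normalization, two papers that write the same formula with different private macros yield the same target text, which is the ambiguity the preprocessing is meant to remove.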

Text-Conditional Contextualized Avatars For Zero-Shot Personalization
  • Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language. However, the personalization aspect of these generative models is still challenging and under-explored. The authors propose a pipeline that enables personalization of image generation with avatars capturing a user’s identity in a delightful way.
  • This paper by Azadi et al. from Meta AI proposes Personalized Avatar Scene (PAS), a pipeline that is zero-shot, avatar texture and style agnostic, and does not require training on the avatar at all – it is scalable to millions of users who can generate a scene with their avatar.
  • To render the avatar in a pose faithful to the given text prompt, PAS utilizes a novel text-to-3D pose diffusion model trained on a curated large-scale dataset of in-the-wild human poses improving the performance of the SOTA text-to-motion models significantly.
  • The following figure from the paper shows PAS: They generate 3D SMPL body poses using a diffusion-based transformer model and leverage a pre-trained VPoser either for pose regularization or decoding. The generated pose is re-targeted to the avatar body, enabling every user to render their own avatar in the target generated pose. Finally, they generate an avatar scene using a fine-tuned text-to-image model conditioned on the rendered avatar and the text prompt.

  • The following figure from the paper shows their transformer based Text-to-3D pose diffusion model at time step t. The input sequence includes CLIP text embedding, tokens embedding, diffusion timestep, and noised pose and root orient representations $\left(\hat{x}_p, \hat{x}_r\right)$ all projected to the transformer dimension. A positional embedding is added to each token in the above sequence. The un-noised pose and root orient representations are predicted at each timestep during training.

  • At the time of this paper’s writing, this was the first instance which explored leveraging large-scale image datasets to learn human 3D pose parameters and overcome the limitations of motion capture datasets.
  • The following figure from the paper shows samples of images generated by the proposed approach, Personalized Avatar Scene (PAS). Each caption is prefixed by “A person (is) ”.
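  • The input sequence of the text-to-3D pose diffusion transformer described above (CLIP text embedding, token embeddings, diffusion timestep, and the noised pose and root-orient representations, each projected to the transformer dimension and given a positional embedding) can be sketched as follows. All dimensions, the random untrained projections, and the sinusoidal timestep features are assumptions made for illustration, not values from the paper.

```python
import math
import random

random.seed(0)
D_MODEL = 8   # hypothetical transformer width
MAX_LEN = 16  # hypothetical maximum sequence length


def make_linear(in_dim, out_dim):
    """Return a random, untrained projection from in_dim to out_dim."""
    weight = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weight]


def timestep_features(t, dim=4):
    """Sinusoidal features of the diffusion timestep (assumed encoding)."""
    return [math.sin(t / 10.0 ** i) if i % 2 == 0 else math.cos(t / 10.0 ** i)
            for i in range(dim)]


# Toy input sizes; none of these are taken from the paper.
CLIP_DIM, TOKEN_DIM, TIME_DIM, POSE_DIM, ROOT_DIM = 16, 12, 4, 32, 6
proj_clip = make_linear(CLIP_DIM, D_MODEL)
proj_token = make_linear(TOKEN_DIM, D_MODEL)
proj_time = make_linear(TIME_DIM, D_MODEL)
proj_pose = make_linear(POSE_DIM, D_MODEL)
proj_root = make_linear(ROOT_DIM, D_MODEL)
positional = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(MAX_LEN)]


def build_input_sequence(clip_emb, token_embs, t, noised_pose, noised_root):
    """Project every input to D_MODEL, concatenate, and add positional embeddings."""
    seq = [proj_clip(clip_emb)]
    seq += [proj_token(tok) for tok in token_embs]
    seq.append(proj_time(timestep_features(t, TIME_DIM)))
    seq.append(proj_pose(noised_pose))
    seq.append(proj_root(noised_root))
    return [[v + p for v, p in zip(vec, positional[i])] for i, vec in enumerate(seq)]


def rand_vec(n):
    return [random.random() for _ in range(n)]


sequence = build_input_sequence(
    rand_vec(CLIP_DIM), [rand_vec(TOKEN_DIM) for _ in range(3)], 50,
    rand_vec(POSE_DIM), rand_vec(ROOT_DIM),
)
print(len(sequence), len(sequence[0]))
```

  • The transformer would read this sequence and predict the un-noised pose and root-orient representations from the last two positions; that prediction step is omitted here.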

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation
  • Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts.
  • This paper introduces Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works.
  • Make-An-Animation is trained in two stages. First, they train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, they fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models.
  • The following figure from the paper illustrates the Make-An-Animation Model Architecture. Their diffusion model is built on a U-Net architecture inspired by recent image and video generation models. The U-Net consists of a sequence of Residual Blocks with 1x1 2D-convolution layers and Attention Blocks with cross-attention on textual information. To model the temporal dimension, they add 1D temporal convolution layers after each 1x1 2D-convolution, as well as temporal attention layers after each cross-attention layer. These temporal layers (greyed out in the figure) are only added in the motion fine-tuning stage.

  • Human evaluation of motion realism and alignment with input text shows that their model reaches state-of-the-art performance on text-to-motion generation.
  • The following figure from the paper shows samples generated by Make-An-Animation for text conditional motion generation. The lighting of the body models represents progress across time. Darker color indicates later frames in the sequence. In the top image, for a better visualization, frames are distributed horizontally.
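  • The two-stage design above, in which temporal layers are inserted after the existing 1x1 convolutions only during motion fine-tuning, can be sketched with a per-frame channel mix followed by a 1D convolution along the frame axis. Initializing the temporal kernel as an identity, so that the pre-trained per-frame behavior is preserved when the layer is first added, is an assumption borrowed from common practice in video models, not a detail stated here.

```python
def pointwise_conv(frames, weight):
    """1x1 convolution: mix channels independently at every frame."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in frames]


def temporal_conv1d(frames, kernel):
    """Depthwise 1D convolution along the frame axis with zero padding."""
    half = len(kernel) // 2
    num_frames, channels = len(frames), len(frames[0])
    out = []
    for t in range(num_frames):
        acc = [0.0] * channels
        for k, tap in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:
                for c in range(channels):
                    acc[c] += tap * frames[src][c]
        out.append(acc)
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 frames, 2 channels (toy pose features)
channel_mix = [[1.0, 0.0], [0.0, 1.0]]          # stand-in for a pre-trained 1x1 conv

per_frame = pointwise_conv(frames, channel_mix)             # stage 1: static poses
identity_init = temporal_conv1d(per_frame, [0.0, 1.0, 0.0])  # stage 2 at initialization
smoothed = temporal_conv1d(per_frame, [0.25, 0.5, 0.25])     # after learning a temporal filter
print(identity_init)
print(smoothed)
```

  • With the identity kernel the fine-tuning stage starts from exactly the static-pose model, and the temporal layers then learn how neighboring frames should influence each other.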

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  • This paper by Moon et al. from Meta AI and Reality Labs presents Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses.
  • AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module.
  • The following figure from the paper illustrates the AnyMAL training process. (a) Modality alignment pre-training allows for mapping the output of each modality encoder into the joint LLM embedding space through projection layers. (b) With multimodal instruction tuning, the model learns to associate system instructions and text queries with input multimodal contexts. Their modality-specific encoder zoo includes: CLIP ViT-L, ViT-G, DinoV2 (image), CLAP (audio), IMU2CLIP (IMU motion sensor), and InternVideo (video).

  • The following figure from the paper shows example AnyMAL outputs. The model understands various input signals (i.e. vision, audio, motion sensor signals), and responds to free-form user queries. When multiple modalities are interleaved and given as input (e.g. right-most: image + IMU motion sensor signals), the model reasons over them jointly.

  • To further strengthen the multimodal LLM’s capabilities, AnyMAL is fine-tuned with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs.
  • They conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
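  • The modality alignment step described above, where projection layers map a frozen encoder's output into the LLM embedding space, can be sketched as a small linear map that turns one encoder feature into a few soft tokens placed in front of the text embeddings. The toy sizes, the one-projection-per-token layout, and the random weights are assumptions for illustration only.

```python
import random

random.seed(0)
ENC_DIM, LLM_DIM, NUM_MODALITY_TOKENS = 6, 4, 2  # toy sizes, not the paper's


def make_projection(in_dim, out_dim):
    """Random, untrained weight matrix of shape (out_dim, in_dim)."""
    return [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weight, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


# One trainable projection per soft token; the encoder and the LLM stay frozen.
projections = [make_projection(ENC_DIM, LLM_DIM) for _ in range(NUM_MODALITY_TOKENS)]


def align(encoder_feature):
    """Map a modality feature to NUM_MODALITY_TOKENS vectors in LLM embedding space."""
    return [project(w, encoder_feature) for w in projections]


image_feature = [random.random() for _ in range(ENC_DIM)]  # stand-in for a frozen encoder output
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]      # stand-in for embedded text tokens
llm_input = align(image_feature) + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

  • Because the aligned vectors live in the same space as text token embeddings, the LLM can consume them without any change to its own architecture.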
Phenaki: Variable Length Video Generation From Open Domain Textual Description
  • This paper by Villegas et al. from Google Brain presents Phenaki, a model capable of realistic video synthesis, given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos.
  • To address these issues, Phenaki learns video representations by compressing the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, Phenaki uses a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video.
  • To address data issues, they demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text or a story) in an open domain.
  • To the best of their knowledge, this is the first time a paper studies generating videos from time variable prompts. In addition, compared to the per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but results in better spatio-temporal consistency.
  • The following figure from the paper shows the architecture of Phenaki. Left: C-ViViT encoder architecture. The embeddings of images and video patches from raw frames \(x\) are processed by a spatial and then a causal transformer (auto-regressive in time) to generate video tokens \(z\). Center: MaskGiT is trained to reconstruct masked tokens \(z\) predicted by a frozen C-ViViT encoder and conditioned on T5X tokens of a given prompt \(p_0\). Right: How Phenaki can generate arbitrarily long videos by freezing the past tokens and generating the future tokens. The prompt can change over time to enable time-variable prompt (i.e., story) conditional generation. The subscripts represent time (i.e., frame number).
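  • The reason causal attention in time lets the tokenizer handle variable-length videos can be shown with a toy example: when a frame may only attend to itself and earlier frames, the representation of a prefix does not change when more frames are appended. Uniform averaging over visible frames is used below as a hypothetical stand-in for learned attention weights.

```python
def causal_mask(num_frames):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    return [[key <= query for key in range(num_frames)] for query in range(num_frames)]


def causal_mean_pool(frame_features):
    """Toy causal attention: each frame averages itself and all earlier frames."""
    mask = causal_mask(len(frame_features))
    out = []
    for row in mask:
        visible = [f for f, allowed in zip(frame_features, row) if allowed]
        out.append(sum(visible) / len(visible))
    return out


video = [1.0, 3.0, 5.0, 7.0]            # one scalar feature per frame
full = causal_mean_pool(video)          # encode all four frames
prefix = causal_mean_pool(video[:2])    # encode only the first two frames
print(full)
print(prefix)
```

  • The first two outputs agree in both calls, so already-generated tokens can be frozen while future tokens are produced, possibly under a new prompt.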

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
  • Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
  • This paper by Khachatryan et al. from Picsart AI Research (PAIR), UT Austin, U of Oregon, and UIUC introduces a new task of zero-shot text-to-video generation and proposes a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain.
  • Their key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object.
  • Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, their approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.
  • Based on experiments, the Text2Video-Zero method performs comparably to, or sometimes better than, recent approaches, despite not being trained on additional video data.
  • The following figure from the paper shows the overview of the method: starting from a randomly sampled latent code \(x_T^1\), they apply \(\Delta t\) DDIM backward steps to obtain \(x_{T^{\prime}}^1\) using a pre-trained Stable Diffusion model (SD). A specified motion field results for each frame \(k\) in a warping function \(W_k\) that turns \(x_{T^{\prime}}^1\) to \(x_{T^{\prime}}^k\). By enhancing the latent codes with motion dynamics, they determine the global scene and camera motion and achieve temporal consistency in the background and the global scene. A subsequent DDPM forward application delivers latent codes \(x_T^k\) for \(k=1, \ldots, m\). By using the (probabilistic) DDPM method, a greater degree of freedom is achieved with respect to the motion of objects. Finally, the latent codes are passed to their modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame \(k=1, \ldots, m\). By using cross-frame attention, the appearance and the identity of the foreground object are preserved throughout the sequence. Optionally, they apply background smoothing. To this end, they employ salient object detection to obtain for each frame \(k\) a mask \(M^k\) indicating the foreground pixels. Finally, for the background (using the mask \(M^k\)), a convex combination between the latent code \(x_t^1\) of frame one warped to frame \(k\) and the latent code \(x_t^k\) is used to further improve the temporal consistency of the background.

  • The following figure from the paper shows that Text2Video-Zero enables zero-shot video generation using (i) a textual prompt (see rows 1, 2), (ii) a prompt combined with guidance from poses or edges (see lower right), and (iii) Video Instruct-Pix2Pix, i.e., instruction-guided video editing (see lower left). Results are temporally consistent and follow closely the guidance and textual prompts.
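  • Two mechanisms from the method overview above lend themselves to a short sketch: cross-frame attention, where every frame's queries attend to the keys and values of the first frame, and background smoothing, a convex combination of the warped first-frame latent and the current latent on background pixels. The function names, toy tensors, and the mixing weight `alpha` are hypothetical; in the real model these operations live inside the self-attention layers and latent space of Stable Diffusion.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def cross_frame_attention(frame_queries, frame_keys, frame_values):
    """Each frame uses its own queries but the keys and values of frame 1."""
    return [attention(q, frame_keys[0], frame_values[0]) for q in frame_queries]


def blend_background(mask, warped_first, current, alpha):
    """Keep foreground (mask == 1); mix warped frame-1 and current latents elsewhere."""
    return [m * cur + (1 - m) * (alpha * w + (1 - alpha) * cur)
            for m, w, cur in zip(mask, warped_first, current)]


first_frame_kv = [[1.0, 0.0], [0.0, 1.0]]            # 2 tokens, 2 channels
frame_queries = [[[1.0, 0.0], [0.0, 1.0]],           # frame 1
                 [[0.5, 0.5], [2.0, -1.0]]]          # frame 2
frame_keys = [first_frame_kv, [[9.0, 9.0], [9.0, 9.0]]]  # frame-2 keys are never used
outputs = cross_frame_attention(frame_queries, frame_keys, frame_keys)
blended = blend_background([1, 0], [10.0, 10.0], [2.0, 2.0], alpha=0.5)
print(outputs)
print(blended)
```

  • Because every output token is a weighted average of first-frame values, appearance information from frame 1 is carried into all later frames, which is the stated reason foreground identity is preserved.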

SeamlessM4T – Massively Multilingual & Multimodal Machine Translation
  • Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages. SeamlessM4T represents a significant breakthrough in the field of speech-to-speech and speech-to-text by addressing the challenges of limited language coverage and a reliance on separate systems, which divide the task of speech-to-speech translation into multiple stages across subsystems. These systems can leverage large amounts of data and generally perform well for only one modality. The authors’ challenge was to create a unified multilingual model that could do it all.
  • This technical report by Barrault et al. from Meta AI and UC Berkeley builds on advancements Meta and others have made over the years in the quest to create a universal translator. In 2022, Meta released No Language Left Behind (NLLB), a text-to-text machine translation model that supports 200 languages and has since been integrated into Wikipedia as one of its translation providers. A few months later, they shared a demo of their Universal Speech Translator, which was the first direct speech-to-speech translation system for Hokkien, a language without a widely used writing system. Through this, they developed SpeechMatrix, the first large-scale multilingual speech-to-speech translation dataset, derived from SpeechLASER, a breakthrough in supervised representation learning. Earlier this year, they also shared Massively Multilingual Speech, which provides automatic speech recognition, language identification, and speech synthesis technology across more than 1,100 languages. SeamlessM4T draws on findings from all of these projects to enable a multilingual and multimodal translation experience stemming from a single model, built across a wide range of spoken data sources and with state-of-the-art results.
  • Note that there are two views of what constitutes a direct model in speech-to-speech translation literature: (1) A model that does not use intermediate text representation and (2) a model that directly predicts the target spectrogram.
  • For the model, they use the multitask UnitY model architecture, which is capable of directly generating translated text and speech. This new architecture also supports automatic speech recognition, text-to-text, text-to-speech, speech-to-text, and speech-to-speech translations that are already a part of the vanilla UnitY model. The multitask UnitY model consists of three main sequential components. Text and speech encoders have the task of recognizing speech input in nearly 100 languages. The text decoder then transfers that meaning into nearly 100 languages for text, followed by a text-to-unit model to decode into discrete acoustic units for 36 speech languages. The self-supervised encoder, speech-to-text, text-to-text translation components, and text-to-unit model are pre-trained to improve the quality of the model and for training stability. The decoded discrete units are then converted into speech using a multilingual HiFi-GAN unit vocoder. The following figure from the paper shows an overview of SeamlessM4T. (1) shows the pre-trained models used when finetuning multitasking UnitY. (2) outlines multitasking UnitY with its two encoders, text decoder, T2U encoder-decoder, and the supporting vocoders for synthesizing output speech in S2ST.

  • How the encoder processes speech: Their self-supervised speech encoder, w2v-BERT 2.0, an improved version of w2v-BERT with better training stability and representation quality, learns to find structure and meaning in speech by analyzing millions of hours of multilingual speech. The encoder takes the audio signal, breaks it down into smaller parts, and builds an internal representation of what is being said. Because spoken words are made up of many of those sounds and characters, they use a length adaptor to roughly map them to actual words.
  • How the encoder processes text: Similarly, they have a text encoder that is based on the NLLB model. It has been trained to understand text in nearly 100 languages and produce representations that are useful for translation.
  • Producing text: Their text decoder is trained to take encoded speech representations or text representations. This can be applied to tasks in the same language, such as automatic speech recognition, and multilingual translation tasks. For example, someone can say the word “bonjour” in French, and expect the translated text in Swahili to be “habari.” With multitask training, they leverage the strengths of a strong text-to-text translation model (NLLB) to guide their speech-to-text translation model via token-level knowledge distillation. The following figure from the paper shows an overview of the SeamlessM4T X2T (Into-Text Translation and Transcription) model. (1) describes the main two building blocks: w2v-BERT 2.0 and SeamlessM4T-NLLB. (2) describes the training of the X2T model. In Stage 1, the model is trained on X–eng directions and in Stage 2, eng–X directions are added.

  • Producing speech: They use acoustic units to represent speech on the target side. The text-to-unit (T2U) component in the UnitY model generates these discrete speech units based on the text output and is pre-trained on ASR data prior to UnitY fine-tuning. A multilingual HiFi-GAN unit vocoder is then used to convert these discrete units into audio waveforms. The following figure from the paper shows an overview of the SeamlessM4T multitask UnitY model with the speech-to-speech translation task. (1) describes the additional two building blocks on top of X2T: T2U encoder-decoder and unit vocoder. (2) describes the training of the UnitY model. In Stage 3, the model is trained on S2ST data.

  • Data scaling:
    • Data-driven models like SeamlessM4T usually benefit from large amounts of high-quality end-to-end data, namely speech-to-text and speech-to-speech data. Relying only on human transcribed and translated speech does not scale to tackle the challenging task of speech translation for 100 languages. They build upon their pioneering work on text-to-text mining using a similarity measure in a joint embedding space, and initial work in speech mining to create additional resources to train the SeamlessM4T model.
    • First, they build a new massively multilingual and -modal text embedding space for 200 languages, named SONAR (Sentence-level mOdality- and laNguage-Agnostic Representations), which substantially outperforms existing approaches like LASER3 or LaBSE in multilingual similarity search. They then apply a teacher-student approach to extend this embedding space to the speech modality and currently cover 36 languages. Mining is performed in data from publicly available repositories of web data (tens of billions of sentences) and speech (4 million hours). In total, they were able to automatically align more than 443,000 hours of speech with texts and create about 29,000 hours of speech-to-speech alignments. This corpus, dubbed SeamlessAlign, is the largest open speech/speech and speech/text parallel corpus in terms of total volume and language coverage to date.
  • Results: For these tasks and languages, SeamlessM4T achieves state-of-the-art results for nearly 100 languages and multitask support across automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation—all in a single model.
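  • The token-level knowledge distillation mentioned under “Producing text”, where the text-to-text NLLB model guides the speech-to-text model, can be sketched as a per-token divergence between the teacher's and the student's output distributions. Using KL(teacher || student) averaged over target positions is a common formulation and is an assumption here; the exact loss used in the report is not reproduced.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kd_loss(student_logits, teacher_logits):
    """Mean over target positions of KL(teacher || student) across the vocabulary."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_teacher = softmax(t_row)
        p_student = softmax(s_row)
        total += sum(pt * (math.log(pt) - math.log(ps))
                     for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return total / len(student_logits)


# Two target positions over a toy 3-word vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.0, 3.0, 0.0]]   # stand-in for text-to-text (teacher) logits
student = [[1.0, 1.0, 0.0], [0.5, 1.0, 0.5]]    # stand-in for speech-to-text (student) logits
print(token_level_kd_loss(student, teacher))
print(token_level_kd_loss(teacher, teacher))
```

  • The loss is zero only when the student reproduces the teacher's distribution at every target position, so the speech model is pushed toward the text model's translation behavior token by token.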

PaLI-X: On Scaling up a Multilingual Vision and Language Model
  • This paper by Chen et al. presents the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
  • PaLI-X achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning.
  • PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them).
  • Finally, they observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
  • The following figure from the paper shows examples demonstrating multilingual, OCR and other capabilities transferred to detection.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
  • Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence.
  • This report by Yang et al. from Microsoft analyzes the latest model, GPT-4V(ision), to deepen the understanding of LMMs.
  • The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V’s capabilities, its supported inputs and working modes, and the effective ways to prompt the model.
  • In their approach to exploring GPT-4V, they curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V’s unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V’s unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting.
  • The following figure from the paper shows that GPT-4V can work with multi-image and interleaved image-text inputs.

  • The following figure from the paper shows constrained prompting to return in JSON format. Images are example IDs for samples. Red highlights the wrong answer.

  • The following figure from the paper illustrates that conditioning on a memetic proxy can help improve the model’s response. Green (red) highlights the correct (wrong) answer. Blue indicates different ways of prompting in addition to the basic requirement of “Count the number of apples in the image.”

  • They conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. They hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models.
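  • The constrained prompting idea mentioned above (asking the model to answer in a fixed JSON format) pairs naturally with a check on the reply. The sketch below makes no model call. The field names, the prompt wording, and both helper functions are hypothetical and only illustrate how such a constraint could be stated and verified.

```python
import json

REQUIRED_KEYS = ("name", "date_of_birth")  # hypothetical fields to extract


def build_constrained_prompt(keys):
    """Ask for the answer as a JSON object with exactly the given keys."""
    skeleton = json.dumps({k: "xxx" for k in keys})
    return ("Read the text in the image and return the information "
            f"in the following JSON format: {skeleton}")


def parse_constrained_reply(reply, keys):
    """Return the parsed dict, or None if the reply breaks the requested format."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(keys):
        return None
    return data


constrained_prompt = build_constrained_prompt(REQUIRED_KEYS)
# Illustrative reply strings written by hand, not real model output.
good = parse_constrained_reply('{"name": "A", "date_of_birth": "B"}', REQUIRED_KEYS)
bad = parse_constrained_reply("The name is A.", REQUIRED_KEYS)
print(constrained_prompt)
print(good, bad)
```

  • A format check of this kind only confirms that the structure was followed. As the figure's red highlights indicate, the field values themselves can still be wrong.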
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  • This paper by Zhu et al. from King Abdullah University of Science and Technology explores whether aligning visual features with advanced large language models (LLMs) like Vicuna can replicate the impressive vision-language capabilities exhibited by GPT-4.
  • The authors present MiniGPT-4 which combines a frozen visual encoder (ViT + Q-Former from BLIP-2) with a frozen Vicuna LLM using just a single trainable projection layer.
  • The model undergoes a two-stage training process. The first stage involves pretraining on a large collection of aligned image-text pairs. The second stage involves finetuning with a smaller, detailed image description dataset to enhance generation reliability and usability. MiniGPT-4 was initially pretrained on 5M image-caption pairs, then finetuned on 3.5K detailed image descriptions to improve language quality.
  • Without training the vision or language modules, MiniGPT-4 demonstrates abilities similar to GPT-4, such as generating intricate image descriptions, creating websites from handwritten text, and explaining unusual visual phenomena. Additionally, it showcases unique capabilities like generating detailed cooking recipes from food photos, writing stories or poems inspired by images, and diagnosing problems in photos with solutions. Quantitative analysis showed strong performance in tasks like meme interpretation, recipe generation, advertisement creation, and poem composition compared to BLIP-2.
  • The finetuning process in the second stage significantly improved the naturalness and reliability of language outputs. This process was efficient, requiring only 400 training steps with a batch size of 12, and took around 7 minutes with a single A100 GPU.
  • Additional emergent skills are observed like composing ads/poems from images, generating cooking recipes from food photos, retrieving facts from movie images etc. Aligning visual features with advanced LLMs appears critical for GPT-4-like capabilities, as evidenced by the absence of such skills in models like BLIP-2 with less powerful language models.
  • The figure below from the paper shows the architecture of MiniGPT-4. It consists of a vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and an advanced Vicuna large language model. MiniGPT-4 only requires training the linear projection layer to align the visual features with the Vicuna.

  • The simple methodology verifies that advanced vision-language abilities can emerge from properly aligning visual encoders with large language models, without necessarily needing huge datasets or model capacity.
  • Despite its advancements, MiniGPT-4 faces limitations like hallucination of nonexistent knowledge and struggles with spatial localization. Future research could explore training on datasets designed for spatial information understanding to mitigate these issues.
  • Project page; Code; HuggingFace Space; Video; Dataset.
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
  • This paper by Chen et al. from King Abdullah University of Science and Technology and Meta AI Research presents MiniGPT-v2, a model designed to handle various vision-language tasks such as image description, visual question answering, and visual grounding.
  • MiniGPT-v2 uniquely incorporates task-specific identifiers in training, allowing it to distinguish and effectively handle different task instructions. This is achieved by using a three-stage training strategy with a mix of weakly-labeled image-text datasets and multi-modal instructional datasets. The model architecture includes a visual backbone (adapted from EVA), a linear projection layer, and a large language model (LLaMA2-chat, 7B), trained with high-resolution images to process visual tokens efficiently.
  • The figure below from the paper shows the architecture of MiniGPT-v2. The model takes a ViT visual backbone, which remains frozen during all training phases. Four adjacent visual output tokens from the ViT backbone are concatenated and projected into the LLaMA-2 language model space via a linear projection layer.

  • In terms of performance, MiniGPT-v2 demonstrates superior results in various visual question-answering and visual grounding benchmarks, outperforming other generalist models like MiniGPT-4, InstructBLIP, LLaVA, and Shikra. It also shows a robust ability against hallucinations in image description tasks.
  • The figure below from the paper shows that MiniGPT-v2 achieves state-of-the-art performances on a broad range of vision-language tasks compared with other generalist models.

  • The paper highlights the importance of task identifier tokens, which significantly enhance the model’s efficiency in multi-task learning. These tokens have been shown to be crucial in the model’s strong performance across multiple tasks.
  • Despite its capabilities, MiniGPT-v2 faces challenges like occasional hallucinations and the need for more high-quality image-text aligned data for improvement.
  • The paper concludes that MiniGPT-v2, with its novel approach of task-specific identifiers and a unified interface, sets a new benchmark in multi-task vision-language learning. Its adaptability to new tasks underscores its potential in vision-language applications.
  • Project page; Code; HuggingFace Space; Demo; Video
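  • The token-merging step described for the MiniGPT-v2 architecture (concatenate four adjacent visual tokens, then apply a linear projection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy sizes, and the random weights are all hypothetical.

```python
import random

def merge_adjacent_tokens(tokens, group=4):
    """Concatenate every `group` adjacent token vectors into one longer vector."""
    if len(tokens) % group != 0:
        raise ValueError("number of tokens must be divisible by the group size")
    merged = []
    for start in range(0, len(tokens), group):
        vec = []
        for tok in tokens[start:start + group]:
            vec.extend(tok)
        merged.append(vec)
    return merged

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; `weight` has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]

rng = random.Random(0)
num_tokens, vis_dim, llm_dim = 16, 8, 6  # toy sizes (assumption, not the paper's)
visual_tokens = [[rng.gauss(0, 1) for _ in range(vis_dim)] for _ in range(num_tokens)]

merged = merge_adjacent_tokens(visual_tokens, group=4)
W = [[rng.gauss(0, 0.1) for _ in range(4 * vis_dim)] for _ in range(llm_dim)]
b = [0.0] * llm_dim
llm_inputs = linear_project(merged, W, b)
print(len(visual_tokens), "visual tokens ->", len(llm_inputs), "LLM input tokens")
```

  • Merging four tokens into one cuts the number of visual tokens passed to the language model by a factor of four, which is how the high-resolution input stays affordable.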
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  • This paper by Podell et al. from Stability AI Applied Research details significant advancements in the field of text-to-image synthesis using latent diffusion models (LDMs).
  • The paper introduces SDXL, a latent diffusion model that significantly improves upon previous versions of Stable Diffusion for text-to-image synthesis.
  • SDXL incorporates a UNet architecture three times larger than its predecessors, primarily due to an increased number of attention blocks and a larger cross-attention context. This is achieved by using a second text encoder, significantly enhancing the model’s capabilities.
  • Novel conditioning schemes are introduced, such as conditioning on original image resolution and cropping parameters. This conditioning is achieved through Fourier feature encoding and significantly improves the model’s performance and flexibility.
  • SDXL is trained on multiple aspect ratios, a notable departure from standard square image outputs. This training approach allows the model to better handle images with varied aspect ratios, reflecting real-world data more accurately.
  • An improved autoencoder is used, enhancing the fidelity of generated images, particularly in high-frequency details.
  • The paper also discusses a refinement model used as a post-hoc image-to-image technique to further improve the visual quality of samples generated by SDXL. SDXL demonstrates superior performance compared to earlier versions of Stable Diffusion and rivals state-of-the-art black-box image generators. The model’s performance was validated through user studies and quantitative metrics.
  • The figure below from the paper illustrates: (Left) Comparing user preferences between SDXL and Stable Diffusion 1.5 & 2.1. While SDXL already clearly outperforms Stable Diffusion 1.5 & 2.1, adding the refinement stage boosts performance further. (Right) Visualization of the two-stage pipeline: They generate initial latents of size 128 × 128 using SDXL. Afterwards, they utilize a specialized high-resolution refinement model and apply SDEdit on the latents generated in the first step, using the same prompt. SDXL and the refinement model use the same autoencoder.

  • The authors emphasize the open nature of SDXL, highlighting its potential to foster transparency in large model training and evaluation, which is crucial for responsible and ethical deployment of such technologies.
  • The paper represents a significant step forward in generative modeling for high-resolution image synthesis, showcasing the potential of latent diffusion models in creating detailed and realistic images from textual descriptions.
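  • The size and crop conditioning mentioned above can be sketched as follows: each scalar (original height, original width, crop top, crop left) is embedded with sinusoidal Fourier features, and the embeddings are concatenated into one conditioning vector. The frequency schedule, the embedding width, and the function names are assumptions made for illustration, not values taken from the paper.

```python
import math

def fourier_features(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one scalar: `dim/2` sines followed by `dim/2` cosines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def size_crop_conditioning(orig_size, crop_top_left, dim=8):
    """Embed (height, width, crop_top, crop_left) independently and concatenate."""
    cond = []
    for v in (*orig_size, *crop_top_left):
        cond.extend(fourier_features(float(v), dim))
    return cond

# Hypothetical training image: 512x768 original, cropped 64 pixels from the left.
cond = size_crop_conditioning(orig_size=(512, 768), crop_top_left=(0, 64))
print(len(cond))  # 4 scalars x 8 features each
```

  • At inference time the same vector can be built from a desired apparent resolution and a zero crop offset, which is what lets the user steer the output through these conditioning inputs.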
Diffusion Model Alignment Using Direct Preference Optimization
  • This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
  • The paper introduces Diffusion-DPO, a method adapted from Direct Preference Optimization (DPO), for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
  • Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
  • The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
  • The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
  • The figure below from the paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPO-SDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

  • Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
  • The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
  • The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
  • In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.
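  • A rough sketch of the shape of the Diffusion-DPO objective for a single preference pair: the loss compares how much the fine-tuned model improves its denoising error over a frozen reference model on the preferred image versus the rejected one. Here the four squared denoising errors are passed in as plain numbers, and `beta_t` is a hypothetical constant that folds together the DPO temperature and any timestep weighting; treat it as an illustration rather than the authors' training code.

```python
import math

def diffusion_dpo_loss(err_w, ref_err_w, err_l, ref_err_l, beta_t=1.0):
    """Per-pair loss: -log sigmoid(-beta_t * ((err_w - ref_err_w) - (err_l - ref_err_l))).

    err_w / err_l:       squared denoising error of the model being trained
                         on the preferred (w) and rejected (l) image.
    ref_err_w / ref_err_l: the same errors under the frozen reference model.
    """
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)
    z = -beta_t * diff
    # Numerically stable -log(sigmoid(z)) = softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# The loss drops below log(2) when the model improves on the preferred image
# (relative to the reference) more than it improves on the rejected one.
print(diffusion_dpo_loss(0.5, 1.0, 1.0, 1.0))
print(diffusion_dpo_loss(1.0, 1.0, 0.5, 1.0))
```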

Core ML


What Every Computer Scientist Should Know About Floating-Point Arithmetic
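  • This gem by Goldberg from Xerox PARC (widely circulated as a reprint in Oracle’s documentation) in the March 1991 issue of ACM Computing Surveys demystifies floating-point arithmetic, clears up common misconceptions about rounding error, and enables you to write more careful numerical code.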
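  • A few lines of Python illustrate the kind of pitfalls this paper explains: decimal fractions that are not exactly representable in binary, accumulated rounding error in sums, and catastrophic cancellation. The specific expressions are chosen here for illustration and are not taken from the paper.

```python
import math

# 0.1 and 0.2 have no exact binary representation, so equality tests fail.
a = 0.1 + 0.2
print(a == 0.3)              # False
print(math.isclose(a, 0.3))  # True: compare with a tolerance instead

# Rounding error accumulates in a naive sum; fsum tracks the lost low-order bits.
values = [0.1] * 10
print(sum(values) == 1.0)        # False
print(math.fsum(values) == 1.0)  # True

# Catastrophic cancellation: (1 - cos x) / x^2 should be close to 0.5 for small x,
# but subtracting two nearly equal numbers destroys the significant digits.
x = 1e-8
naive = (1 - math.cos(x)) / x ** 2
stable = 2 * math.sin(x / 2) ** 2 / x ** 2  # algebraically identical, no subtraction
print(naive, stable)
```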


Bidirectional recurrent neural networks
  • This paper by Schuster and Paliwal from the ATR Interpreting Telecommunications Research Laboratory, Kyoto, Japan in IEEE Transactions on Signal Processing 1997 proposes a bidirectional recurrent neural network (BRNN) by extending a regular recurrent neural network (RNN).
  • The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction. Structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency.
  • They also show how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.
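  • The core idea (one recurrent pass in the positive time direction, one in the negative time direction, with both states available at every step) can be sketched with a toy scalar RNN. The single-unit `tanh` cell and the weight values are assumptions for illustration only; the paper's networks are larger and are trained rather than hand-set.

```python
import math

def rnn_pass(inputs, w_in, w_rec):
    """Run a one-unit tanh RNN over `inputs` and return the hidden state at each step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Pair each time step's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(inputs, *fwd_params)
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]  # re-align to original time order
    return list(zip(forward, backward))

seq = [0.5, -1.0, 0.25, 2.0]
states = bidirectional_states(seq, fwd_params=(0.8, 0.5), bwd_params=(0.3, -0.4))
for t, (f, b) in enumerate(states):
    print(t, round(f, 4), round(b, 4))
```

  • The forward state at step 0 depends only on the first input, while the backward state at step 0 depends on the whole sequence, which is why no preset future window is needed.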


Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers
  • Accurate, well-calibrated estimates of class membership probabilities are needed in many supervised learning applications, in particular when a cost-sensitive decision must be made about examples with example-dependent costs.
  • This paper by Zadrozny and Elkan from UCSD in 2001 presents histogram binning, a simple but commonly-used calibration concept for obtaining calibrated probability estimates from decision tree and naive Bayesian classifiers.
  • Using the large and challenging KDD’98 contest dataset as a testbed, they report the results of a detailed experimental comparison of ten methods, according to four evaluation measures.
  • They conclude that binning succeeds in significantly improving naive Bayesian probability estimates, while for improving decision tree probability estimates, they recommend smoothing by \(m\)-estimation and a new variant of pruning that they call curtailment.
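  • Histogram binning can be sketched as follows: sort held-out examples by classifier score, split them into bins of equal size, and use the fraction of positives in each bin as the calibrated probability for any score that falls into it. The function names, the toy data, and the choice of four bins are illustrative assumptions.

```python
def fit_histogram_binning(scores, labels, num_bins=10):
    """Return a list of (upper_score_boundary, positive_rate), one entry per bin."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    bins = []
    for b in range(num_bins):
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue
        upper = chunk[-1][0]
        rate = sum(label for _, label in chunk) / len(chunk)
        bins.append((upper, rate))
    return bins

def calibrate(score, bins):
    """Map a raw score to the positive rate of the bin it falls into."""
    for upper, rate in bins:
        if score <= upper:
            return rate
    return bins[-1][1]  # scores above every boundary go to the last bin

scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
bins = fit_histogram_binning(scores, labels, num_bins=4)
print(bins)
print(calibrate(0.25, bins))
```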


Transforming classifier scores into accurate multiclass probability estimates
  • Class membership probability estimates are important for many applications of data mining in which classification outputs are combined with other sources of information for decision-making, such as example-dependent misclassification costs, the outputs of other classifiers, or domain knowledge. Previous calibration methods apply only to two-class problems.
  • This paper by Zadrozny and Elkan from UCSD in 2002 shows how to obtain accurate probability estimates for multiclass problems by combining calibrated binary probability estimates.
  • They also propose a new method, based on isotonic regression, for obtaining calibrated two-class probability estimates that can be applied to any classifier that produces a ranking of examples.
  • Using naive Bayes and support vector machine classifiers, they give experimental results from a variety of two-class and multiclass domains, including direct marketing, text categorization and digit recognition.
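  • A minimal sketch of the two ingredients: a pool-adjacent-violators (PAV) fit, which is the standard algorithm for isotonic regression and turns a ranking into monotone probability estimates, and a simple normalization that combines one-vs-all binary estimates into a multiclass distribution. The function names and toy data are assumptions, and normalization is only one of several possible ways to combine the binary estimates.

```python
def pav_calibrate(scores, labels):
    """Isotonic (non-decreasing) fit of labels against scores via pool-adjacent-violators."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    result = [0.0] * len(scores)
    for pos, i in enumerate(order):
        result[i] = fitted[pos]
    return result

def normalize_one_vs_all(binary_probs):
    """Combine calibrated one-vs-all estimates into a distribution that sums to 1."""
    total = sum(binary_probs)
    return [p / total for p in binary_probs]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]
calibrated = pav_calibrate(scores, labels)
print(calibrated)
print(normalize_one_vs_all([0.2, 0.2, 0.4]))
```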
Dimensionality Reduction by Learning an Invariant Mapping
  • This paper by Hadsell et al. from LeCun’s lab in CVPR 2006 first introduced the concept of a contrastive loss.
  • Contrastive loss is a distance-based loss as opposed to more conventional error-prediction losses. This loss is used to learn embeddings in which two “similar” points have a low Euclidean distance and two “dissimilar” points have a large Euclidean distance.
  • Two samples are either similar or dissimilar. This binary similarity can be determined using several approaches:
    • In this work, the \(N\) closest neighbors of a sample in input space (e.g. pixel space) are considered similar; all others are considered dissimilar. (This approach yields a smooth latent space; e.g. the latent vectors for two similar views of an object are close)
    • To the group of similar samples to a sample, transformed versions of the sample can be added (e.g. using data augmentation). This allows the latent space to be invariant to one or more transformations.
    • A manually obtained label determining if two samples are similar can be used (e.g., the class label). However, there can be cases where two samples from the same class are relatively dissimilar, or where two samples from different classes are relatively similar, so using classes alone does not encourage a smooth latent space.
  • Formally, if we consider \(\vec{X}\) as the input data and \(G_W(\vec{X})\) the output of a neural network, the interpoint distance is given by,
\[D_W\left(\vec{X}_1, \vec{X}_2\right)=\left\|G_W\left(\vec{X}_1\right)-G_W\left(\vec{X}_2\right)\right\|_2\]
  • The contrastive loss is simply,

    \[\begin{aligned} \mathcal{L}(W) &=\sum_{i=1}^P L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) \\ L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) &=(1-Y) L_S\left(D_W^i\right)+Y L_D\left(D_W^i\right) \end{aligned}\]
    • where \(Y=0\) when \(\vec{X}_1\) and \(\vec{X}_2\) are similar and \(Y=1\) otherwise, \(L_S\) is the loss for similar pairs, and \(L_D\) is the loss for dissimilar pairs.
  • More formally, the contrastive loss is given by,

    \[\begin{aligned} &L\left(W, Y, \vec{X}_1, \vec{X}_2\right)= \\ &\quad(1-Y) \frac{1}{2}\left(D_W\right)^2+(Y) \frac{1}{2}\left\{\max \left(0, m-D_W\right)\right\}^2 \end{aligned}\]
    • where \(m\) is a predefined margin.
  • The gradient is given by the simple equations:

\[\begin{gathered} \frac{\partial L_S}{\partial W}=D_W \frac{\partial D_W}{\partial W} \\ \frac{\partial L_D}{\partial W}=-\left(m-D_W\right) \frac{\partial D_W}{\partial W} \end{gathered}\]
  • Contrastive loss is often used in image retrieval tasks to learn discriminative features for images. During training, an image pair is fed into the model along with its ground-truth relationship \(y\), which equals 1 if the two images are similar and 0 otherwise. Note that this labeling is the reverse of the paper’s \(Y\) above, where \(Y=0\) marks a similar pair. The loss function for a single pair is:

    \[y d^2+(1-y) \max (\operatorname{margin}-d, 0)^2\]
    • where \(d\) is the Euclidean distance between the two image features \(f_1\) and \(f_2\): \(d=\left\|f_1-f_2\right\|_2\). The \(\operatorname{margin}\) term is used to “tighten” the constraint: if two images in a pair are dissimilar, then their distance should be at least \(\operatorname{margin}\), or a loss will be incurred.
  • Shown below are the results from the paper which are quite convincing:

  • Note that while this is one of the earliest of the contrastive losses, this is not the only one. For instance, the contrastive loss used in SimCLR is quite different.
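  • The margin-based contrastive loss given above, in the paper's convention (\(Y=0\) for a similar pair, \(Y=1\) for a dissimilar pair), can be written directly in Python. The embeddings here are hand-picked toy vectors standing in for the network outputs \(G_W(\vec{X})\).

```python
import math

def euclidean(a, b):
    """Euclidean distance D_W between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(g1, g2, y, margin=1.0):
    """(1 - Y) * 0.5 * D^2 + Y * 0.5 * max(0, m - D)^2, with Y=0 similar and Y=1 dissimilar."""
    d = euclidean(g1, g2)
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2

# Similar pair: penalized by its squared distance.
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0))
# Dissimilar pair inside the margin: penalized for being too close.
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1, margin=1.0))
# Dissimilar pair beyond the margin: contributes no loss.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1, margin=1.0))
```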


Reducing the Dimensionality of Data with Neural Networks
  • High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution.
  • This paper by Hinton and Salakhutdinov in Science in 2006 describes an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.


What Every Programmer Should Know About Memory
  • This must-read paper by Drepper from Red Hat in 2007 offers a detailed treatment on how system memory works.


ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning
  • This paper by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li introduces the Adaptive Synthetic (ADASYN) sampling method, addressing the challenge of learning from imbalanced datasets in machine learning.
  • ADASYN focuses on generating synthetic data for the minority class, emphasizing those samples that are more difficult to learn. This approach aims to reduce learning bias caused by class imbalance and adaptively shift the classification decision boundary towards challenging examples.
  • The authors propose a weighted distribution mechanism for generating more synthetic data for hard-to-learn minority class examples, while fewer or no synthetic data are generated for easier ones.
  • Simulation analysis on several machine learning datasets demonstrates the effectiveness of ADASYN across various evaluation metrics, showing improved handling of class imbalances compared to existing methods.
  • The plot below from the paper shows the performance of the ADASYN algorithm for imbalanced learning.

  • The paper emphasizes the utility of ADASYN in providing balanced class distribution and in focusing the learning process on the more challenging aspects of the minority class, contributing significantly to the field of imbalanced learning.
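  • The weighted distribution mechanism can be sketched as follows: for each minority example, measure the fraction of its \(K\) nearest neighbors that belong to the majority class, normalize these fractions into a distribution, and allocate the synthetic-sample budget in proportion. The toy 2-D data, the parameter names, and the rounding of counts are assumptions for illustration; generating the samples themselves (by interpolating toward minority neighbors) is omitted.

```python
import math

def adasyn_allocation(minority, majority, k=3, beta=1.0):
    """Return how many synthetic samples to generate for each minority example."""
    budget = (len(majority) - len(minority)) * beta
    ratios = []
    for i, x in enumerate(minority):
        # (distance, is_majority) for every other point in the dataset.
        candidates = [(math.dist(x, p), 0) for j, p in enumerate(minority) if j != i]
        candidates += [(math.dist(x, p), 1) for p in majority]
        nearest = sorted(candidates)[:k]
        ratios.append(sum(flag for _, flag in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority point borders the majority class
    return [round(r / total * budget) for r in ratios]

minority = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
majority = [(1.1, 1.0), (0.9, 1.1), (1.0, 0.8), (2.0, 2.0), (2.1, 2.0), (2.0, 2.2), (3.0, 3.0)]
counts = adasyn_allocation(minority, majority, k=3)
print(counts)
```

  • The minority point at (1.0, 1.0) is surrounded by majority examples, so it receives the largest share of the synthetic samples.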


Large-scale Deep Unsupervised Learning using Graphics Processors
  • The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. They consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications.
  • Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples.
  • This must-read paper by Raina et al. from Andrew Ng’s lab at Stanford in ICML 2009 was the first to introduce deep learning on GPUs by suggesting massively parallel methods to help resolve the aforementioned problems.
  • They argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. They develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. They show that these principles can be applied to successfully scaling up learning algorithms for both DBNs and sparse coding.
  • Their implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, they are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, they develop a simple, inherently parallel algorithm, that leads to a 5 to 15-fold speedup over previous methods.
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
  • The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.
  • This paper by Kohavi et al. from Microsoft in KDD 2007 provides a practical guide to conducting online experiments, where end-users can help guide the development of features. Their experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO).
  • They provide several examples of controlled experiments with surprising results. They review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). They focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction.
  • They describe common architectures for experimentation systems and analyze their advantages and disadvantages. They evaluate randomization and hashing techniques, which they show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements.
  • Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on their extensive practical experience with multiple systems and organizations, they share key lessons that will help practitioners in running trustworthy controlled experiments.
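  • Two of the practical ingredients discussed above, sample size and hash-based assignment, can be sketched in a few lines. The sample-size function uses the common rule of thumb \(n = 16\sigma^2/\Delta^2\) users per variant (roughly 80% power at 95% confidence); the input numbers, the experiment name, and the bucketing scheme are illustrative assumptions rather than a description of any production system.

```python
import hashlib
import math

def users_per_variant(std_dev, min_detectable_diff):
    """Rule-of-thumb sample size per variant: n = 16 * sigma^2 / delta^2."""
    return math.ceil(16 * std_dev ** 2 / min_detectable_diff ** 2)

def assign_variant(user_id, experiment_id, treatment_percent=50):
    """Deterministically map a user to a variant by hashing (experiment, user)."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_percent else "control"

# Hypothetical metric: standard deviation of 30, smallest change worth detecting 0.1875.
print(users_per_variant(std_dev=30, min_detectable_diff=0.1875))

# The same user always lands in the same variant for a given experiment.
print(assign_variant("user-42", "checkout-button-color"))
```

  • Including the experiment identifier in the hash keeps assignments independent across experiments, so a user in Treatment for one test is not systematically in Treatment for another.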
Curriculum Learning
  • The paper by Bengio et al. from the University of Montreal and NEC Laboratories America, presented at ICML 2009, introduces the concept of “Curriculum Learning” for machine learning, drawing parallels to human learning where the organization and complexity of learning materials significantly impact learning effectiveness.
  • The authors establish a foundation for curriculum learning in the context of deep deterministic and stochastic neural networks, particularly in the presence of non-convex training criteria. The experiments demonstrate notable improvements in generalization through curriculum learning.
  • The paper posits that curriculum learning affects both the speed of convergence during training and the quality of local minima achieved in non-convex optimization problems. This approach is likened to a continuation method, a strategy for global optimization in non-convex functions.
  • Several experiments across different domains, including vision and language tasks, show the efficacy of curriculum learning. Simple multi-stage curriculum strategies resulted in enhanced generalization and faster convergence.
  • The plot below from the paper shows the average error rate of the perceptron, with or without the curriculum. Top: the number of nonzero irrelevant inputs determines easiness. Bottom: the margin \(yw'x\) determines easiness.

  • In language modeling, the paper examines training a model to predict the next word in an English sentence. Using a curriculum approach, the authors demonstrate a statistically significant improvement in test set performance compared to a non-curriculum approach.
  • The research opens up new perspectives on machine learning training methodologies, suggesting that carefully designing the sequence and complexity of training examples can lead to better performance, especially in complex neural network architectures.
  • The paper suggests future exploration in understanding why certain curriculum strategies work better than others and automating the process of curriculum design, potentially leveraging active learning principles.
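  • A simple multi-stage curriculum of the kind described above can be sketched as: rank the training examples by a difficulty score, then train on progressively larger (and harder) prefixes of that ranking. The difficulty measure used here, sentence length in words, is a hypothetical stand-in; the paper uses task-specific notions of easiness.

```python
import math

def curriculum_stages(examples, difficulty, num_stages=3):
    """Return one training set per stage; each stage adds harder examples to the previous one."""
    ranked = sorted(examples, key=difficulty)
    stages = []
    for s in range(1, num_stages + 1):
        cutoff = math.ceil(len(ranked) * s / num_stages)
        stages.append(ranked[:cutoff])
    return stages

sentences = [
    "the cat sat on the mat",
    "hello",
    "a b c d e",
    "good morning",
    "one two three",
    "this has four words",
]
stages = curriculum_stages(sentences, difficulty=lambda s: len(s.split()), num_stages=3)
for i, stage in enumerate(stages, start=1):
    print("stage", i, "->", len(stage), "examples")
```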


SMOTE: Synthetic Minority Over-sampling Technique
  • This paper by Chawla et al. from University of South Florida introduces an approach to the construction of classifiers from imbalanced datasets.
  • A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class.
  • This paper shows that a combination of the proposed method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of their method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes.
  • Their method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier.
  • The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
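  • The synthetic-example step can be sketched as: pick a minority example, pick one of its \(k\) nearest minority-class neighbors, and create a new point at a random position on the line segment between them. This sketch samples the starting example at random, a simplification of looping over every minority example; the toy data and parameter names are assumptions.

```python
import math
import random

def smote(minority, num_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating toward nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(num_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to the neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
synthetic = smote(minority, num_synthetic=5)
for point in synthetic:
    print(tuple(round(c, 3) for c in point))
```

  • Because new points are interpolated rather than duplicated, the classifier sees a broader minority region instead of the same few examples repeated.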


Acoustic Modeling using Deep Belief Networks
  • At the time of writing, Gaussian mixture models were the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition.
  • This paper by Mohamed et al. from Hinton’s lab at UofT in IEEE Transactions on Audio, Speech, and Language Processing 2012 showed that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models with deep neural networks that contain many layers of features and a very large number of parameters.
  • These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, they perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.
Improving neural networks by preventing co-adaptation of feature detectors
  • This paper by Hinton et al. in 2012 introduced Dropout as a way to avoid overfitting.
  • When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This overfitting is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.
  • Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
  • Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
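  • A minimal sketch of the mechanism: during training each hidden activation is omitted with probability 0.5, and at test time all units are kept but their contribution is halved so that the expected input to the next layer matches training. Scaling activations at test time, as done here, is equivalent to halving the outgoing weights; the function names and toy activations are assumptions.

```python
import random

def dropout_train(activations, rng, drop_prob=0.5):
    """Training: zero each activation independently with probability `drop_prob`."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

def dropout_test(activations, drop_prob=0.5):
    """Test: keep every unit but scale by the keep probability."""
    return [a * (1.0 - drop_prob) for a in activations]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout_train(hidden, rng))  # a different random subset survives on each call
print(dropout_test(hidden))

# Averaged over many random masks, the training-time sum approaches the test-time sum.
trials = 5000
mean_train_sum = sum(sum(dropout_train(hidden, rng)) for _ in range(trials)) / trials
print(mean_train_sum, sum(dropout_test(hidden)))
```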
Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
  • Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher’s experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale — thousands of experiments now — has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory.
  • This paper by Kohavi et al. from Microsoft in KDD 2012 presents the authors’ learnings as they happened: puzzling outcomes of controlled experiments that they analyzed deeply to understand and explain. Each of these took multiple-person weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments.
  • At Microsoft’s Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, thus getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.
  • The topics they cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.


Dropout: A Simple Way to Prevent Neural Networks from Overfitting
  • This paper by Srivastava et al. from Hinton’s lab in JMLR 2014 introduced Dropout, which (just like Batchnorm) is now part of the standard recipe for regularizing deep neural nets.
  • Please refer to the Dropout primer for a detailed discourse on Dropout.
Intriguing properties of neural networks
  • Deep neural networks are highly expressive models that have recently achieved state-of-the-art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties.
  • This paper by Szegedy et al. from Google, NYU, University of Montreal, and Facebook reports two such properties and, most notably, introduces adversarial examples in the context of deep learning.
  • First, they find that there is no distinction between individual high-level units and random linear combinations of high-level units, according to various methods of unit analysis. This suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
  • Second, they find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. They can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, trained on a different subset of the dataset, to misclassify the same input.


ADAM: A Method for Stochastic Optimization
  • This paper by Kingma and Ba in ICLR 2015 introduces Adam (derived from adaptive moment estimation), an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.
  • It is a fusion of RMSProp with momentum: it maintains exponentially weighted moving averages of the gradient (the first moment) and of the squared gradient (the second raw moment), whose decay rates are controlled by the hyperparameters \(\beta_1\) and \(\beta_2\) respectively.
  • The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. They also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, they discuss AdaMax, a variant of Adam based on the infinity norm.
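  • The update described above can be sketched in a few lines. This is a minimal pure-Python illustration on a toy quadratic, not the authors' reference code; `adam_minimize` and `grad_fn` are hypothetical names, and the step size of 0.01 is chosen for the toy problem (the paper suggests 0.001 as a default, with \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\)).

```python
import math

def adam_minimize(grad_fn, theta, steps=2000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam for a fixed number of steps and return the final parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)   # moving average of gradients (first moment)
    v = [0.0] * len(theta)   # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction for zero init
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(x, y) = (x - 3)^2 + 10 * (y + 1)^2, minimized at (3, -1).
def toy_grad(th):
    return [2.0 * (th[0] - 3.0), 20.0 * (th[1] + 1.0)]

solution = adam_minimize(toy_grad, [0.0, 0.0])
```

  • Because each step is normalized by the running gradient magnitude, the two coordinates make similar progress even though their curvatures differ by a factor of ten.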
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • This paper by Ioffe and Szegedy from Google in ICML 2015 introduced BatchNorm, which is now commonly implemented to accelerate training of deep neural nets.
  • Also, check out this in-depth article on BatchNorm here.


XGBoost: A Scalable Tree Boosting System
  • This paper by Chen and Guestrin from UW in 2016 proposes eXtreme Gradient Boosting (XGBoost), a scalable end-to-end tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems.
  • They propose a novel sparsity-aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate tree learning.
  • Their experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well.
  • By combining these insights, XGBoost is able to solve real-world scale problems using far fewer resources than existing systems.
Layer Normalization
  • Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks.
  • This paper by Ba et al. from Hinton’s lab in 2016 introduces layer normalization (LayerNorm), which transposes batch normalization: the mean and variance used for normalization are computed from all of the summed inputs to the neurons in a layer on a single training case. As in batch normalization, each neuron is also given its own adaptive bias and gain, which are applied after the normalization but before the non-linearity.
  • Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step.
  • The following figure from the paper shows that LayerNorm is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, they show that layer normalization can substantially reduce the training time compared with previously published techniques.
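  • To make the computation concrete, here is a minimal sketch of layer normalization for a single training case. The function name `layer_norm` and the small constant `eps` (added for numerical stability) are assumptions of this example rather than details taken from the summary above.

```python
import math

def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize one example across the units of a layer.

    The statistics come from the layer's own summed inputs, so no batch
    is involved and the same code runs at training and test time.
    """
    n = len(summed_inputs)
    mean = sum(summed_inputs) / n
    var = sum((a - mean) ** 2 for a in summed_inputs) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (a - mean) * inv_std + b
            for a, g, b in zip(summed_inputs, gain, bias)]

h = [2.0, 4.0, 6.0, 8.0]
out = layer_norm(h, gain=[1.0] * 4, bias=[0.0] * 4)
# Rescaling all summed inputs leaves the normalized output (almost) unchanged.
out_scaled = layer_norm([10.0 * a for a in h], gain=[1.0] * 4, bias=[0.0] * 4)
```

  • For a recurrent network, the same function would simply be called on the summed inputs at each time step.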

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
  • This paper by Malkov and Yashunin from Institute of Applied Physics of the Russian Academy of Sciences and Yandex presents a new approach for the approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy (Hierarchical NSW, HNSW).
  • The proposed solution is fully graph-based, without any need for additional search structures, which are typically used at the coarse search stage of most proximity graph techniques.
  • Hierarchical NSW incrementally builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers) for nested subsets of the stored elements. The maximum layer in which an element is present is selected randomly with an exponentially decaying probability distribution. This allows producing graphs similar to the previously studied Navigable Small World (NSW) structures while additionally having the links separated by their characteristic distance scales.
  • Starting the search from the upper layer, together with utilizing the scale separation, boosts the performance compared to NSW and allows a logarithmic complexity scaling. Additional employment of a heuristic for selecting proximity graph neighbors significantly increases performance at high recall and in case of highly clustered data. Performance evaluation has demonstrated that the proposed general metric space search index is able to strongly outperform previous open-source state-of-the-art vector-only approaches. Similarity of the algorithm to the skip list structure allows straightforward balanced distributed implementation.
  • The following figure from the paper illustrates the Hierarchical NSW idea. The search starts from an element from the top layer (shown red). Red arrows show direction of the greedy algorithm from the entry point to the query (shown green).
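  • The random layer assignment can be sketched as follows. To the best of my reading, the paper draws the maximum layer as the floor of \(-\ln(u) \cdot m_L\) with \(u\) uniform on (0, 1) and suggests \(m_L = 1/\ln(M)\); the function name `sample_max_layer` and the value `M = 16` are assumptions of this example.

```python
import math
import random
from collections import Counter

def sample_max_layer(m_l, rng):
    """Draw the top layer of a new element; higher layers are exponentially rarer."""
    u = 1.0 - rng.random()            # uniform on (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)    # floor, since the product is non-negative

M = 16                                # assumed number of links per element
m_l = 1.0 / math.log(M)               # normalization factor for the level draw
rng = random.Random(42)

n = 100000
levels = Counter(sample_max_layer(m_l, rng) for _ in range(n))
frac_layer0_only = levels[0] / n      # about 1 - 1/M of elements stay on layer 0
```

  • With this choice, roughly one element in `M` reaches layer 1, one in `M**2` reaches layer 2, and so on, which mirrors the skip-list analogy mentioned above.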


Axiomatic Attribution for Deep Networks
  • This paper by Sundararajan et al. from Google in ICML 2017 studies the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works.
  • They identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. They show that these axioms are not satisfied by most known attribution methods, which they consider to be a fundamental weakness of those methods.
  • They use the axioms to guide the design of a new attribution method called Integrated Gradients.
  • Their method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator.
  • Since the method applies across input modalities, they apply it to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
  • Since integrated gradients add up to the final prediction score, the magnitudes can be used to account for the contributions of each feature. For instance, for the molecule in the figure, atom-pairs that have a bond between them cumulatively contribute to 46% of the prediction score, while all other pairs cumulatively contribute to only −3%.
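  • As a minimal sketch of the method, the code below approximates integrated gradients with a midpoint Riemann sum along the straight line from a baseline to the input. The tiny logistic model, its weights `W` and `B`, the all-zeros baseline, and the function names are all hypothetical, and the gradient is written analytically here in place of a framework's gradient operator.

```python
import math

W = [1.5, -2.0, 0.5]   # hypothetical model weights
B = 0.1                # hypothetical bias

def model(x):
    """Toy differentiable model: logistic regression output."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def model_grad(x):
    """Analytic gradient of the model output with respect to the input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the baseline-to-input path, then scale by (x - baseline)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps   # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
completeness_gap = sum(attributions) - (model(x) - model(baseline))
```

  • The value of `completeness_gap` should be close to zero, which reflects the property that the attributions add up to the difference between the prediction at the input and at the baseline.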
Decoupled Weight Decay Regularization
  • L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but, as the authors demonstrate, this is not the case for adaptive gradient algorithms such as Adam.
  • This paper by Loshchilov and Hutter from University of Freiburg in ICLR 2019 proposes Adam with decoupled weight decay (AdamW), a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), they identify and expose the inequivalence of L2 regularization and weight decay for Adam.
  • They provide empirical evidence that their proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam’s generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). They empirically show that AdamW yields substantially better generalization performance than the common implementation of Adam with L2 regularization. They also propose using warm restarts for Adam to improve performance.
  • Their results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. While they focus their experimental analysis on Adam, they believe that similar results also hold for other adaptive gradient methods, such as AdaGrad (Duchi et al., 2011) and AMSGrad (Reddi et al., 2018).
  • AdamW has been implemented in TensorFlow and PyTorch.
  • Code.
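  • The inequivalence can be seen on a single scalar weight. The sketch below is an illustration under simplifying assumptions, not the authors' code: `adam_step` is a hypothetical helper, the loss gradient is set to zero so that only the regularizer acts, and the decoupled decay is scaled by the learning rate as many library implementations do.

```python
import math

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam step on a scalar, with either L2 regularization or decoupled decay."""
    if not decoupled:
        grad = grad + weight_decay * theta       # L2 penalty folded into the gradient
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_theta -= lr * weight_decay * theta   # decay bypasses the adaptive scaling
    return new_theta

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

theta0 = 2.0
# Loss gradient is zero, so any movement comes from the regularizer alone.
l2_small = adam_step(theta0, 0.0, new_state(), weight_decay=0.01)
l2_large = adam_step(theta0, 0.0, new_state(), weight_decay=0.1)
dec_small = adam_step(theta0, 0.0, new_state(), weight_decay=0.01, decoupled=True)
dec_large = adam_step(theta0, 0.0, new_state(), weight_decay=0.1, decoupled=True)
```

  • In this toy setting the L2 variant shrinks the weight by about `lr` regardless of the decay factor, because the penalty is normalized away by the adaptive scaling, while the decoupled variant shrinks it in proportion to `weight_decay`.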
On Calibration of Modern Neural Networks
  • Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced.
  • This paper by Guo et al. from Cornell University in ICML 2017 proposes temperature scaling. They begin by discovering that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, they observe that model capacity (in terms of depth and width), weight decay (regularization), and Batch Normalization are important factors that affect calibration while improving accuracy.
  • They evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets.
  • They suggest that simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling – a single-parameter variant of Platt Scaling – is the simplest, fastest, and most straightforward of the methods at calibrating predictions.
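  • A minimal sketch of temperature scaling follows: a single scalar `T` divides the logits before the softmax and is chosen to minimize the negative log-likelihood on held-out data. The grid search, the function names, and the four hand-made validation examples are simplifying assumptions of this example; any one-dimensional optimizer could be used instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the distribution)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature with the lowest validation NLL from a coarse grid."""
    grid = grid or [0.05 * k for k in range(1, 201)]   # 0.05 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation logits: confident everywhere, but wrong on two rows.
val_logits = [[8.0, 0.0, 0.0], [7.0, 1.0, 0.0], [0.0, 9.0, 1.0], [6.0, 0.0, 5.0]]
val_labels = [0, 1, 1, 2]
T = fit_temperature(val_logits, val_labels)
calibrated = [softmax(row, T) for row in val_logits]
```

  • Dividing by a positive temperature never changes which class has the largest probability, so accuracy is untouched and only the confidence changes.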
Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers
  • For optimal decision making under variable class distributions and misclassification costs a classifier needs to produce well-calibrated estimates of the posterior probability. Isotonic calibration is a powerful non-parametric method that is however prone to overfitting on smaller datasets; hence a parametric method based on the logistic curve is commonly used.
  • This paper by Kull et al. from University of Bristol and Universidade Federal de Pernambuco demonstrates that while logistic calibration is designed for normally distributed per-class scores, many classifiers including Naive Bayes and Adaboost suffer from a particular distortion where these score distributions are heavily skewed. In such cases logistic calibration can easily yield probability estimates that are worse than the original scores. Moreover, the logistic curve family does not include the identity function, and hence logistic calibration can easily uncalibrate a perfectly calibrated classifier.
  • The paper seeks to solve all these problems with a richer class of calibration maps based on the beta distribution. They derive the method from first principles and show that fitting it is as easy as fitting a logistic curve.
  • Extensive experiments show that beta calibration is superior to logistic calibration for Naive Bayes and Adaboost.
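  • As I understand the paper, the beta calibration map has three parameters and can be written as a logistic function of \(\ln s\) and \(-\ln(1 - s)\), which is why it can be fitted with ordinary logistic regression on those two features. The sketch below only evaluates the map; the function name and the example parameter values are hypothetical.

```python
import math

def beta_calibration_map(s, a, b, c):
    """Evaluate mu(s) = 1 / (1 + 1 / (exp(c) * s**a / (1 - s)**b)) for s in (0, 1)."""
    log_odds = c + a * math.log(s) - b * math.log(1.0 - s)
    return 1.0 / (1.0 + math.exp(-log_odds))

scores = [0.1, 0.5, 0.9]
# a = b = 1 and c = 0 recover the identity map, which the logistic family cannot express.
identity = [beta_calibration_map(s, 1.0, 1.0, 0.0) for s in scores]
# Unequal a and b give an asymmetric map suited to skewed score distributions.
skewed = beta_calibration_map(0.5, 2.0, 1.0, 0.0)
```

  • The identity case illustrates the point made above: an already calibrated classifier can be left unchanged, whereas a logistic map would distort it.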
Understanding Black-box Predictions via Influence Functions
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Koh and Liang in ICML 2017 from Percy Liang’s group at Stanford introduces influence functions that originated from robust statistics to explain individual instance predictions.
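  • The method uses the inverse of the Hessian (the matrix of second-order derivatives of the empirical risk) to approximate how the learned parameters, and hence the loss on a test point, would change if a training point were upweighted, without retraining the model.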
  • Although the authors propose a few approximation methodologies to calculate the inverse Hessian matrix, the amount of computation involved in this calculation is a drawback of the work.
  • Additionally, as discussed in the TracIn (Pruthi et al.) paper, the optimality condition for the approximation (with respect to the empirical risk) is hard to achieve in practice, especially for complicated deep neural networks.
  • As shown in the experimental part, this work can be used to identify influential training data points for the model, and the authors showed that this method could be further extended to several use cases, including understanding model behaviors as well as the influence of adversarial examples, detecting the mismatch between training distribution and test distribution, and identifying mislabelled data points.
  • Code.
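  • For reference, the central quantity can be written as below, following my reading of the paper's notation: \(\hat\theta\) denotes the parameters that minimize the empirical risk over the \(n\) training points \(z_i\), \(L\) is the per-example loss, and the expression gives the influence of upweighting a training point \(z\) on the loss at a test point \(z_{\text{test}}\).

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}\,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

  • The inverse Hessian in the middle is the expensive term that the approximation methods mentioned above are meant to avoid forming explicitly.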
Mixed Precision Training
  • Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase.
  • This paper by Micikevicius et al. from Baidu Research and Nvidia in ICLR 2018 introduces a technique to train deep neural networks using half-precision floating-point numbers. In their technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating-point numbers have a limited numerical range compared to single-precision numbers.
  • They propose two techniques to handle this loss of information. Firstly, they recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training.
  • Secondly, they propose scaling the loss appropriately to handle the loss of information with half-precision gradients. They demonstrate that this approach works for a wide variety of models including convolutional neural networks, recurrent neural networks and generative adversarial networks.
  • This technique works for large scale models with more than 100 million parameters trained on large datasets. Using this approach, they can reduce the memory consumption of deep learning models by nearly 2x. In future processors, they can also expect a significant computation speedup using half-precision hardware units.
StarSpace: Embed All The Things!
  • This paper by Wu et al. from FAIR presents StarSpace, a general-purpose neural embedding model that can solve a wide variety of problems: labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filtering-based or content-based recommendation, embedding of multi-relational graphs, and learning word, sentence or document level embeddings.
  • In each case the model works by embedding those entities comprised of discrete features and comparing them against each other – learning similarities dependent on the task.
  • Empirical results on a number of tasks show that StarSpace is highly competitive with existing methods, whilst also being generally applicable to new cases where those methods are not.


Model Cards for Model Reporting
  • Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, they recommend that released models be accompanied by documentation detailing their performance characteristics.
  • This paper by Mitchell et al. from Google and UofT proposes a framework that they call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
  • While they focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, they provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. They propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works.
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
  • The following paper summary has been contributed by Zhibo Zhang.
  • Many existing works on explainability focus on feature attribution, which attributes an importance score to each individual input feature. However, the individual features themselves do not necessarily have semantic meanings.
  • This paper by Kim et al. from Google in ICML 2018 introduced concept-based explanations using Concept Activation Vectors (CAVs) in neural networks to capture the importance of human-friendly high-level concepts.
  • This methodology adopts two sets of input examples: one set that contains instances with the concept of interest, and another set that contains instances without it. The concept activation vector is defined to be the vector orthogonal to the decision boundary of the linear classifier that separates the intermediate representations of the two sets of data instances. The sensitivity of a particular data class (e.g., the zebra class, as in the paper) with respect to the concept in question (e.g., the ‘striped’ concept) can then be calculated using a directional derivative.
  • The drawback of this approach is that a linear classifier needs to be trained separately for each concept through the set of examples collected, which implies incurring extra time in collecting representative data instances and training the classifier.
  • The authors showed several use cases that adopted TCAV (Testing with Concept Activation Vectors) to better understand the learned model and predictions, including sorting images by similarity with respect to a concept of interest. The authors also conducted quantitative sanity checks by adding captions to the images and tuning the probability of noise in the captions, which showed that the concepts captured by TCAV closely match what the neural network focuses on to make predictions.
  • Code.
Representer Point Selection for Explaining Deep Neural Networks
  • The following paper summary has been contributed by Zhibo Zhang.
  • This paper by Yeh et al. from CMU introduces a method for selecting representer points for any given instance prediction. Relying on the representer theorem, the pre-activation value of the individual data instance can be decomposed into a linear combination of the training points’ activations. The weight corresponds to either positive contributions (if the weight is positive) or negative contributions (if the weight is negative) towards the prediction of the data instance in question.
  • Through experiments, the authors of the paper showed that this method can be used to efficiently detect and fix mislabelled training data points. It outperformed influence functions by 2% on test accuracy score with the same amount of training data (by fixing the mislabelled ones detected in those data) on the CIFAR-10 dataset. In addition, the authors showed that Representer Point Selection is capable of picking out more representative positive and negative examples for given data instances compared to influence functions from a qualitative perspective. Thus, this method can also be used by machine learning experts to understand misclassified examples.
  • Furthermore, compared to influence functions, Representer Point Selection is much faster in practice.
  • Code.
Mixed Precision Training
  • Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase.
  • This paper by Micikevicius et al. from Baidu Research and Nvidia in ICLR 2018 introduces a technique to train deep neural networks using half-precision floating-point numbers. In their technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating-point numbers have a limited numerical range compared to single-precision numbers.
  • They propose two techniques to handle this loss of information. Firstly, they recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Secondly, they propose scaling the loss appropriately to handle the loss of information with half-precision gradients.
  • They demonstrate that the latter approach works for a wide variety of large scale models including convolutional neural networks, recurrent neural networks, and generative adversarial networks with more than 100 million parameters trained on large datasets. For certain models with a large number of small gradient values, this loss/gradient scaling method helps them converge to the same accuracy as FP32 baseline models.
  • Mixed precision training is an important technique that allows us to reduce the memory consumption as well as time spent in memory and arithmetic operations of deep neural networks. They demonstrate that many different deep learning models can be trained using this technique with no loss in accuracy without any hyper-parameter tuning. Using this approach, they can reduce the memory consumption of deep learning models by nearly 2x. For half-precision optimized hardware, they can also expect a significant computation speedup using half-precision hardware units.
  • DNN operations benchmarked with DeepBench on Volta GPU see 2-6x speedups compared to FP32 implementations if they are limited by memory or arithmetic bandwidth. Speedups are lower when operations are latency-limited.
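  • Both techniques can be illustrated numerically without a GPU. The sketch below rounds values to IEEE half precision using the standard library's `struct` module; the helper `to_fp16`, the gradient value, the update size, and the loss scale of 1024 are hypothetical choices for this example, and a Python float stands in for the single-precision master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Loss scaling: a tiny gradient underflows to zero in FP16 unless it is scaled first.
true_grad = 1e-8
loss_scale = 1024.0
naive = to_fp16(true_grad)                 # underflows to 0.0
scaled = to_fp16(true_grad * loss_scale)   # representable after scaling
recovered = scaled / loss_scale            # unscale in higher precision

# Master weights: small updates vanish when accumulated directly in FP16.
update = 1e-4
w_half = to_fp16(1.0)
w_master = 1.0                             # higher-precision master copy (stand-in for FP32)
for _ in range(100):
    w_half = to_fp16(w_half - update)      # rounds back to 1.0 every time
    w_master -= update                     # accumulates the updates
w_for_forward = to_fp16(w_master)          # rounded copy used in the forward pass
```

  • After the loop the half-precision weight has not moved at all, while the master copy reflects all one hundred updates, which is the motivation for keeping it.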


Fast Transformer Decoding: One Write-Head is All You Need
  • Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large “keys” and “values” tensors.
  • This paper by Shazeer from Google in 2019 proposes a variant called multi-query attention, where the keys and values are shared across all of the different attention “heads”, greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding.
  • They verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
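  • The idea can be sketched for a single decoding step as follows. This is an illustrative pure-Python version, not the paper's code: the function names, the omission of the learned projections, and the cache-size helper are assumptions of this example.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(dim)]

def multi_query_attention(queries_per_head, shared_keys, shared_values):
    """Each head has its own query but reads the same keys and values."""
    return [attend(q, shared_keys, shared_values) for q in queries_per_head]

def kv_cache_floats(seq_len, n_heads, d_head, multi_query):
    """Number of cached key/value scalars per layer for one sequence."""
    kv_heads = 1 if multi_query else n_heads
    return 2 * seq_len * kv_heads * d_head

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
heads = multi_query_attention([[0.0, 0.0], [20.0, 0.0]], keys, values)
ratio = kv_cache_floats(1024, 8, 64, False) / kv_cache_floats(1024, 8, 64, True)
```

  • In this sketch the cached keys and values shrink by a factor equal to the number of heads, which is the source of the reduced memory traffic during incremental decoding.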
Similarity of Neural Network Representations Revisited
  • Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. Measuring similarity between the representations learned by neural networks is an ill-defined problem, since it is not entirely clear what aspects of the representation a similarity index should focus on. Previous work has suggested that there is little similarity between intermediate layers of neural networks trained from different random initializ