## Papers List

• A curated set of papers I’ve reviewed for my latest scoop in AI/ML.

## Seminal Papers / Need-to-know

### Vision

#### 2010

##### Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
• This paper by Gutmann and Hyvarinen in AISTATS 2010 introduced the concept of negative sampling that forms the basis of contrastive learning.
• They propose a new estimation principle for parameterized statistical models, noise-contrastive estimation, which discriminates between observed data and artificially generated noise. This is accomplished by performing nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. They show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance.
• In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter.
• For a tractable ICA model, they compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling.
• Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency.
• They apply the method to the modeling of natural images and show that the method can successfully estimate a large-scale two-layer model and a Markov random field.

#### 2012

##### ImageNet Classification with Deep Convolutional Neural Networks
• The original AlexNet paper by Krizhevsky et al. from NeurIPS 2012 that started it all. This trail-blazer introduced Deep Learning to the world :)
##### 3D Convolutional Neural Networks for Human Action Recognition
• This paper by Ji et al. from Arizona State University in IEEE PAMI 2012 introduced 3D CNNs.

#### 2013

##### Visualizing and Understanding Convolutional Networks
• This legendary paper by Zeiler and Fergus from the Courant Institute, NYU in 2013 seeks to demystify why CNNs perform so well on image classification, or how they might be improved. This paper seeks to address both issues.
• They introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
• They also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et. al on the ImageNet classification benchmark.
• They show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

#### 2014

• This paper by Goodfellow et al. from NeurIPS 2014 proposes a new framework called Generative Adversarial Networks (GANs) that estimates generative models via an adversarial process that corresponds to a zero-sum minimax two-player game. In this process, two models are simultaneously trained: a generative model $$G$$ that captures the data distribution, and a discriminative model $$D$$ that estimates the probability that a sample came from the training data rather than $$G$$. The training procedure for $$G$$ is to maximize the probability of $$D$$ making a mistake. In the space of arbitrary functions $$G$$ and $$D$$, a unique solution exists, with $$G$$ recovering the training data distribution and $$D$$ equal to \frac{1}{2} everywhere. In the case where $$G$$ and $$D$$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation.
• There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
• Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

#### 2015

##### Very Deep Convolutional Networks for Large-Scale Image Recognition
• This paper by Simonyan and Zisserman from DeepMind and Oxford in ICLR 2015 proposed the VGG architecture. They showed that a significant performance improvement can be achieved by pushing the depth to 16-19 weight layers, i.e., VGG-16 and VGG-19.
• The main principle is that using a stack of $$3 \times 3$$ convolution filters are better than a single $$7 \times 7$$ layer. Firstly, because they use three non-linear activations (instead of one), which makes the function more discriminative. Secondly, the $$3 \times 3$$ design decreases the number of parameters – specifically, you need $$3 \times (3^2)C^2 = 27C^2$$ weights, compared to a $$7 \times 7$$ conv layer which would require $$1 \times (7^2)C^2 = 49C^2$$ parameters (81% more).
##### Going Deeper with Convolutions
• This paper by Szegedy et al. from Google in CVPR 2015 introduced the Inception (also known as GoogLeNet or InceptionNet) architecture which achieved state of the art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.
• Ideas from the paper:
• Increased the depth (number of layers) is not the only way to make a model bigger. What about increasing both the depth and width of the network while keeping computations at a manageable level? This time the inspiration comes from the human visual system, wherein information is processed at multiple scales and then aggregated locally. How do you achieve this without a memory explosion? The answer is with $$1 \times 1$$ convolutions! The main purpose is channel dimensionality reduction, by reducing the output channels of the input. Next, $$1 \times 1$$ convolutions are used to compute reductions before the computationally expensive convolutions ($$3 \times 3$$ and $$5 \times 5$$). Inception uses convolutions of different kernel sizes ($$5 \times 5$$, $$3 \times 3$$, $$1 \times 1$$) to capture details at multiple scales.
• To enable concatenation of features convolved with different kernels, they pad the output to make it the same size as the input. To find the appropriate padding with single stride convs without dilation, padding $$p$$ and kernel $$k$$ are defined so that $$out=in$$ (i.e., input and output have the same spatial dimensions): $$p = (k-1)/2p$$ (since $$out = in + 2p - k + 1$$).
##### FaceNet: A Unified Embedding for Face Recognition and Clustering
• This paper by Schroff et al. from Google in 2015 proposes FaceNet, a system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
• Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, they use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our
• approach is much greater representational efficiency: they achieve state-of-the-art face recognition performance using only 128-bytes per face.
• Previous face recognition approaches based on deep networks use a classification layer trained over a set of known face identities and then take an intermediate bottle neck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network. In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Their triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin.
• Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, they present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, they also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
• The triplet loss minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities and encourages a relative distance constraint. Specifically, the Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Thus, network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances. Once this embedding has been produced, downstream tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
• On the widely used Labeled Faces in the Wild (LFW) dataset, their system achieves a new record accuracy of 99.63%, which cuts the error rate in comparison to the best published result by 30% on both datasets.
• They explore two different deep convolutional network architectures that have been recently used to great success in the computer vision community. The first architecture is based on the Zeiler&Fergus model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses which reduces the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
• They also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.
##### Distilling the Knowledge in a Neural Network
• This paper by Hinton et al. from Google in NeurIPS 2014 introduces a very simple way to improve the performance of almost any machine learning algorithm by training many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
• Caruana et al. have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and the authors develop this approach further using a different compression technique. They achieve some surprising results on MNIST and show that they can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. They also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. This shows that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.
• The results show that on MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is version of the one used by Android voice search, they have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
• For really big neural networks, it can be infeasible even to train a full ensemble, but have shown that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.
##### Deep Unsupervised Learning using Nonequilibrium Thermodynamics
• A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable.
• This paper by Dickstein et al. from Surya Ganguli’s lab at Stanford in 2015 develops an approach that simultaneously achieves both flexibility and tractability. They introduce a novel algorithm for modeling probability distributions that enables exact sampling and evaluation of probabilities and demonstrated its effectiveness on a variety of toy and real datasets, including challenging natural image datasets. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process.
• They then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows them to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model.
• For each of the tests they conduct, they use a similar basic algorithm, showing that our method can accurately model a wide variety of distributions. Most existing density estimation techniques must sacrifice modeling power in order to stay tractable and efficient, and sampling or evaluation are often extremely expensive. The core of their algorithm consists of estimating the reversal of a Markov diffusion chain which maps data to a noise distribution; as the number of steps is made large, the reversal distribution of each diffusion step becomes simple and easy to estimate.
• The result is an algorithm that can learn a fit to any data distribution, but which remains tractable to train, exactly sample from, and evaluate, and under which it is straightforward to manipulate conditional and posterior distributions.
• Github repo.

#### 2016

##### Rethinking the Inception Architecture for Computer Vision
• This paper by Szegedy et al. from Google in CVPR 2016 proposed InceptionV2, V3 by improving the Inception model based on the following principles:
• Using the same principle as VGG, the authors factorized $$5 \times 5$$ and $$7 \times 7$$ (in InceptionV3) convolutions to two and three $$3 \times 3$$ sequential convolutions respectively. This improves computational speed and utilizes far less parameters.
• Used spatially separable convolutions. Simply, a $$3 \times 3$$ kernel is decomposed into two smaller ones: a $$1 \times 3$$ and a $$3 \times 1$$ kernel, which are applied sequentially.
• Widened the inception modules (more number of filters).
• Distributed the computational budget in a balanced way between the depth and width of the network.
##### Deep Residual Learning for Image Recognition
• ResNet paper by He et al. from Facebook AI in CVPR 2016. Most cited in several AI fields.
• The issue of vanishing gradients when training a deep neural network was addressed with two tricks:
• Batch normalization and,
• Short skip connections
• Instead of $$H(x) = F(x)$$, the skip connection leads to $$H(x) = F(x) + x$$, which implies that the model is learning the difference (i.e., residual), $$F(x) = H(x) - x$$.
##### Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
• State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.
• This paper by Ren et al. from University of Science and Technology of China and Microsoft Research in 2016 proposes a Region Proposal Network (RPN) for efficient and accurate region proposal generation that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free.
• An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
• They further merge RPN and Fast R-CNN into a single network by sharing their convolutional features – using the recently popular terminology of neural networks with ‘attention’ mechanisms, the RPN component tells the unified network where to look.
• For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks.
• Faster R-CNN enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.
• Github repo.

##### You Only Look Once: Unified, Real-Time Object Detection
• Prior work on object detection repurposes classifiers to perform detection.
• This paper by Redmon et al. from Ali Farhadi’s group at UWash in 2016 presents YOLO, a new approach to object detection which frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
• A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
• Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
• YOLO is extremely fast and can thus be utilized for real-time object detection. The base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
• Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

#### 2017

##### Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
• This paper by Szegedy et al. from Google in AAAI 2017 introduced the latest versions of the Inception model – InceptionV4 and Inception-ResNet.
##### Photo-Realistic Single Image Super-Resolution using a GAN
• This paper by Ledig et al. from Twitter in CVPR 2017 applied GANs for single image super-resolution (SISR).
##### Understanding intermediate layers using linear classifier probes
• Neural network models have a notorious reputation for being black boxes.
• This paper by Alain and Bengio from Mila and the University of Montreal in ICLR 2017 proposes to monitor the features at every layer of a model and measure how suitable they are for classification.
• They use linear classifiers, which they refer to as “probes”, trained entirely independently of the model itself. This helps them better understand the roles and dynamics of the intermediate layers. They demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems.
• They apply this technique to the popular models Inception v3 and Resnet-50. Among other things, they observe experimentally that the linear separability of features increase monotonically along the depth of the model.

#### 2018

##### From Recognition to Cognition: Visual Commonsense Reasoning
• Visual understanding goes well beyond object recognition. With one glance at an image, they can effortlessly imagine the world beyond the pixels: for instance, they can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world.
• This paper by Zellers et al. from UWash in CVPR 2019 formalizes this task as Visual Commonsense Reasoning (VCR). Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
• Next, they introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
• To move towards cognition-level understanding, they present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and they provide analysis that suggests avenues for future work.
• Website with models/datasets.
##### Focal Loss for Dense Object Detection
• The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.
• This paper by Lin et al. from in 2017 investigates why this is the case and introduced focal loss. They discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.
• Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
• Their novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, they design and train a simple dense detector they call RetinaNet.
• Their results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
##### Relational inductive biases, deep learning, and graph networks
• Recent advances in AI, propelled by deep learning, have been transformative across many important domains. Despite this, a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.
• This paper by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in the table below from the paper.

• They argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and advocate for marrying complementary approaches which draw on ideas from human cognition, traditional computer science, standard engineering practice, and modern deep learning. Just as biology uses nature and nurture cooperatively, they reject the false choice between “hand-engineering” and “end-to-end” learning, and instead advocate for an approach which benefits from their complementary strengths.
• They investigate how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them.
• They explore flexible learning-based approaches which implement strong relational inductive biases to capitalize on explicitly structured representations and computations, and present a new building block for the AI toolkit – the graph neural networks (GNNs).
• GNNs generalize and extend various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. GNNs are designed to promote building complex architectures using customizable graph-to-graph building blocks, and their relational inductive biases promote support relational reasoning, combinatorial generalization, and improved sample efficiency over other standard machine learning building blocks. This would help lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.

#### 2019

##### Objects as Points
• This paper by Zhou et al. from UT Austin in 2019 proposes CenterNet, a center point-based object detection approach, which is end-to-end differentiable, simpler, faster, and more accurate than other competitive bounding box based detectors.
• CenterNet is an anchorless object detection architecture. As such, this structure has an important advantage in that it replaces the classical NMS (Non Maximum Suppression) step during post-processing. This mechanism enables faster inference.
• Where most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each, which is wasteful, inefficient, and requires additional post-processing, CenterNet models an object as a single point — the center point of its bounding box. CenterNet object detector builds on successful keypoint estimation networks and uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, depth and extent, and pose in a single forward pass. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection.
• Upon comparison with other state-of-the-art detectors in the COCO test-dev set. With multi-scale evaluation, CenterNet with Hourglass104 achieves an AP of 45.1%, outperforming all existing one-stage detectors. Sophisticated two-stage detectors are more accurate, but also slower.
##### RandAugment: Practical automated data augmentation with a reduced search space
• Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models.
• Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images.
• An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models.
• This paper by Cubuk et al. from Google Brain in 2019 demonstrates that previous methods of learned augmentation suffers from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size nor the model size leads to sub-optimal performance. In previous work, scaling learned data augmentation to larger dataset and models have been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data; population based augmentation was not reported for large-scale problems.
• They propose RangAugment, a simple parameterization for targeting augmentation to particular model and dataset sizes, which seeks to remove both of the aforementioned obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes.
• RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet without a separate search for data augmentation policies.
• The proposed method scales quite well to datasets such as ImageNet and COCO while incurring minimal computational cost (e.g. 2 hyperparameters), but notable predictive performance gains.
• On the ImageNet dataset, they achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO.
• Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
##### Semantic Image Synthesis with Spatially-Adaptive Normalization
• This paper by Park et al. from UC Berkeley, NVIDIA and MIT CSAIL proposes a spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers.
• They show that this is suboptimal as the normalization layers tend to “wash away” semantic information.
• To address the issue, they propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned affine transformation. The proposed normalization leads to the first semantic image synthesis model that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes.
• Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts.
• Finally, their model allows user control over both semantic and style and demonstrate its application for multi-modal and guided image synthesis.
• In the paper and the demo video, they showed GauGAN, an interactive app that generates realistic landscape images from the layout users draw. The model was trained on landscape images scraped from Flickr.com.
• Github repo; project page; online interactive demo of GauGAN; GauGAN360.

#### 2020

##### Denoising Diffusion Probabilistic Models
• This paper by Ho et al. from Pieter Abbeel’s lab at UC Berkeley presents high quality image samples using diffusion probabilistic models (also called diffusion models), a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.
• Their best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
• On the unconditional CIFAR10 dataset, they obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.
• Github repo.
##### Designing Network Design Spaces
• This paper by Radosavovic et al. from FAIR in CVPR 2020 presents a new network design paradigm. Their goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level.
• Their methodology explores the structural aspect of network design and arrives at a low-dimensional design space consisting of simple, regular networks that they call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function.
• They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes.
• Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
##### An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
• In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
• This paper by Dosovitskiy et al. from Google Brain in ICLR 2021 shows that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
• Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, they split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. They train the model on image classification in supervised fashion (as shown in the figure below).
• They introduce three ViT configurations (Base, Large, and Huge) in the form of two models: ViT-H/14 and ViT-L/16 (where the notation used is ViT-C/N, C is used to indicate the model size and N is the input patch size; for instance, ViT-L/16 means the “Large” variant with $$16 \times 16$$ input patch size).
• When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the proposed Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

##### Training data-efficient image transformers & distillation through attention
• Compared to CNNs, vision transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
• This paper by Touvron from Facebook AI and proposes DeiT, a competitive convolution-free transformer that does not require very large amount of data to be trained, thanks to improved training and in particular a novel distillation procedure. DeiT is trained on ImageNet on a single computer in less than 3 days. Their reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
• They introduce a teacher-student strategy specific to transformers. Using distillation can hamper the performance of neural networks. The student model pursues two different objectives that may be diverging: learning from a labeled dataset (strong supervision) and learning from the teacher. To alleviate this, they introduced a distillation token, which is a learned vector that flows through the network along with the transformed image data. The distillation token cues the model for its distillation output, which can differ from its class output. This new distillation method is specific to Transformers and further improves the image classification performance.
• It relies on a distillation token ensuring that the student learns from the teacher through attention. They show the interest of this token-based distillation, especially when using a ConvNet as a teacher. This leads us to report results competitive with CNNs for both ImageNet (where they obtain up to 85.2% top-1 accuracy) and when transferring to other tasks.
• Github repo.
##### NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
• This paper by Mildenhall et al. from UC Berkeley, Google and UCSD in ECCV 2020 introduces NeRF, a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
• Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
• They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
• Project page with videos and code.
##### Bootstrap your own latent: A new approach to self-supervised Learning
• This paper by Grill et al. from DeepMind and Imperial College in 2020 introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning.
• BYOL learns its representation by predicting previous versions of its outputs, without using negative pairs. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, they train the online network to predict the target network representation of the same image under a different augmented view. At the same time, they update the target network with a slow-moving average of the online network.
• While state-of-the art methods rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using a linear evaluation with a ResNet-50 architecture and 79.6% with a larger ResNet, using 30% fewer parameters.
• They show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.
• Nevertheless, BYOL remains dependent on existing sets of augmentations that are specific to vision applications. To generalize BYOL to other modalities, it is necessary to obtain similarly suitable augmentations for each of them. Designing such augmentations may require significant effort and expertise. Therefore, automating the search for these augmentations would be an important next step to generalize BYOL to other modalities.
• BYOL’s architecture is as shown below. BYOL minimizes a similarity loss between $$q_{\theta}\left(z_{\theta}\right)$$ and $$\operatorname{sg}\left(z_{\xi}^{\prime}\right)$$, where $$\theta$$ are the trained weights, $$\xi$$ are an exponential moving average of $$\theta$$ and $$sg$$ means stop-gradient. At the end of training, everything but $$f_{\theta}$$ is discarded, and $$y_{\theta}$$ is used as the image representation.

##### A Simple Framework for Contrastive Learning of Visual Representations
• This paper by Chen et al. from Google Research and Hinton’s lab in ICML 2020 presents SimCLR, a simple framework for contrastive learning of visual representations.
• They simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, they systematically study the major components of our framework and show the effects of different design choices.
• They show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
• By combining these findings, SimCLR is able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. SimCLR differs from standard supervised learning on ImageNet only in the choice of data augmentation, the use of a nonlinear head at the end of the network, and the loss function. The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.
• A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, SimCLR achieve 85.8% top-5 accuracy, outperforming AlexNet with 100x fewer labels.
• The following diagram shows the SimCLR framework. Two separate data augmentation operators are sampled from the same family of augmentations ($$t \sim \mathcal{T}$$ and $$t^{\prime} \sim \mathcal{T}$$) and applied to each data example to obtain two correlated views. A base encoder network $$f(\cdot)$$ and a projection head $$g(\cdot)$$ are trained to maximize agreement using a contrastive loss. After training is completed, they throw away the projection head $$g(\cdot)$$ and use encoder $$f(\cdot)$$ and representation $$\boldsymbol{h}$$ for downstream tasks.

##### Conditional Negative Sampling for Contrastive Learning of Visual Representations
• Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two views of an image. NCE uses randomly sampled negative examples to normalize the objective.
• This paper by Wu et al. from Stanford in 2020 shows that choosing difficult negatives, or those more similar to the current instance, can yield stronger representations. To do this, they introduce a family of mutual information estimators called Conditional Noise Contrastive Estimator (CNCE) that sample negatives conditionally – in a “ring” around each positive, by approximating the partition function using samples from a class of conditional distributions. They prove that these estimators lower-bound mutual information, with higher bias but lower variance than NCE.
• Applying these estimators as objectives in contrastive representation learning, shows that CNCE’s representations outperform existing approaches consistently across a spectrum of contrastive objectives, data distributions, and transfer tasks.
• Experimentally, CNCE applied on top of existing models (IR, CMC, and MoCo) improves accuracy by 2-5% points in each case, measured by linear evaluation on four standard image datasets. Moreover, they find continued benefits when transferring features to a variety of new image distributions from the meta-dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and keypoint detection.
##### Momentum Contrast for Unsupervised Visual Representation Learning
• This paper by He et al. from Facebook AI in CVPR 2020 presents Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, MoCo builds a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
• MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
• Momentum Contrast (MoCo) trains a visual representation encoder by matching an encoded query $$q$$ to a dictionary of encoded keys using a contrastive loss, as shown in the diagram below. The dictionary keys $$\left\{k_{0}, k_{1}, k_{2}, \ldots\right\}$$ are defined on-the-fly by a set of data samples. The dictionary is built as a queue, with the current mini-batch enqueued and the oldest mini-batch dequeued, decoupling it from the mini-batch size. The keys are encoded by a slowly progressing encoder, driven by a momentum update with the query encoder. This method enables a large and consistent dictionary for learning visual representations.

• The figure below shows the conceptual comparison of three contrastive loss mechanisms by illustrating one pair of query and key. The three mechanisms differ in how the keys are maintained and how the key encoder is updated. (a): The encoders for computing the query and key representations are updated end-to-end by back-propagation (the two encoders can be different). (b): The key representations are sampled from a memory bank. (c): MoCo encodes the new keys on-the-fly by a momentum-updated encoder, and maintains a queue (not illustrated in this figure) of keys.

##### Generative Pretraining from Pixels
• We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
• This paper by Chen et al. from OpenAI in ICML 2020 examines whether similar models can learn useful representations for images, inspired by progress in unsupervised representation learning for natural language.
• They train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure.
• Despite training on low-resolution ImageNet without labels, they find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, they achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models.
• An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
• OpenAI article.

#### 2021

##### Do Vision Transformers See Like Convolutional Neural Networks?
• Given the central role of convolutional neural networks in computer vision breakthroughs (leading to them being the de-facto model for visual data), it is remarkable that Transformer architectures (almost identical to those used in language) are capable of similar performance. For instance, recent work has shown that the Vision Transformer (ViT) model can achieve comparable or even superior performance on image classification tasks. This raises fundamental questions on whether these architectures work in the same way as CNNs: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations?
• This paper by Raghu et al. from Google Brain in 2021 analyzes the internal representation structure of ViTs and CNNs on image classification benchmarks, and finds striking differences in the features and internal structures between the two architectures, such as ViT having more uniform representations across all layers. They explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information (“earlier global features”), and ViT residual connections, which offer representation propagation of features from lower to higher layers, while also revealing that some CNN properties, e.g. local information aggregation at lower layers, are important to ViTs, being learned from scratch at scale.
• They also examine the potential for ViTs to be used beyond classification through a study of spatial localization, discovering ViTs successfully preserve input spatial information with CLS tokens —- promising for future uses in object detection.
• Finally, they investigate the effect of scale for transfer learning, finding larger ViT models develop significantly stronger intermediate representations through larger pretraining datasets. These results are also very pertinent to understanding recent architectures for vision such as the MLP-Mixer.
##### BEiT: BERT Pre-Training of Image Transformers
• This paper by Wei et al. from Microsoft Research in 2021 introduces a self-supervised pre-trained representation model called BEiT, which stands for Bidirectional Encoder representations from Image Transformers. Following BERT developed in the natural language processing area, they propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in their pre-training, i.e, image patches (such as 16x16 pixels) the embeddings of which are calculated as linear projections of flattened patches, and visual tokens (i.e., discrete tokens) which are . Before pre-training, they learn a discrete variational autoencoder (dVAE) which acts as an “image tokenizer” learnt via autoencoding-style reconstruction, where the input image is tokenized into discrete visual tokens obtained by the latent codes of the discrete VAE (the one proposed in VQGAN and reused by CLIP in Ramesh et al., 2021) according to the learned vocabulary.
• They show that the proposed method is critical to make BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers. They also present the intriguing property of automatically acquired knowledge about semantic regions, without using any human-annotated data.
• Similar to the masked language modeling pre-training task of BERT, BEiT randomly masks some image patches and feeds them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
• After pre-training BEiT, they directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
• Experimental results on image classification and semantic segmentation show that BEiT achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
• Code and pretrained models are here.

##### Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
• This paper by Liu et al. from Microsoft Research in 2021 presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision by producing a hierarchical feature representation and offers a linear computational complexity with respect to input image size. The key element of the Swin Transformer is the shifted window based self-attention.
• The Swin transformer aims to address the challenges in adapting Transformer from language to vision which arise due to differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, they propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
• This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including ImageNet image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as COCO object detection (58.7 box AP and 51.1 mask AP on COCO testdev) and ADE20K semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
• Code and pretrained models are here.
##### CvT: Introducing Convolutions to Vision Transformers
• This paper by Wu et al. from McGill and Microsoft in 2021 proposes the Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs for image recognition tasks.
• This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization).
• They validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs.
• In addition, performance gains are maintained when pretrained on larger datasets (for e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, the CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set.
• Furthermore, their results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, giving it a potential advantage for adaption to a wide range of vision tasks requiring variable input resolution. This is due to the built-in local context structure introduced by convolutions, CvT no longer requires a position embedding.
• CvTs thus introduce convolutions into the Vision Transformer architecture to merge the benefits of Transformers with the benefits of CNNs and demonstrate that the introduced convolutional token embedding and convolutional projection, along with the multi-stage design of the network enabled by convolutions, enable CvT to achieve superior performance while maintaining computational efficiency.
• Code and pretrained models are here.
##### RepVGG: Making VGG-style ConvNets Great Again
• This paper by Ding et al. from Tsinghua, HKUST and Aberystwyth University in 2021 presents Re-parameterization VGG (RepVGG), a simple but powerful architecture of convolutional neural network, which has a simple architecture with a stack of $$3 \times 3$$ Conv and ReLU during inference time, which is especially suitable for GPU and specialized inference chips, while the training-time model has a multi-branch topology.
• Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG.
• The figure below shows a sketch of RepVGG architecture. RepVGG has 5 stages and conducts down-sampling via stride-2 convolution at the beginning of a stage. Here, only the first 4 layers of a specific stage are shown. Inspired by ResNet, RepVGG also uses identity and $$1 \times 1$$ branches, but only for training.

• On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model.
• On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
• Github repo.
##### An Empirical Study of Training Self-Supervised Vision Transformers
• While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging.
• This paper by Chen et al. from Facebook AI in ICCV 2021 studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT).
• They go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. Their comparisons concern several aspects, including ViT vs. convolutional networks, supervised vs. self-supervised, and contrastive learning vs. masked auto-encoding.
• They observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. They reveal that these results are indeed partial failure, and they can be improved when training is made more stable.
• They introduce “MoCo v3”, a framework which offers an incremental improvement of MoCo v1/2, and strikes for a better balance between simplicity, accuracy, and scalability. The pseudocode of MoCo v3 is as below:

• They benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. They discuss the currently positive evidence as well as challenges and open questions.
##### Diffusion Models Beat GANs on Image Synthesis
• This paper by Dhariwal and Nichol from OpenAI in 2021 shows that diffusion models, a class of likelihood-based models with a stationary training objective, can achieve image sample quality superior to the current state-of-the-art generative models.
• They achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, they further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier.
• These guided diffusion models can reduce the sampling time gap between GANs and diffusion models, although diffusion models still require multiple forward passes during sampling. Finally, by combining guidance with upsampling, they can further improve sample quality on high-resolution conditional image synthesis.
• They achieve an FID of 2.97 on ImageNet $$128 \times 128$$, 4.59 on ImageNet $$256 \times 256$$, and 7.72 on ImageNet $$512 \times 512$$, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution.
• Finally, they find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet $$256 \times 256$$ and 3.85 on ImageNet $$512 \times 512$$.
• Github repo.
##### GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
• Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity.
• This paper by Nichol et al. from OpenAI in 2021 explores diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
• They find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking.
• Additionally, they find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
• Github repo.

#### 2022

##### A ConvNet for the 2020s
• This paper by FAIR and UC Berkeley seeks to refute the recent apparent superiority of Transformers by re-examining the design of ConvNets and testing their limitations. The proposed approach is based on gradually modifying a standard ResNet50, following design choices closely inspired by Vision Transformer, to propose a new family of pure ConvNets called ConvNeXt, which can perform as good as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks.
• The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
• However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
• In this paper, the authors reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
• The authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. They implement a series of design decisions starting with a ResNet50 trained with up-to-date techniques (extending the number of epochs, using AdamW optimizer, Stochastic Depth, Label Smoothing, and so on):
• Macro Design: The authors considered two aspects of Swin Transformers’ macro design. The first is the number of blocks in each stage (stage compute ratio), which was adjusted from (4, 4, 6, 3) to (3, 3, 9, 3), following the Swin Transformer ratio of (1:1:3:1). The second is the stem cell configuration, which in the original ResNet50 consisted of 7×7 convolutions with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer which utilizes 4×4 non-overlapping convolutions with stride 4. These modifications improved the accuracy to 79.5%.
• ResNeXt: In this part, the authors adopt two design choices of the popular ResNeXt: depthwise convolutions, which are interestingly similar to self-attention as they work on a per-channel basis, and a higher number of channels (from 64 to 96). These modifications improved the accuracy to 80.5%.
• Inverted Bottleneck: An essential configuration of Transformers is the expansion-compression rate in the MLP block (the hidden dimension is 4 times higher than the input and output dimension). This feature was reproduced by adding the inverted bottleneck design used in ConvNets (where the input is expanded using $$1 \times 1$$ convolutions and then shrunk through depthwise convolution and $$1 \times 1$$ convolutions). This modification slightly improved the accuracy to 80.6%.
• Large kernel sizes: The gold standard in ConvNet since the advent of VGG are 3×3 kernels. Small kernels lead to the famous local receptive field, which, compared to the global self-attention, has a more limited area of focus. Although Swin Transformers reintroduced the concept of local attention, their window size has always been at least $$7 \times 7$$. To explore larger kernels, the first thing is to move the depthwise convolution before the convolution, to reduce the number of channels before such an expensive operation. This first modification resulted in a temporary degradation to 79.9%, but, experimenting with different sizes, with a $$7 \times 7$$ window (higher values did not bring any alterations in the results), the authors were able to achieve an accuracy of 80.6% again.
• Micro Design: Finally, some micro design choices were added: GELU instead of ReLU, a single activation for each block (the original transformer module has just one activation after the MLP), fewer normalization layers, Batch Normalization substituted by Layer Normalization, and separate downsampling layer.
• These modifications improved the accuracy to 82.0% and defined the final model, named ConvNeXt.
• A comparison of this architecture with the Swin Transformer and ResNet is shown in the figure below.

• Based entirely on convolutions, this model competed on par with Transformer-based architectures, achieving a top-1 accuracy of 87.8% on ImageNet classification. Equally excellent results were obtained in other tasks, such as object detection and segmentation on COCO and semantic segmentation on ADE20K.
• The idea of modernizing ConvNets, adding all the concepts introduced over the past decade to a single model, is payback for convolutions, which have been ignored lately to the benefit of transformers. The authors suggest that ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. They believe the architecture choice should meet the needs of the task at hand while striving for simplicity and efficiency.
• Github repo.
##### Natural Language Descriptions of Deep Visual Features
• Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible?
• This paper by Hernandez et al. from MIT, Northeastern and Alleghency College in 2022 proposes MILAN, for mutual-information-guided linguistic annotation of neurons, that aims to generate open-ended, compositional, natural language descriptions of individual neurons in deep networks.
• Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models.
• They highlight three applications of natural language neuron descriptions.
• First, they use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
• Second, they use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
• Finally, they use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
##### Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
• Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks.
• This paper by Goyal et al. in 2022 from FAIR questions that if using this ability, they can learn any salient and more representative information present in diverse unbounded set of images from across the globe. To do so, they train models on billions of random images without any data pre-processing or prior assumptions about what they want the model to learn. This is a very large-scale experiment in which a RegNet architecture scaled to a dense 10 billion parameters (to avoid underfitting on a large data size) is pre-trained using the SwAV self-supervised method on a large collection of 1 billion randomly selected public images from Instagram with a diversity of gender, ethnicity, cultures, and locations (all outside the EU because of GDPR).
• They achieve state of the art results on a majority of 50 transfer tasks, including fairness, robustness to distribution shift, geographical diversity, fine-grained classification, image copy detection and many image classification datasets. The resulting model, not only captures well semantic information, it also captures information about artistic style and learns salient information such as geo-locations and multilingual word embeddings based on visual content only.
• The key takeaway is that large-scale self-supervised pre-training yields more robust, fair, less harmful, and less biased results than supervised models or models trained on object centric datasets such as ImageNet.
##### Block-NeRF: Scalable Large Scene Neural View Synthesis
• This paper by Tancik et al. from UC Berkeley, Waymo and Google Research in 2022 presents Block-NeRF, a variant of Neural Radiance Fields (NeRFs) that can reconstruct large-scale environments.
• They demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs that can be optimized independently. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.
• At such a scale, the data collected will necessarily have transient objects and variations in appearance, which they account for by modifying the underlying NeRF architecture to make NeRF robust to data captured over months under different environmental conditions. They add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined.
• They demonstrate the method’s efficacy by building an entire neighborhood in San Francisco from 2.8M images using a grid of Block-NeRFs, forming the largest neural scene representation to date.
##### VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
• Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation.
• This paper from Bardes et al. from FAIR and NYU in ICLR 2022 introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
• VICReg offers simple approach to self-supervised learning based on a triple objective: learning invariance to different views with a invariance term, avoiding collapse of the representations with a variance preservation term, and maximizing the information content of the representation with a covariance regularization term.
• VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks, but is not subject to the same limitations as most other methods, particularly because it does not require the embedding branches to be identical or even similar. In addition, they show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements.
##### Masked Autoencoders Are Scalable Vision Learners
• Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning. In this study, they observe on ImageNet and in transfer learning that an autoencoder —- a simple self-supervised method similar to techniques in NLP – provides scalable benefits. Self-supervised learning in vision may thus now be embarking on a similar trajectory as in NLP.
• This paper by He et al. from Facebook AI in 2022 shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
• Their MAE approach is simple: they mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs.
• First, they develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
• Second, they note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. The word (or subword) analog for images are pixels. But decomposing the image into patches (like ViT) reduces the quadratic computation cost of transformers compared to operating at the pixel level. However, ViT and its derived models are infamous for their data appetite and/or training slowness. Instead of attempting to remove objects, they remove random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables them to train large models efficiently and effectively: they accelerate training (by 3x or more) and improve accuracy.
• Like any autoencoder, you train and throw away the decoder and fine-tune the encoder for downstream tasks.
• Their scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
• Overall, they observe that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. They hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE.
• HuggingFace docs

##### The Effects of Regularization and Data Augmentation are Class Dependent
• Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model’s complexity. Current Deep Networks heavily rely on regularizers such as data augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters.
• This paper by Balestriero et al. from Facebook AI in 2022 demonstrates that regularization techniques such as DA or weight decay increases the average test performances at the cost of significant performance drops on some specific classes. In other words, regularization produces a model with a reduced complexity that is unfair across classes. By focusing on maximizing aggregate performance statistics they have produced learning mechanisms that can be potentially harmful, especially in transfer learning tasks. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performances on some classes, e.g., on ImageNet with a ResNet50, the “barn spider” classification test accuracy falls from 68% to 46% only by introducing random crop DA during training. Even more surprising, such performance drop also appears when introducing uninformative regularization techniques such as weight decay.
• Those results demonstrate that our search for ever increasing generalization performance – averaged over all classes and samples – has left us with models and regularizers that silently sacrifice performances on some classes. In fact, they also observe that varying the amount of regularization employed during pre-training of a specific dataset impacts the per-class performances of that pre-trained model on different downstream tasks e.g. an ImageNet pre-trained ResNet50 deployed on INaturalist sees its performances fall from 70% to 30% on a particular classwhen introducing random crop DA during the Imagenet pre-training phase. Those results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
• Here’s an intuitive explanation:
• Some types of data augmentation and weight decay helps some categories but hurts others.
• Categories largely identifiable by color or texture (for e.g., yellow bird, textured mushroom) are unaffected by aggressive cropping, while categories identifiable by shape (for e.g., corkscrew) see a performance degradation with aggressive cropping that only contains part of the object.
• Conversely, color jitter does not affect shape or texture-based categories (for e.g., zebra), but affects color-based categories (for e.g., basket ball).
##### Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
• Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. Moreover, many graphics problems rely on task specific data structures to exploit the sparsity or smoothness of the problem at hand.
• This paper by Muller et al. from Nvidia in 2022 proposes InstantNeRF which reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations. InstantNeRF offers near-instant training of neural graphics primitives on a single GPU for multiple tasks.
• To this end, a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Multi-resolution hash encoding provides a practical learning-based alternative that automatically focuses on relevant detail, independent of task at hand. Its low overhead allows it to be used even in time-constrained settings like online training and inference.
• In a gigapixel image, they represent an image by a neural network. SDF learns a signed distance function in 3D space whose zero level-set represents a 2D surface. NeRF uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. Lastly, neural volume learns a denoised radiance and density field directly from a volumetric path tracer. In all tasks, our encoding and its efficient implementation provide clear benefits: instant training, high quality, and simplicity. Their encoding is task-agnostic: they use the same implementation and hyperparameters across all tasks and only vary the hash table size which trades off quality and performance.
• The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. In the context of neural network input encodings, it is a drop-in replacement, for example speeding up NeRF by several orders of magnitude and matching the performance of concurrent non-neural 3D reconstruction techniques.
• They leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations.
• While slow computational processes in any setting, from lightmap baking to the training of neural networks, can lead to frustrating workflows due to long iteration times, they achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080. They have demonstrated that single-GPU training times measured in seconds are within reach for many graphics applications, allowing neural approaches to be applied where previously they may have been discounted.
• Github repo.
##### Pix2seq: A Language Modeling Framework for Object Detection
• This paper by Chen et al. from Google Brain in ICLR 2022 presents Pix2Seq, a simple yet generic framework for object detection. This paper introduces Pix2Seq, a simple yet generic framework for object detection. By casting object detection as a language modeling task conditioned on the observed pixel inputs, Pix2Seq largely simplifies the detection pipeline, removing most of the specialization in modern detection algorithms.
• Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and they train a neural network to perceive the image and generate the desired sequence.
• Pix2Seq is based mainly on the intuition that if a neural network knows about where and what the objects are, they just need to teach it how to read them out.
• Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
• Pix2Seq can be extended beyond object detection to solving a large variety of vision tasks where the output can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering).
• A major limitation of Pix2Seq is that autoregressive modeling is expensive for long sequences (mainly during model inference). Practical measures to mitigate the issue includes: 1) stop inference when the ending token is produced (e.g., in COCO dataset, there are, in average, 7 objects per image, leading to a relatively small number of ∼35 tokens), 2) applying it to offline inference, or online scenarios where the objects of interest are relatively sparse (for e.g., locate a specific object with language description).
• However, future work is needed to make it faster for real-time object detection applications. Another limitation is that the current approach for training Pix2Seq is entirely based on human annotation, and by reducing such dependence and letting the model train using unlabeled data in an unsupervised fashion, they can enable far more applications in the vision domain.
##### An Improved One millisecond Mobile Backbone
• Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device.
• This paper by Vasu et al. from Apple in 2022 performs extensive analysis of different metrics by deploying several mobile friendly networks on a mobile device. They identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks.
• To this end, they design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. They show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
• A MobileOne block has two different structures at train time and test time, inspired from RepVGG: Making VGG-style ConvNets Great Again. Left: Train time MobileOne block with reparameterizable branches. Right: MobileOne block at inference where the branches are reparameterized. Either ReLU or SE-ReLU is used as activation. The trivial over-parameterization factor $$k$$ is a hyperparameter which is tuned for every variant.

• Their best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. MobileOne obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, they show that our model generalizes to multiple tasks – image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
##### Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
• This paper by Saharia et al. from Google Brain in 2022 presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen showcases the effectiveness of frozen large pretrained language models as text encoders for the text-to-image generation using diffusion models.
• Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. With these novel components, Imagen produces $$1024 \times 1024$$ samples with unprecedented photorealism and alignment with text.
• Their key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
• Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.
• To assess text-to-image models in greater depth, they introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, they compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
• Google page with an overview of the results.
##### Swin Transformer V2: Scaling Up Capacity and Resolution
• Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings.
• This paper by Liu et al. from Microsoft Research in 2022 explores large-scale models in computer vision. THey tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images.
• Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to $$1,536 \times 1,536$$ resolution.
• By scaling up capacity and resolution, Swin V2 sets new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also, note our training is much more efficient than that in Google’s billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.
• The diagram below from the paper presents the techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to $$1,536 \times 1,536$$ resolution, including the res-post-norm and scaled cosine attention to make the model easier to be scaled up in capacity, as well a log-spaced continuous relative position bias approach which lets the model more effectively transferred across window resolutions.

##### Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
• This paper by Yu et al. from Google Research in 2022 presents the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. In particular, Parti is able to represent a broad range of visual world knowledge, such as landmarks, specific years, makes and models of vehicles, pottery types, visual styles – and integrate these into novel settings and configurations.
• Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes.
• Their approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
• Second, they achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO.
• Their detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects.
• They also provide an extensive discussion of the limitations, including a breakdown of many kinds of model errors and challenges, that we hope will be useful both for contextualizing what the model can do and for highlighting opportunities for future research.
• Parti opens up opportunities to integrate scaled autoregressive models with diffusion models, starting with having an autoregressive model generate an initial low-resolution image and then iteratively refining and super-resolving images with diffusion modules. Furthermore, the authors suggest conducting more experiments and comparisons with both autoregressive and diffusion models in order to understand their relative capabilities, to address key questions of fairness and bias in both classes of models and strategies for mitigating them, and to identify optimal opportunities for combining their strengths.
• Key takeaways:
• One of the most exciting research fields nowadays is text-to-image modeling. OpenAI’s DALL-E 2 and Google’s Imagen are phenomenal models in this area. Both used a Transformer to encode the text and use diffusion models to generate the image. Google’s Parti, consists solely of (really big) Transformer modules:
• Text encoder: as with previous works, encoding the text with a Transformer is a no-brainer.
• Image tokenizer and de-tokenizer: instead of generating the entire image, Parti will generate one patch at a time. A ViT-based module is used to encode and decode those patches.
• Conditional decoder: conditioned on the encoded text and the tokenized image patches generated so far, a Transformer is used to generate the next patch (with the help of the de-tokenizer from the previous step).
• Github repo.

##### Sequencer: Deep LSTM for Image Classification
• In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision.
• This paper by Tatsunami and Taki from Rikkyo, Japan in NeurIPS 2022 proposes Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers.
• They also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K.
• Of note is the fact that the overall data appetite and time to converge was reported to be much better than the ViT and cousins since CNNs and LSTMs have great sample efficiency. Not only that, the paper shows that it has good transferability and the robust resolution adaptability on double resolution-band.

##### High-Resolution Image Synthesis with Latent Diffusion Models
• The following paper summary has been contributed by Zhibo Zhang.
• Diffusion models are known to be computationally expensive given that they require many steps of diffusion and denoising diffusion operations in possibly high-dimensional input feature spaces.
• High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. from Ludwig Maximilian University of Munich & IWR, Heidelberg University and Runway ML in CVPR 2022 introduces diffusion models that operate on the latent space, aiming at generating high-resolution images with lower computation demands compared to those that operate directly on the pixel space.
• In particular, the authors adopted an autoencoder that compresses the input images into a lower dimensional latent space. The autoencoder relies on either KL regularization or VQ regularization to constrain the variance of the latent space.
• As shown in the illustration figure below by Rombach et al., in the latent space, the latent representation of the input image goes through a total of $$T$$ diffusion operations to get the noisy representation. A U-Net is then applied on top of the noisy representation for $$T$$ iterations to produce the denoised version of the representation. In addition, the authors introduced a cross attention mechanism to condition the denoising process on other types of inputs such as text and semantic maps.
• In the final stage, the denoised representation will be mapped back to the pixel space using the decoder to get the synthesized image.
• Empirically, the best performing latent diffusion model (with a carefully chosen downsampling factor) achieved competitive FID scores in image generation when comparing with a few other state-of-the-art generative models such as variations of generative adversarial nets on a few datasets including the CelebA-HQ dataset.
• Github repo

##### Make-A-Video: Text-to-Video Generation without Text-Video Data
• This paper by Singer et al. from Meta AI proposes Make-A-Video – an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
• Their intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models. They design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
• First, they decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, they design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V.
• In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.

### NLP

#### 1997

##### Long Short-Term Memory
• Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.
• This paper from Hochreiter and Schmidhuber in Neural Computation 1997 briefly reviews Hochreiter’s (1991) analysis of this problem, then addresses it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.
• LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations.
• In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

#### 2003

##### A Neural Probabilistic Language Model
• This paper by Bengio from the University of Montreal in 2003 revolutionized statistical language modeling by replacing “tables of conditional probabilities” (n-gram language models) with more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables.
• The traditional technique of learning the joint probability function of sequences of words in a language was intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
• They propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential/combinatorial number of semantically neighboring sentences, which forms the main reason for the spectacular improvements the proposed approach offers. The model learns simultaneously (i) a distributed representation for each word along with (ii) the probability function for word sequences, expressed in terms of these representations.
• Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.
• They report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.

#### 2010

##### Recurrent neural network based language model
• This paper by Mikolov et al. from Khudanpur’s lab at JHU in Interspeech 2010, was the first to propose using a recurrent neural network-based language model (RNN LM) with applications to speech recognition.
• The results indicate that it is possible to obtain around 50% reduction of perplexity (PPL) by using a mixture of several RNN LMs, compared to a state of the art backoff language model. Speech recognition experiments show around 18% reduction of word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, and around 5% on the much harder NIST RT05 task, and 12% even when the backoff model is trained on 5 times more data than the RNN model. For NIST RT05, they can conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models, which are trained on hundreds times more data.
• They provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, except their high computational (training) complexity. Recurrent neural networks outperformed significantly state of the art backoff models in all of the experiments, most notably even in case when backoff models were trained on much more data than RNN LMs.
• The paper seeks to break the myth that language modeling is just about counting n-grams, and that the only reasonable way how to improve results is by acquiring new training data.

#### 2011

##### Generating Text with Recurrent Neural Networks
• Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems.
• This paper by Sutskever et al. from UofT in ICML 2011 demonstrates the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so they introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next.
• Having applied a modestly-sized standard RNN architecture to the character-level language modeling problem (where the target output at each time step is defined as the the input character at the next time-step), they found the performance somewhat unsatisfactory, and that while increasing the dimensionality of the hidden state did help, the per-parameter gain in test performance was not sufficient to allow the method to be both practical and competitive with state-of-the-art approaches. They address this problem by proposing a new temporal architecture called the Multiplicative RNN (MRNN) which they argue is better suited to the language modeling task.
• Modeling language at the character level seems unnecessarily difficult. This is because morphemes are the appropriate units for making semantic and syntactic predictions and as such, converting large databases into sequences of morphemes, however, is non-trivial compared with treating them as character strings. Also, learning which character strings make words is a relatively easy task compared with discovering the subtleties of semantic and syntactic structure. So, given a powerful learning system like an MRNN, the convenience of using characters may outweigh the extra work of having to learn the words. Their experiments show that an MRNN finds it very easy to learn words. With the exception of proper names, the generated text contains very few non-words. At the same time, the MRNN also assigns probability to (and occasionally generates) plausible words that do not appear in the training set (e.g., “cryptoliation”, “homosomalist”, or “un-ameliary”). This is a desirable property which enabled the MRNN to gracefully deal with real words that it nonetheless didn’t see in the training set. Predicting the next word by making a sequence of character predictions avoids having to use a huge softmax over all known words and this is so advantageous that some word-level language models actually make up binary “spellings” of words so that they can predict them one bit at a time (Mnih & Hinton, 2009).
• MRNNs already learn surprisingly good language models using only 1500 hidden units, and unlike other approaches such as the sequence memoizer and PAQ, they are easy to extend along various dimensions. If much bigger MRNNs could be trained with millions of units and billions of connections, it is possible that brute force alone would be sufficient to achieve an even higher standard of performance. But this will of course require considerably more computational power.
• After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, they were able to surpass the performance of the best previous single method for character level language modeling – a hierarchical nonparametric sequence model. At this point, this represents the largest recurrent neural network application to date.

#### 2013

##### Efficient Estimation of Word Representations in Vector Space
• “You shall know a word by the company it keeps” — J. R. Firth.
• This paper by Mikolov et al. from Google in 2013 proposes word2vec which comprises of two novel model architectures for computing continuous vector representations of words from very large data sets. They studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks involving word similarity, and the results are compared to the previously best performing techniques based on different types of neural networks. They observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set.
• They propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The Continuous Bag-of-Words (CBOW) model architecture predicts the current word based on the context, while the skip-gram model predicts surrounding/context words given the current word.
• They observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.
• Furthermore, they show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
• Word2vec popularized the “King – Man + Woman = Queen” analogy.
• Overall, two important learnings from Word2Vec were:
• Embeddings of semantically similar words are close in cosine similarity.
• Word embeddings support intuitive arithmetic properties. (An important consequence of this statement is that phrase embeddings can be obtained as the sum of word embeddings.)
##### Distributed Representations of Words and Phrases and their Compositionality
• This paper by Mikolov et al. from Google in NeurIPS 2013 builds on their other paper Efficient Estimation of Word Representations in Vector Space which proposed the Skip-gram model as an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. They present several extensions that improve both the quality of the vectors and the training speed.
• They describe a simple alternative to the hierarchical softmax called negative sampling, packaged as Skipgram with Negative Sampling (SGNS). Negative sampling is an extremely simple training method that learns accurate representations especially for frequent words. Furthermore, they propose subsampling of frequent words which is shown to to yield both faster training and significantly better representations of uncommon words.
• An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, they present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
• The techniques introduced in this paper can be used also for training the continuous bag-of-words model introduced in Efficient Estimation of Word Representations in Vector Space.
• Owing to the training optimizations proposed in this paper, successfully trained models on several orders of magnitude more data than the previously published models, thanks to the computationally efficient model architecture. This results in a great improvement in the quality of the learned word and phrase representations, especially for the rare entities.
• The choice of the training algorithm and the hyper-parameter selection is a task specific decision, as different problems have different optimal hyperparameter configurations. In our experiments, the most crucial decisions that affect the performance are the choice of the model architecture, the size of the vectors, the subsampling rate, and the size of the training window.
• A very interesting result of this work is that the word vectors can be somewhat meaningfully combined using just simple vector addition.
• Another approach for learning representations of phrases presented in this paper is to simply represent the phrases with a single token. Combination of these two approaches gives a powerful yet simple way how to represent longer pieces of text, while having minimal computational complexity. Our work can thus be seen as complementary to the existing approaches that attempt to represent phrases using recursive matrix-vector operations.

#### 2014

##### On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
• This paper by Cho from Bengio’s lab in Universite de Montreal in 2014 first introduced Gated Recurrent Units (GRUs).
• Neural machine translation is a relatively new approach to statistical machine translation based purely on neural networks in which models often consist of an encoder and a decoder. The encoder extracts a fixed-length representation from a variable-length input sentence, and the decoder generates a correct translation from this representation.
• The paper focuses on analyzing the properties of the neural machine translation using two types of neural networks that are able to process variable-length sequences (and differ in the choice of the encoder): (i) an recurrent neural network with gated hidden units, and (ii) the newly proposed gated recursive convolutional neural network. They show that the neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase.
• Furthermore, they find that the proposed gated recursive convolutional network learns a grammatical structure of a sentence automatically.
##### GloVe: Global Vectors for Word Representation
• Word2vec relies only on local information of language. That is, the semantics learnt for a given word, is only affected by the surrounding words.
• This paper by Pennington et al. from Stanford in EMNLP 2014 proposed Global Vectors (GloVe), an unsupervised learning algorithm which captures both global statistics and local statistics of a corpus, in order to train word vectors. Training is performed on aggregated global word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
• Contemporary methods focused considerable attention on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. Currently, prediction-based models garner substantial support; for example, Baroni et al. (2014) argue that these models perform better across a range of tasks. They argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous.
• After Tomas Mikolov et al. released word2vec, there was a boom of papers about word vector representations. GloVe was one such proposal, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurence matrices. Note that GloVe does not use neural networks while word2vec does.
• They construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
##### Sequence to Sequence Learning with Neural Networks
• This paper by Sutskever et al. from Google in 2014 introduced seq2seq encoder-decoder learning to map sequences to sequences, a task that simple Deep Neural Networks (DNNs) cannot be used to accomplish.
• They present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Their method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. They show that a large deep LSTM with a limited vocabulary can outperform a standard statistical machine translation (SMT)-based system whose vocabulary is unlimited on a large-scale MT task. The success of their simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
• Their main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When they used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice.
• They also find that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
##### Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
• This paper by Cho et al. from Bengio’s lab in EMNLP 2014 introduced the seq2seq encoder-decoder model for neural machine translation. They propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN) that is together able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length. The encoder RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.
• The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence.
• The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
• Along with the new architecture, they propose a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence.
• They evaluated the proposed model with the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, they were able to show that the new model is able to capture linguistic regularities in the phrase pairs well and also that the RNN Encoder–Decoder is able to propose well-formed target phrases.
• The scores by the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. Also, they found that the contribution by the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that they can improve further the performance by using, for instance, the RNN Encoder–Decoder and the neural net language model together.
• Qualitative analysis of the the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases at multiple levels, i.e., at the word level as well as phrase level. This suggests that there may be more natural language related applications that may benefit from the proposed RNN Encoder–Decoder.

#### 2015

##### Neural Machine Translation by Jointly Learning to Align and Translate
• This paper by Bahdanau et al. from Bengio’s lab in ICLR 2015 borrowed the attention mechanism from the field of information retrieval and introduced it within the context of NLP (commonly called Bahdanau attention or additive attention in the field).
##### Effective Approaches to Attention-based Neural Machine Translation
• Neural Machine Translation by Jointly Learning to Align and Translate proposed an attention mechanism to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
• This paper by Luong et al. in EMNLP 2015 from Manning’s group at Stanford explores useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.
• They demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, they achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout.
• Their ensemble model using different attention architectures has established a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.

#### 2016

##### Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
• Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential.
• This paper by Wu et al. from Google in 2016 presents GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Their model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections.
• To improve parallelism and therefore decrease training time, their attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, they employ low-precision arithmetic during inference computations.
• To improve handling of rare words, they divide words into a limited set of common sub-word units (“wordpieces”) for both input and output. This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system.
• Their beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence.
• On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.

#### 2017

##### Attention Is All You Need
• This paper by Vaswani et al. from Google in NeurIPS 2017 introduced Transformers (that are based on scaled dot-product multi-headed attention) which are prevalent in most NLP and CV areas today.

#### 2018

##### Deep contextualized word representations
• This paper by Peters et al. from Allen AI and UWash in NAACL 2018 introduced LSTM-based Embeddings from Language Models (ELMo), an approach for learning high-quality deep context-dependent/context-sensitive word representations/embeddings from biLMs.
• These deep contextualized word representations model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy).
• ELMo’s word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis. They also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
• Through ablations and other controlled experiments, they have confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance, enabling ELMo to show large improvements on a broad range of NLP tasks.
##### Improving Language Understanding by Generative Pre-Training
• Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
• This paper by Radford et al. from OpenAI in 2018 introduces a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning and demonstrates large gains on the aforementioned NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
• In contrast to previous approaches, they make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
• By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets and thus outperforming discriminatively trained models that use architectures specifically crafted for each task. For instance, they achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
• Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Their work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach.

#### 2019

##### BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
• This paper by Devlin et al. from Google in ACL 2019 proposed BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language representation model which proposed pre-training bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is pre-trained using two unsupervised tasks: (i) masked language modeling (MLM) and, (ii) next sentence prediction (NSP).
• MLM is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.
• NSP is needed because many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, they pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
• Fine-tuning for the task at hand involves using an additional output layer, to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
• BERT comes in two flavors: (i) BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; (ii) BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
• BERT consumes a max of 512 input tokens. At its output, word embeddings for BERT (what is called BERT-base) have 768 dimensions.
• BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
• BERT demonstrated that unsupervised pretraining is an integral part of many language understanding systems and enables even low-resource tasks to benefit from them.
• Google Blog’s article that discusses using BERT for improving search relevance and ranking.
• Also, here’s a brief timeline of NLP models from Bag of Words to the Transformer family from Fabio Chiusano:

##### RoBERTa: A Robustly Optimized BERT Pretraining Approach
• Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, while hyperparameter choices have significant impact on the final results.
• This paper by Liu et al. from University of Washington and Facebook AI in 2019 carefully evaluates a number of design decisions when pretraining BERT models.
• They present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. They find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. They find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
• Their improved pretraining procedure, which they call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results highlight the importance of previously overlooked design choices, and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
• Note that RoBERTa uses only the masked language model objective (and does not train using the next sentence prediction objective), and achieves better results than the original BERT.
##### BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
• This paper by Lewis et al. from Facebook AI in 2019 presented BART, a denoising autoencoder for pretraining sequence-to-sequence models that learns to map corrupted documents to the original. BART is trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
• They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token.
• Background: With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.

• With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation. However words can only condition on leftward context, so it cannot learn bidirectional interactions.

• BART applies noising schemes to an input document and thus corrupts it by replacing spans of text with mask symbols. In the diagram below, the corrupted document (left) is encoded with a bidirectional model, and then the likelihood of the original document (right) is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and they use representations from the final hidden state of the decoder. The advantage of using this scheme is that inputs to the encoder need not be aligned with decoder outputs, allowing arbitary noise transformations.

• BART is particularly effective when finetuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
• BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining.
• BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
##### DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
• As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
• This paper by Sanh et al. from Huggingface in the Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019 introduced a language representation model, DistilBERT which is a general-purpose pre-trained version of BERT. DistilBERT is 40% smaller, 60% faster, cheaper to pre-train, and retains 97% of the language understanding capabilities. DistilBERT can be fine-tuned with good performances on a wide range of tasks much like its larger counterparts.
• While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that DistilBERT is a compelling option for edge applications.
• To leverage the inductive biases learned by larger models during pretraining, they introduce a triple loss combining language modeling, distillation and cosine-distance losses.
• The following graph shows the parameter counts of several recently released pretrained language models:

##### Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
• Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
• This paper by Dai et al. from CMU and Google Brain in 2019 proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence.
• Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
• Transformer-XL not only enables capturing longer-term dependency than RNNs and vanilla Transformers, achieves substantial speedup during evaluation, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
• They improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
##### XLNet: Generalized Autoregressive Pretraining for Language Understanding
• With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
• This paper by Yang et al. from CMU and Google in 2019 proposes XLNet considering BERT’s aforementioned pros and cons, and offers a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (thereby proposing a new objective called Permutation Language Modeling), and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Put simply, XLNet is a generalized autoregressive pretraining method that uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoder methods.
• Furthermore, the neural architecture of XLNet is developed to work seamlessly with the autoregressive objective, including integrating ideas from Transformer-XL, the state-of-the-art autoregressive model and the careful design of the two-stream attention mechanism. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
• Github repo.
##### Adaptive Input Representations for Neural Language Modeling
• This paper by Baevski and Auli from Facebook AI in 2019 introduces adaptive input representations by varying the size of input word embeddings for neural language modeling. Adaptive input embeddings can improve accuracy while drastically reducing the number of model parameters.
• There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units.
• They perform a systematic comparison of popular choices for a self-attentional architecture.
• Their experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters.
• On the WIKITEXT-103 benchmark, they achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the BILLION WORD benchmark, they achieve 23.02 perplexity.
##### Attention Interpretability Across NLP Tasks
• This paper by Vashishth et al. from IISc and Google in 2019 seeks to empirically prove the hypothesis that attention weights are interpretable and are correlated with feature importance measures, However, this holds only for cases when attention weights are essential for model’s prediction.
• Some works (Jain & Wallace, 2019; Vig & Belinkov, 2019) have demonstrated that attention weights are not interpretable, and altering them does not affect the model output while several others have shown that attention captures several linguistic notions in the model. They extend the analysis of prior works to diverse NLP tasks and demonstrate that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for model’s prediction and cannot simply be reduced to a gating unit. This paper takes a balanced approach – rather than taking a black and white approach – they draw on previous literature that raised issues with the fact “attentions are indicative of model predictions” and show “when is attention interpretable and when it is not”.
• The attention layer in a neural network model provides insights into the model’s reasoning behind its prediction, which are usually criticized for being opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights. Amid such confusion arises the need to understand attention mechanism more systematically. The paper attempts to fill this gap by giving a comprehensive explanation which justifies both kinds of observations (i.e., when is attention interpretable and when it is not). Through a series of experiments on diverse NLP tasks, they validate their observations and reinforce the claim of interpretability of attention through manual evaluation.
• They find that in both single and pair sequence tasks, the attention weights in samples with original weights do make sense in general. However, in the former case, the attention mechanism learns to give higher weights to tokens relevant to both kinds of sentiment. They show that attention weights in single sequence tasks do not provide a reason for the prediction, which in the case of pairwise tasks, attention do reflect the reasoning behind model output.
• Unrelated to the paper: To use attention visualization as a proxy for interpreting your predictions, use the BertViz library. The lib supports multiple views and supports a plethora of models (BERT, GPT-2, XLNet, RoBERTa, XLM, ALBERT, DistilBERT, BART etc.). The BertViz repo has some nice examples to get started.

• This paper by Selvaraju et al. from Parikh/Batra’s team at GATech in 2019 proposes a technique for producing ‘visual explanations’ for decisions from a large class of CNN-based models, making them more transparent and explainable.
• Their approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
• Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training.
• They combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
• In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias.
• For image captioning and VQA, their visualizations show that even non-attention based models learn to localize discriminative regions of input image.
• They devise a way to identify important neurons through GradCAM and combine it with neuron names to provide textual explanations for model decisions.
• Finally, they design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
• Github repo; CloudCV demo.
##### Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
• This paper by Artetxe and Schwenk from University of the Basque Country and FAIR introduces an architecture to learn joint multilingual sentence representations, called LASER (Language-Agnostic SEntence Representations), for 93 languages, belonging to more than 30 different families and written in 28 different scripts. The work focuses on universal language agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task. The motivations for such representations are multiple: the hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (typically English) to another, and the possibility to handle code-switching. To that end, they train a single encoder to handle multiple languages, so that semantically similar sentences in different languages are close in the embedding space.
• Their system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables them to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification.
• Their experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of their approach.
• They also introduce a new test set of aligned sentences in 112 languages, and show that their sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
• Github repo with the pretrained encoder and multilingual test set.
##### GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
• For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset.
• This paper by Wang et al. from NYU, UWash, and Deepmin in ICLR 2019 introduces the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. They further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models.
• They evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

#### 2020

##### Language Models are Few-Shot Learners
• Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.
• This paper by Brown et al. from OpenAI in 2020 introduces Generative Pretrained Transformer (GPT)-3 and shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
• Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
• GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
• At the same time, they also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
• They also present broader societal impacts of their findings and of GPT-3 in general.
##### Longformer: The Long-Document Transformer
• Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
• This paper by Beltagy et al. from Allen AI in 2020 seeks to address this limitation, by introducing the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
• Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
• Following prior work on long-sequence transformers, they evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8.
• In contrast to most prior work, they also pretrain Longformer and finetune it on a variety of downstream tasks.
• Their pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. They finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
##### Big Bird: Transformers for Longer Sequences
• The primary limitation of Transformer-based models is the quadratic complexity (mainly in terms of memory, but also computation) on the sequence length due to their full attention mechanism. BigBird by Zaheer et al. from Google, published in NeurIPS 2020, remedied this by proposing a sparse attention mechanism that reduces this quadratic complexity to linear.
##### Beyond Accuracy: Behavioral Testing of NLP models with CheckList
• Although measuring held-out test-set accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Further, ML systems can run to completion without throwing any errors (indicating functional correctness) but can still produce incorrect outputs (indicating behavioral issues). Thus, it is important to test the behavioral aspects of your model to make sure it works as you expected.
• This paper by Ribeiro et al. from Microsoft, UW and UCI in 2020 introduces CheckList, a model-agnostic and task-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList tests individual capabilities of the model using three different test types.
• Checklist includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. They illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models.
• Tests created with CheckList can be applied to any model, making it easy to incorporate in current benchmarks or evaluation pipelines. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model that has “solved” existing benchmarks on three different tasks. They incorporated three distinct types of tests:
• Minimum Functionality Test (MFT): A Minimum Functionality Test (MFT) uses simple examples to make sure the model can perform a specific task well. For example, they might want to test the performance of a sentiment model when dealing with negations.
• Invariance Test: Besides testing the functionality of a model, they might also want to test if the model prediction stays the same when trivial parts of inputs are slightly perturbed. These tests are called Invariance Tests (IV).
• Directional Expectation Test: In the Invariance Test, they expect the outputs after the perturbation to be the same. However, sometimes they might expect the output after perturbation to change. That is when Directional Expectation Tests comes in handy. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
• Github repo.
##### The Curious Case of Neural Text Degeneration
• Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
• This paper by Holztman et al. from Choi’s lab at UWash in ICLR 2020 provided a deep analysis into the properties of the most common decoding methods for open-ended language generation. It reveals surprising distributional differences between human text and machine text.
• In addition, they find that decoding strategies alone can dramatically effect the quality of machine text, even when generated from exactly the same neural language model. They show that likelihood maximizing decoding causes repetition and overly generic language usage, while sampling methods without truncation risk sampling from the low-confidence tail of a model’s predicted distribution. Their findings motivate Nucleus (or top-p) Sampling, a simple but effective method that captures the region of confidence of language models effectively to draw the best out of neural generation.
• By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
##### ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
• Pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
• This paper by Clark et al. from Manning’s lab at Stanford proposes a more sample-efficient pre-training alternative task called replaced token detection, a new self-supervised task for language representation learning compared to BERT’s masked language modeling (MLM). Instead of masking the input, their approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the key idea is training a discriminative text encoder model to distinguish input tokens from high-quality negative samples produced by an small generator network.
• Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
• As a result, compared to MLM, their pre-training objective is more compute-efficient and results in better performance on downstream tasks. The contextual representations learned by their approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
• The gains are particularly strong for small models; for example, they train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Their approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
• Since ELECTRA works well even when using relatively small amounts of compute, the authors hope this will make developing and applying pre-trained text encoders more accessible to researchers and practitioners with less access to computing resources.
##### TinyBERT: Distilling BERT for Natural Language Understanding
• Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices.
• This paper by Jiao et al. from Huazhong University of Science and Technology, Wuhan National Lab for Optoelectronics, and Huawei Noah’s Ark Lab in EMNLP 2020 propose a novel Transformer distillation method to accelerate inference and reduce model size while maintaining accuracy, that is specially designed for knowledge distillation (KD) of the Transformer-based models. They also propose a two-stage framework for TinyBERT.
• By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be effectively transferred to a small student Tiny-BERT.
• Then, they introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture he general-domain as well as the task-specific knowledge in BERT.
• TinyBERT with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERTBASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference.
• Extensive experiments show that TinyBERT achieves competitive performances meanwhile significantly reducing the model size and inference time of BERTBASE, which provides an effective way to deploy BERT-based NLP models on edge devices. Specifically, TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERTBASE.
• Github repo.
##### MPNet: Masked and Permuted Pre-training for Language Understanding
• BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning.
• This paper by Song et al. from Nanjing University and Microsoft Research in NeurIPS 2020 proposes MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
• MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet).
• They pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.
• Github repo with code and pre-trained models.

#### 2021

##### BinaryBERT: Pushing the Limit of BERT Quantization
• The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper,
• This paper by Bai et al. from CUHK and Huawei Noah’s Ark Lab in 2021 proposes BinaryBERT, which pushes BERT quantization to the limit by weight binarization.
• They find that a binary BERT is hard to be trained directly than a ternary counterpart due to its steep and complex loss landscape. Therefore, they propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network, followed by fine-tuning for further refinement.
• The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting.
• Their approach also supports adaptive splitting that can tailor the size of BinaryBERT based on the edge device constraints.
• Empirical results show that BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
##### Towards Zero-Label Language Learning
• This paper by Wang et al. from Google in 2021 explores “zero-label” learning in NLP, whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. They show that language models (LMs) are also few-shot generators or example creators (rather than just few-shot learners as in the GPT-3 paper) in that they can be used to generate high-quality synthetic data in a fully unsupervised manner. In other words, their propose that labelled-data generation is easy with prompting, LMs are great few-shot data generators, and that classic fine-tuning » zero/few shot prompting.
• At the core of their framework is a novel approach for better leveraging the powerful pretrained LMs. Specifically, inspired by the recent success of few-shot inference on GPT-3, they present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations.
• Their method enables zero-label learning as they train task-specific models solely on the synthetic data, yet they achieve better or comparable results from strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, our approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
• The paper illustrates a promising direction for future transfer learning research in NLP.
• Key takeaways:
• Old idea (from OpenAI’s GPT3 paper):
• Treat LMs as few-shot learners.
• Create prompts with <sample, label> pair(s).
• Ask the model to infer the label for a new
• The emphasis is on the inference.
• New idea (from Google’s zero-label paper):
• Treat LMs as few-shot generators (rather than few-shot learners).
• Create prompts with <sample, label> pair(s).
• Ask the model to generate more for the same label.
• The emphasis is on the labelled data generation (rather than inference).
• Learnings:
• Old idea created a new wave of prompt programming, i.e. no need for conventional task specific fine-tuning.
• However, prompting can solve only lower-order tasks, for e.g., classification, NLI. Even with lower-order tasks it is not practical because you cannot build a human-in-the-loop system to continually improve the model.
• The new idea is about generating more data and going with conventional route.
• This paper confirms all the above by introducing UDG using LMs, even for complex higher-order tasks and empirically shows classical fine-tuning with more data works better.
• The diagram below from Prithvi Da summarizes the proposed approach.

##### Improving Language Models by Retrieving from Trillions of Tokens
• This paper from Borgeaud et al. from DeepMind in 2021 proposes Retrieval-Enhanced Transformer (RETRO) which enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. With a 2 trillion token database, RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters.
• After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
• On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. They also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.
• RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline pre-trained transformer models can be rapidly fine-tuned (“RETROfit with retrieval”) to obtain nearly the same performance as if trained from scratch.
• They demonstrates at an unprecedented scale that improving semi-parametric language models through explicit memory can provide an orthogonal, more efficient approach than raw parameter scaling as they seek to build more powerful language models.
• Related: The Illustrated Retrieval Transformer by Jay Alammar.
##### WebGPT: Browser-assisted question-answering with human feedback
• This paper by Nakano et al. from OpenAI in 2021 proposes WebGPT, which is a fine-tuned version of GPT-3 to more accurately answer open-ended questions using a text-based web browser. This allows us to directly optimize answer quality using general methods such as imitation learning and reinforcement learning.
• Their prototype copies how humans research answers to questions online —- it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy.
• By setting up the task so that it can be performed by humans, they are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
• They train and evaluate their models on ELI5, a dataset of questions asked by Reddit users. Their best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit. While their best model outperforms humans on ELI5, but still struggles with out-of-distribution questions.

#### 2022

##### Formal Mathematics Statement Curriculum Learning
• This paper by Polu et al. from OpenAI in 2022 proposes a neural theorem prover using GPT-f that can successfully solve a curriculum of increasingly difficult problems out of a set of formal statements of sufficiently varied difficulty, including many high-school Math Olympiad problems. The prover uses a language model to find proofs of formal statements.
• They explore the use of expert iteration in the context of language modeling applied to formal mathematics. They show that at same compute budget, expert iteration, by which they mean proof search interleaved with learning, dramatically outperforms proof search only. They also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.
• Finally, by applying this expert iteration to a manually curated set of problem statements, they achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
• Their results suggest that the lack of self-play in the formal mathematics setup can be effectively compensated for by automatically as well as manually curated sets of formal statements, which are much cheaper to formalize than full proofs. The statement curriculum learning methodology presented in this work can help accelerate progress in automated reasoning, especially if scaled with automated generation and curation of formal statements in the future.
##### Survey of Hallucination in Natural Language Generation
• While natural language generation (NLG) has improved exponentially in recent years thanks to the development of deep learning technologies such as Transformer-based language models, large language models (LLMs) -based NLG often produces false statements that are disconnected from reality because such models are not grounded in reality. Such generation includes hallucinated texts, which makes the performances of text generation fail to meet users’ expectations in many real-world scenarios owing to the lack of commonsense built from experiencing the real world.
• This paper by Ji et al. from Pascale Fung’s group at Hong Kong University of Science and Technology in 2022 reviews studies in evaluation and mitigation methods of hallucinations that have been presented in various tasks.
• They provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation.
##### Transformer Quality in Linear Time
• This paper by Hua et al. form Cornell University and Google Brain in 2022 revisits the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences by presenting FLASH - a novel efficient modification of Transformer architecture. This is achieved by designing a performant layer (gated linear unit) and by combining it with an accelerator-efficient approximation strategy (mixed chunk attention).
• Existing efficient attention methods often cause significant quality drop compared to full self-attention. At the same time they might be difficult to implement to fully leverage hardware accelerators. The authors introduce GAU (gated attention unit; a generalization of GLU - gated linear unit) that allows for better and more efficient approximation of multi-head attention than many other efficient attention methods by using a weaker single-head attention with minimal quality loss.
• Next, complementary to this new layer, they propose mixed chunk attention - an efficient linear approximation method that combines the benefits from partial and linear attention mechanisms, which is accelerator-friendly and highly competitive in quality. The method works on chunks of tokens and leverages local (within chunk) and global (between chunks) attention spans.
• The resulting model, named FLASH, when deployed on bidirectional and auto-regressive language modeling tasks, outperforms three baselines: vanilla Transformer, Performer and Combiner in terms of quality and efficiency. FLASH matches the quality (perplexity) of fully-augmented Transformers over both short (512) and long (8K) context lengths, while being substantially faster to train than the state-of-the-art - achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling. The differences are particularly pronounced for larger context sizes (4096-8192).
##### Chain of Thought Prompting Elicits Reasoning in Large Language Models
• Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as arithmetic reasoning, math word problems, symbolic manipulation, and commonsense reasoning.
• This paper by Wei et al. from Google in 2022 explores the ability of language models to generate a coherent chain of thought – a series of short sentences that mimic the reasoning process a person might have when responding to a question. The idea is strikingly simple: instead of being terse while prompting show the model a few examples of a multi-step reasoning process (the like of which a human would use). Couple this with LLMs (the larger the better) and magic happens! Check out the below image.

• They have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. The superb results you can elucidate via this method are an emergent property of model scale (surprise surprise) - bigger models benefit more from this, and the conclusion holds across models (LaMDA, GPT, PaLM).
• Interestingly enough, the more complex the task of interest is (in the sense of requiring multi-step reasoning approach), the bigger the boost from the chain of thought prompting!
• In order to make sure that the performance boost comes from this multi-step approach and not simply because of e.g. more compute, the authors have done a couple of ablations: (i) outputting a terse equation instead of a multi-step reasoning description, (ii) outputting the answer and only then the chain of thought, etc. None of these experiments yielded good results.
• The method also proved to be fairly robust (always outperforms standard prompting) to the choice of exact few shot exemplars. Despite different annotators, different styles, etc. the method is always better than standard prompting.
• Through experiments on arithmetic, symbolic, and commonsense reasoning, they find that chain of thought processing is an emergent property of model scale that can be induced via prompting and can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
##### PaLM: Scaling Language Modeling with Pathways
• This paper by Chowdhery et al. from Google in 2022 introduces Pathways Language Model (PaLM), a single 540 billion parameter dense Transformer language model, trained on 780B tokens of high-quality, diverse text, that generalizes across domains and tasks while being highly efficient. PaLM pushes the boundaries of scale for few-shot language understanding and generation.
• Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application.
• To further our understanding of the impact of scale on few-shot learning, they trained a 540-billion parameter, densely activated, Transformer language model, which they call Pathways Language Model PaLM. They trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. They demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
• On a number of these tasks, PaLM 540B achieves breakthrough few-shot performance on language, reasoning, and code tasks, achieving state-of-the-art results on 28 out of the 29 most widely evaluated English NLP tasks when compared to the best finetuned per-task result from any previous large language model. Their evaluation suite consists of multi-step reasoning tasks, and comparisons to average human performance on the recently released BIG-bench benchmark.
• Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference. Their few-shot results match or exceed the finetuned state of the art across a number of different arithmetic and commonsense reasoning tasks. The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting, where the model is explicitly prompted to generate a natural language logical inference chain before making its prediction. They present a number of intriguing examples where PaLM was able to write explicit logical inference chains to both explain jokes and answer complex questions about scenarios. On BIG-bench, a recently developed benchmark containing 150+ challenging new language tasks, PaLM 5-shot achieves higher performance than the average performance score of humans who were asked to complete the same tasks. Additional state-of-the-art performance is demonstrated on source code understanding/generation, multilingual NLP, and machine translation.
• From these results, they draw a number of conclusions.
• First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued. When they compare results from PaLM 540B to our own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that they have not yet reached the apex point of the scaling curve. However, a number of BIG-bench benchmarks showed discontinuous improvements from model scale, improvements are actually discontinuous, meaning that the improvements from 8B to 62B are very modest, but then steeply increase when scaling to 540B. This suggests that certain capabilities of language models only emerge when trained at sufficient scale, and there are additional capabilities that could emerge from future generations of models.
• Second, the breakthrough performance on reasoning tasks has critical implications. It is obvious that a model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction. However, these results go far beyond that, demonstrating that prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves. In other words, the model’s generation (rather than just understanding) capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
• Finally, although they achieved their goal of further pushing the boundaries of scale for few-shot language modeling, there are still many open questions about the ideal network architecture and training scheme for future generations of models. PaLM is only the first step in our vision towards establishing Pathways as the future of ML scaling at Google and beyond. To that end, they chose to demonstrate this scaling capability on a well-studied, well-established recipe: a dense, decoder-only, full-attention Transformer model, which is trained to perform autoregressive language modeling. However, our wider goal is to explore a diverse array of novel architectural choices and training schemes, and combine the most promising systems with the extreme scaling capabilities of Pathways.
• They believe that PaLM demonstrates a strong foundation in their ultimate goal of developing a large-scale, modularized system that will have broad generalization capabilities across multiple modalities.
• They additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale.
• Finally, they discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

##### Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
• When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models.
• This paper by Lu et al. from in 2022 demonstrates that that few-shot prompts suffer from order sensitivity, in that for the same prompt the order in which samples are provided can make the difference between state-of-the-art and random performance – essentially some permutations are “fantastic” and some not.
• They analyze this phenomenon in detail, establishing that the problem is prevalent across tasks, model sizes (even for the largest current models), prompt templates, it is not related to a specific subset of samples, number of training samples, and that a given good permutation for one model is not transferable to another.
• While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, to alleviate this problem, they introduce a novel probing method that exploits the generative nature of language models to construct an artificial development set. They identity performant permutations for prompts using entropy-based statistics over this set, which yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
##### Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
• Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that they understand the present and near-future capabilities and limitations of language models.
• This paper by Srivastava et al. from Google in 2022 addresses this challenge by introducing the Beyond the Imitation Game benchmark (BIG-bench), a benchmark that can measure progress well beyond the current state-of-the-art. BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions.
• Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. They evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters.
• In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
• Github repo.
##### Training Compute-Optimal Large Language Models
• Previous work in training LLMs offered a heuristic that given a 10x increase in computational budget, model size should increase 5.5x, and the number of tokens should only increase 1.8x.
• This paper by Hoffman et al. from DeepMind in 2022 challenges that assumption and shows that model and data size should increase in accordance! Thus collecting high-quality datasets will play a key role in further scaling of LLMs. They investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
• They find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.
• By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
• They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more more data.
• Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
• This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
##### Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change)
• The recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Along with natural language abilities, there has been a significant interest in understanding whether such models, trained on enormous amounts of data, exhibit reasoning capabilities. Hence there has been interest in developing benchmarks for various reasoning tasks and the preliminary results from testing LLMs over such benchmarks seem mostly positive. However, the current benchmarks are relatively simplistic and the performance over these benchmarks cannot be used as an evidence to support, many a times outlandish, claims being made about LLMs’ reasoning capabilities. As of right now, these benchmarks only represent a very limited set of simple reasoning tasks and they need to look at more sophisticated reasoning problems if they are to measure the true limits of such LLM-based systems.
• This paper by Valmeekam et al. from ASU in 2022 proposes an extensible assessment framework motivated by the above gaps in current benchmarks to test the abilities of LLMs on a central aspect of human intelligence, which is reasoning about actions and change.
• They provide multiple test cases that are more involved than any of the previously established reasoning benchmarks and each test case evaluates a certain aspect of reasoning about actions and change. Their initial results on even on simple common-sense planning tasks the base version of GPT-3 (Davinci) seems to display a dismal performance.
##### OPT: Open Pre-trained Transformer Language Models
• Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study.
• This paper by Zhang et al. from Facebook AI introduces Open Pre-trained Transformers (OPT), a collection of auto-regressive/decoder-only pre-trained transformer-based language models ranging in size from 125M to 175B parameters, which they aim to fully and responsibly share with interested researchers.
• Their goal is to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency.
• They show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. They also release their logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
• They believe that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.
• Github repo.
##### Diffusion-LM Improves Controllable Text Generation
• Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure).
• This paper by Li et al. from Stanford in 2022 seeks to address this challenge, and develops a novel non-autoregressive language model based on continuous diffusions called Diffusion-LM, which enables new forms of complex fine-grained control tasks.
• Diffusion-LM is a substantial departure from the current paradigm of discrete autoregressive generation.
• Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks.
• They demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work by almost doubling the control success rate of prior methods and is competitive with baseline fine-tuning methods that require additional training.
##### DeepPERF: A Deep Learning-Based Approach For Improving Software Performance
• Performance bugs may not cause system failure and may depend on user input, so detecting them can be challenging. They also tend to be harder to fix than non-performance bugs.
• In recent years, a variety of performance bug detection approaches have emerged to help developers identify performance issues. However, a majority of existing performance bug detection approaches focus on specific types of performance problems and rely on expert-written algorithms or pre-defined set of rules to detect and fix issues. Building rule-based analyzers is a non-trivial task, as it requires achieving the right balance between precision and recall. Once developed, maintaining these rules can also be costly.
• Transformer-based approaches have been shown to achieve state-of-the-art performance, not only in various NLP problems, but also in a variety of software engineering tasks such as code-completion, documentation generation, unit test generation, bug detection, etc. In this paper, the authors present an approach called DeepPERF that uses a large transformer model to suggest changes at application source code level to improve its performance. The authors first pretrain the model using masked language modelling (MLM) tasks on English text and source code taken from open source repositories on GitHub, followed by finetuning on millions of performance commits made by .NET developers.
• This paper by Garg et al. from Microsoft in 2022 shows that their approach is able to recommend patches to provide a wide-range of performance optimizations in C# applications. Most suggested changes involve modifications to high-level constructs like API/Data Structure usages or other algorithmic changes, often spanning multiple methods, which cannot be optimized away automatically by the C# compiler and could, therefore, lead to slow-downs on the user’s side.

Their evaluation shows that the model can generate the same performance improvement suggestion as the developer fix in ∼53% of the cases, getting ∼34% of them verbatim in their expert-verified dataset of performance changes made by C# developers. Additionally, the authors evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that the model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations.

##### No Language Left Behind: Scaling Human-Centered Machine Translation
• Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages.
• This paper by Costa-jussà et al. from Meta AI in 2022 explores what it takes to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind. In No Language Left Behind, they take on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers.
• Furthermore, they created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, they developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages.
• They propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, they evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. - Their model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.
• Facebook AI article; Github repo
• They tackle three major tasks:
• Automatic dataset construction for low-resource languages: They’ve solved this by investing in a teacher-student training procedure, making it possible to 1) extend LASER’s language coverage to 200 languages, and 2) produce a massive amount of data, even for low resource languages.
• Specifically, to scale one model to hundreds of languages, as the first step, they built an appropriate data set. Meta created an initial model able to detect languages automatically, which they call their language identification system.
• It then uses another language model based on Transformers to find sentence pairs for all the scrapped data. These two models are only used to build the 200 paired-languages datasets they need to train the final language translation model, NLLB200.
• Modeling 200 languages: They’ve developed a Sparse Mixture-of-Experts model that has a shared and specialized capacity, so low-resource languages without much data can be automatically routed to the shared capacity. When combined with better regularization systems, this avoids overfitting. Further, we used self-supervised learning and large-scale data augmentation through multiple types of back translation.
• Specifically, the multi-language translation model is a Transformer based encoder-decoder architecture. This implies NLLB200 takes a text sentence, encodes it and then decodes it to produce a new text sentence, a translated version of the input.
• What’s new is the modifications they’ve done to the model to scale up to so many different languages instead of being limited to one. The first modification is adding a variable identifying the source language of the input, taken from the language detector we just discussed. This will help the encoder do a better job for the current input language. Then, we do the same thing with the decoder giving it which language to translate to. Note that this conditioned encoding scheme is very similar to CLIP, which encodes images and text similarly. Here, in ideal conditions, it will encode a sentence similarly whatever the language.
• They use Sparsely Gated Mixture of Experts models to achieve a more optimal trade-off between cross-lingual transfer and interference and improve performance for low-resource languages. Sparsely Gated Mixture of Experts are basically regular models but only activate a subset of model parameters per input instead of involving most if not all parameters every time. You can easily see how this is the perfect kind of model for this application. The Mixture of Experts is simply an extra step added in the Transformer architecture for both the encoder and decoder, replacing the feed-forward network sublayer with $$N$$ feed-forward networks, each with input and output projections, and the Transformer model automatically learns which subnetwork to use for each language during training.
• Evaluating translation quality: They’ve extended 2x the coverage of FLORES, a human-translated evaluation benchmark, to now cover 200 languages. Through automatic metrics and human evaluation support, we’re able to extensively quantify the quality of our translations.
##### Efficient Few-Shot Learning Without Prompts
• Recent few-shot methods, such as parameter-efficient fine-tuning (PEFT) and pattern exploiting training (PET), have achieved impressive results in label-scarce settings. However, they are difficult to employ since they are subject to high variability from manually crafted prompts, and typically require billion-parameter language models to achieve high accuracy.
• This paper by Tunstall et al. from Hugging Face, cohere, TU Darmstadt, and Intel Labs in 2022 addresses these shortcomings by proposing SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of text pairs, in a contrastive Siamese manner.
• The resulting model is then used to generate rich text embeddings, which are used to train a classification head. Compared to other few-shot learning methods, SetFit has several unique features:
• No prompts or verbalisers: Current techniques for few-shot fine-tuning require handcrafted prompts or verbalisers to convert examples into a format that’s suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
• Fast to train: SetFit doesn’t require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
• Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
• Achieves high accuracy: SetFit achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples. This is accomplished with orders of magnitude less parameters than existing techniques.
• Their experiments show that SetFit obtains comparable results with PEFT and PET techniques, while being an order of magnitude faster to train. We also show that SetFit can be applied in multilingual settings by simply switching the ST body.
• Github repo.

##### Large language models are different
• The following paper summary has been contributed by Zhibo Zhang.
• Large language models have shown promising capability in various natural language tasks in recent years. This presentation by Wei from Google Brain in 2022 covers some of the recent works for large language models.
• The motivation behind large language models is clear: It is ideal to have pre-trained models that can easily generalize to different downstream tasks rather than training a new model for each different task that will require a new dataset. In addition, pre-trained large language models only require a few labeled examples to learn from when it comes to a new task.
• Training a large language model and doing inference on it typically contains the following components:
• Pre-train a language model on a massive amount of data. The model size nowadays is huge, such as GPT-3 which contains 175 billion parameters (Brown et al., 2020) and PaLM which contains 540 billion parameters (Chowdhery et al., 2022). An important property of large language models is the emergent ability. That is, the performance of the model grows from near-random to well above-random after the model size reaches a certain threshold (Wei et al., 2022).
• Perform in-context learning with a few examples. This step is typically done through promoting techniques, where a few example natural language tasks are provided in the form of input-label pairs, and the machine is expected to generalize the learning outcome to predict the label for an unseen input. Notice that the term “learning” here does not involve any optimization step of the model parameters.
• Researchers have been trying to understand the property of prompting. In particular, Zhao et al., 2020 discusses three major biases introduced by the natural language prompts during in-context learning:
• The majority label bias: the predictions largely depend on the majority label in the prompts.
• The recency bias: the labels near the end of the prompts affect the predictions more.
• Common token bias: the predictions are more likely to be high frequency words in the n-gram model.
• The authors of the paper proposed to use affine transformation to calibrate the probabilistic output of the model for each specific prediction, named contextual calibration.
• Min et al., 2022 pointed out that whether the demonstration prompts have the correct labels or not does not significantly affect the prediction. The input text distribution, the label space and the input-label pairing format have a larger impact on the predictions.
• The speaker also mentioned other prompting techniques, such as chain-of-thought prompting (Wei et al., 2022).
• In addition to prompting, Wei et al., 2021 shows that fine tuning language models on different datasets through instructions can improve the model performance when there are no demonstrations given for downstream tasks.
##### Solving Quantitative Reasoning Problems with Language Models
• The following paper summary has been contributed by Zhibo Zhang.
• Solving Quantitative Reasoning Problems with Language Models by Lewkowycz et al. from Google Research in NeurIPS 2022 introduces Minerva, a language model based on PaLM (Chowdhery et al., 2022) to solve quantitative reasoning problems.
• Specifically, the authors used the pre-trained PaLM models with 8 billion, 62 billion and 540 billion parameters accordingly and fine-tuned them on the technical training dataset that is composed of web pages of mathematical content, arXiv papers and general natural language data.
• At the inference stage, the authors utilized the following techniques to boost the performance of the model:
• Selecting the most common answer based on a total of $$k$$ sampled solutions.
• Prompting the model with 4 examples when evaluating on the MATH dataset (Hendrycks et al., 2021) and with 5 examples when evaluating on the STEM (science, technology, engineering and mathematics) subset of the MMLU dataset (Hendrycks et al., 2021).
• Chain-of-thought prompting when evaluating on the GSM8k dataset (Cobbe et al., 2021) and the subset of the MMLU dataset.
• Empirically, under the same model scale, Minerva consistently outperformed the PaLM model on the evaluation datasets according to the paper. In addition, Minerva with 62 billion parameters and 540 billion parameters outperformed both OpenAI davinci-002 and published state-of-the-art on the MATH dataset.
• Through additional validation, the authors concluded that there is little evidence that memorization contributes to the model performance.
• The following paper summary has been contributed by Zhibo Zhang.
• AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning by Yang et al. from Sun Yat-sen University and Meta AI in NeurIPS 2022 proposes AD-DROP, an attribution-based dropout mechanism for self-attention modules.
• The authors propose to generate attribution scores based on existing input gradient explanation methods. In particular, the attribution scores are generated for the attention map of each attention head with respect to the output logit of the Transformer for a particular class.
• Following the above attribution methods, the authors empirically observed that dropping neurons with low attribution scores will lead to a larger degree of overfitting compared to random dropping, and dropping neurons with high attribution scores increases training loss but alleviates the overfitting problem.
• Based on the above empirical finding, the authors proposed AD-DROP, as indicated in the illustration figure: the attribution matrices are generated for the self-attention maps based on the logits from the forward pass. The mask matrices (that contain information about which position to drop) are then produced relying on the attribution scores and sampling. As a last step, an element-wise addition operation between the mask matrices and the original self-attention maps is done to produce the masked self-attention maps, which are then used to perform the forward propagation.
• In addition, the authors proposed a cross-tuning algorithm to alternatively perform optimization without dropout (at odd number epochs) and optimization with AD-DROP (at even number epochs) during the training process.
• The authors conducted experiments on eight tasks of the GLUE benchmark (Wang et al., 2019) using BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models as the base, observing that AD-DROP had the best average performance compared to several other regularization methods.

##### Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
• The following paper summary has been contributed by Zhibo Zhang.
• In-Context Learning has been an effective strategy in adapting a pre-trained large language model to a new task by showing the model with a few input-label pairs through prompts. How In-Context Learning works has been an active research topic that people try to understand.
• This paper by Dai et al. from Peking University, Tsinghua University and Microsoft Research proposes that In-Context Learning can be understood as performing implicit fine-tuning on the Transformer models.
• In particular, Aizerman et al. and Irie et al. pointed out that linear attention is a dual form of linear layers with gradient descent optimization.
• Based on the above finding and through relaxing the standard attention into linear attention, the authors demonstrate that it is possible to express the attention outcome as a linear expression of any new input query, where the weight matrix can be decomposed into two parts: the part based on the pre-trained model and the updates of the former part due to prompt demonstrations.
• Empirically, the authors compared between the models generated by fine-tuning and In-Context Learning accordingly on 6 datasets, observing similarities between the two in terms of prediction capability, updates of the attention output (where the pre-trained model is used as a baseline when calculating the updates) as well as attention maps.

### Speech

#### 2006

##### Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
• Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited.
• This paper by Graves et al. from Schmidhuber’s lab presents a novel method for for temporal classification with RNNs to label unsegmented sequences directly, thereby solving both aforementioned problems. Their method fits naturally into the existing framework of neural network classifiers, and is derived from the same probabilistic principles. It obviates the need for pre-segmented data, and allows the network to be trained directly for sequence labelling.
• An experiment on a real-world temporal classification problem with the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN without requiring any task-specific knowledge.

#### 2010

##### Front-end factor analysis for speaker verification
• This paper by Dehak et al. from JHU in IEEE/ACM Transactions on Audio, Speech, and Language Processing 2010 proposes a non-deep learning method that users Joint Factor Analysis (JFA) as a feature extractor to learn a low-dimensional speaker representation for speaker verification, which is also used to model session and channel effects/variabilities.
• In this new space, a given speech utterance is represented by a new vector named total factors (called the identity-vector or the “i-vector”). The i-vector is thus a feature that represents the characteristics of the frame-level features’ distributive pattern. i-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It’s extracted in a similar manner with the eigenvoice adaptation scheme or the JFA technique, but is extracted per sentence (or input speech sample).
• Two speaker verification systems are proposed which use this new representation. The first system is a Support-Vector-Machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. In this scoring, they removed the SVM from the decision process. One important characteristic of this approach is that there is no speaker enrollment, unlike in other approaches like SVM and JFA, which makes the decision process faster and less complex.
• They achieved an EER of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. They also obtained 4% absolute EER improvement for both-gender trials on the 10sec-10sec condition compared to the classical joint factor analysis scoring.
• Up until d-vectors, the state-of-the-art speaker verification systems were based on the concept of i-vectors (which use Probabilistic Linear Discriminant Analysis (PLDA) as a classifier to make the final decision).

#### 2012

##### Sequence Transduction with Recurrent Neural Networks
• Many machine learning tasks can be expressed as the transformation or transduction of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating.
• Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging.
• This paper by Graves in the 2012 ICML Workshop on Representation Learning introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence.
• Experimental results for phoneme recognition are provided on the TIMIT speech corpus.
• Slides.

#### 2014

##### Towards End-To-End Speech Recognition with Recurrent Neural Networks
• This paper by Graves and Jaitly in PMLR in 2014 presents a character-level speech recognition system that directly transcribes audio data with text using a recurrent neural network with minimal preprocessing, without requiring an intermediate phonetic representation.
• The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and a modified Connectionist Temporal Classification (CTC) objective function that allows a direct optimization of the word error rate, even in the absence of a lexicon or language model. Further, they show how to integrate the network outputs with a language model during decoding.
• The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7% and achieves state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition.
##### Deep neural networks for small footprint text-dependent speaker verification
• This paper by Variani et al. from JHU, Google, and Biometric Recognition Group in 2014 investigates the use of deep neural networks (DNNs) to train speaker embeddings for a small footprint text-dependent speaker verification task. The DNN architecture is shown in the figure below.
• During model training, the DNN takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and generates the one-hot speaker label (or the speaker probability) to classify speakers at the frame-level.
• During speaker enrollment, the trained DNN is used to extract speaker-specific features/embeddings by averaging the activations from the last hidden layer (called deep-vectors or “d-vectors” for short), which is taken as the speaker model.
• During speaker evaluation, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision by calculating the cosine distance between the test d-vector and the claimed speaker’s d-vector, similar to the i-vector framework. A verification decision is made by comparing the distance to a threshold.
• Experimental results show the DNN based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN based system is more robust to additive noise and outperforms the i-vector system at low False Rejection operating points. The combined system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.
• Experimental results show the d-vectors are more robust to additive noise and outperforms i-vectors at low False Rejection operating points. The combined (d+i)-vector system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions respectively.
• Note that unlike the i-vector framework, this doesn’t have any assumptions about the feature’s distribution (the i-vector framework assumes that the i-vector has a Gaussian distribution).

#### 2015

##### Listen, Attend and Spell
• This paper by Chan et al. from CMU and Google in 2015 presents Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.
• LAS is based on the sequence-to-sequence framework, is trained end-to-end and has two main components: a listener (encoder) and a speller (decoder). The listener is a pyramidal RNN encoder that accepts filter bank spectra as inputs, transforms the input sequence into a high level feature representation and reduces the number of timesteps that the decoder has to attend to. The speller is an attention-based RNN decoder that attends to the high level features and spells out the transcript one character at a time.
• The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. They bypass the conditional independence assumptions of CTC, and show how they can learn an implicit language model that can generate multiple spelling variants given the same acoustics. In other words, producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models.
• To further improve the results, they used samples from the softmax classifier in the decoder as inputs to the next step prediction during training. Finally, they show how a language model trained on additional text can be used to rerank their top hypotheses.
• On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.

#### 2017

##### CNN Architectures for Large-Scale Audio Classification
• This paper by Hershey et al. from Google in ICASSP 2017 presents VGGish by applying various state-of-the-art image networks with CNN architectures to audio and show that they are capable of excellent results on audio classification when compared to a simple fully connected network or earlier image classification architectures.
• They examine fully connected deep neural networks such as AlexNet, VGG, InceptionNet, and ResNet. The input audio is divided into non-overlapping 960 ms frames which are decomposed by applying the Fourier transform, resulting in a spectrogram. The spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed. Finally, this gives log-mel spectrogram patches that are passed on as input to all classifiers. They explore the effects of training with different sized subsets of the 70M training videos (5.24 million hours) with 30,871 labels.
• While their dataset contains video-level labels, they are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet. They find that a model for AED with embeddings learned from these classifiers does much better than raw features on the Audio Set AED classification task.
• They find that derivatives of image classification networks do well on the audio classification task, that increasing the number of labels they train on provides some improved performance over subsets of labels, that performance of models improves as they increase training set size, and that a model using embeddings learned from the video-level task do much better than a baseline on the AudioSet classification task.

#### 2018

##### X-Vectors: Robust DNN Embeddings for Speaker Recognition
• This paper by Synder et al. from JHU in ICASSP 2018 uses data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition.
• The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings called x-vectors.
• While prior studies have found that embeddings leverage large-scale training datasets better than i-vectors, it can be challenging to collect substantial quantities of labeled data for training. They use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness.
• Their data augmentation strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. They use the simulated RIRs described by Ko et al. and the reverberation itself is performed with the multicondition training tools in the Kaldi ASpIRE recipe. For additive noise, they use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres and 60 hours of speech from twelve languages
• A PLDA classifier is used in the x-vector framework to make the final decision, similar to i-vector systems.
• The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese where they achieve superior performance on the evaluation datasets.
##### WaveGlow: A Flow-based Generative Network for Speech Synthesis
• This paper by Prenger et al. from NVIDIA in 2018 proposes WaveGlow, a flow-based network capable of generating high quality speech from mel-spectrograms.
• WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
• Their PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
##### Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
• This paper by Shen et al. from Google in 2018 describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms.
• Their model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.
• To validate their design choices, they present ablation studies of key components of their system and evaluate the impact of using mel-spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features.
• They further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
• PyTorch hub

#### 2019

##### wav2vec: Unsupervised Pre-training for Speech Recognition
• Reducing the need for manually annotated data is important for developing systems that understand non-English languages, particularly those with limited existing training sets of transcribed speech.
• This paper by Schneider from Facebook AI in 2019 introduces wav2vec, the first application of unsupervised pre-training to speech recognition using a fully convolutional model that learns representations of raw, unlabeled audio.
• Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. They pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.
• Wav2vec trains models to learn the difference between original speech examples and modified versions, often repeating this task hundreds of times for each second of audio, and predicting the correct audio milliseconds into the future.
• This self-supervised approach beats traditional ASR systems that rely solely on transcribed audio. Their experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Their approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2 (Amodei et al., 2016), the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
• They show that more data for pre-training improves performance and that this approach not only improves resource-poor setups, but also settings where all WSJ training data is used.
##### SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
• This paper by Park et al. from Google in 2019 presents SpecAugment, a simple data augmentation method for speech recognition.
• SpecAugment greatly improves the performance of ASR networks. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. They apply SpecAugment on Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.
• They achieve state-of-the-art performance on the LibriSpeech 960h and Swichboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple handcrafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and they are able to gain performance by using bigger networks and training longer. On LibriSpeech, they achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
##### Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition
• Recently, speaker embeddings extracted from a speaker discriminative deep neural network (DNN) yield better performance than the conventional methods such as i-vector. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks.
• This paper by Xiang et al. from Shanghai Jiao Tong and AISpeech in Interspeech 2019 addresses this issue, with three different margin-based losses which not only separate classes but also demand a fixed margin between classes are introduced to deep speaker embedding learning.
• Angular softmax loss (denoted by A-Softmax loss),
• Additive margin softmax loss (denoted by AMSoftmax loss), and
• Additive angular margin loss (denoted by AAM-Softmax loss).
• They find that the margin plays a vital role in learning discriminative embeddings and leads to a significant performance boost.
• Experiments are conducted on two public text independent tasks: VoxCeleb1 and Speaker in The Wild (SITW).
• The proposed approach can achieve the state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks when compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on VoxCeleb1 test set and 2.761% EER on SITW core-core test set, respectively.

#### 2020

##### Conformer: Convolution-augmented Transformer for Speech Recognition
• Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.
• This paper by Gulati et al. from Google in Interspeech 2020 achieves the best of both worlds by integrating components from both CNNs and Transformers for end-to-end speech recognition to model both local and global dependencies of an audio sequence in a parameter-efficient way.
• They studied the importance of each component, and demonstrated that the inclusion of convolution modules is critical to the performance of the Conformer model.
• To this regard, they propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, Conformer model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. They also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
##### wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
• This paper from Baevski et al. from Facebook AI in NeurIPS 2020 shows for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
• Wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
• Compared to wav2vec, wav2vec 2.0 learns basic speech units used to tackle a self-supervised task. The model is trained to predict the correct speech unit for masked parts of the audio, while at the same time learning what the speech units should be.
• Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. With just 10 minutes of transcribed speech and 53K hours of unlabeled speech, wav2vec 2.0 enables speech recognition models at a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
• This opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy.
• They have also developed a cross-lingual approach, dubbed XLSR, that can learn speech units common to several languages. This approach helps when they have even small amounts of unlabeled speech, since languages for which they have little data can benefit from languages for which more data is available.
• Github repo; Facebook AI article.
##### HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
• Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion.
• This paper by Su et al. from Princeton and Adobe Research in 2020 introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio.
• They use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. HiFi-GAN relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
• The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
##### HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
• Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models.
• This paper by Kong et al. from Kakao Enterprise in NeurIPS 2020 proposes HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, they demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality.
• HiFi-GAN outperforms the best performing publicly available models in terms of synthesis quality, even comparable to human level. Moreover, it shows a significant improvement in terms of synthesis speed. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.
• They took inspiration from the characteristic of speech audio that consists of patterns with various periods and applied it to neural networks, and verified that the existence of the proposed discriminator greatly influences the quality of speech synthesis through the ablation study.
• HiFi-GAN shows ability to generalize to the mel-spectrogram inversion of unseen speakers and synthesize speech audio comparable to human quality from noisy inputs in an end-to-end setting. In addition, their small footprint model demonstrates comparable sample quality with the best publicly available autoregressive counterpart, while generating samples in an order-of-magnitude faster than real-time on CPU. This shows progress towards on-device natural speech synthesis, which requires low latency and memory footprint.
• Finally, their experiments show that the generators of various configurations can be trained with the same discriminators and learning mechanism, which indicates the possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators.
##### GAN-based Data Generation for Speech Emotion Recognition
• This paper by Eskimez et al. from Microsoft in Interspeech 2020 proposes a GAN-based method to generate synthetic data in the form of speech emotion spectrograms, which can be used for training speech emotion recognition networks. Specifically, they investigate the usage of GANs for capturing the data manifold when the data is eyes-off, i.e., where they can train networks using the data but cannot copy it from the clients.
• They propose a CNN-based GAN with spectral normalization on both the generator and discriminator, both of which are pre-trained on large unlabeled speech corpora. They show that our method provides better speech emotion recognition performance than a strong baseline.
• They proposed to use GANs for modeling imbalanced and highly skewed data among clients for future use, even after the original data is removed.
• Furthermore, they show that even after the data on the client is lost, their model can generate similar data that can be used for model bootstrapping in the future. Although they evaluated their method for speech emotion recognition, it can be applied to other tasks.
##### Unsupervised Cross-lingual Representation Learning at Scale
• This paper by Conneau et al. from Facebook AI in ACL 2020 shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
• They train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.
• Their model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER.
• XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models.
• They also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale.
• Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks.
##### Generalized end-to-end loss for speaker verification
• This paper by Wan et al. from Google in 2020 propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient (especially compared to their previous tuple-based end-to-end (TE2E) loss function).
• Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.
• Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, their model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60% at the same time.
• Both theoretical and experimental results verified the advantage of this novel loss function.
• They also introduce the MultiReader technique, which allows them to do domain adaptation — training a more accurate model that supports multiple keywords (i.e., “OK Google” and “Hey Google”) as well as multiple languages/dialects. By combining these two techniques, they produced more accurate speaker verification models.

#### 2021

##### Generative Spoken Language Modeling from Raw Audio
• This paper by Lakhotia et al. from Facebook AI in 2021 introduces Generative Spoken Language Modeling which learns speech representations from CPC, Wav2Vec2.0, and HuBERT for synthesizing speech.
• Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. They set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), they find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
##### Text-Free Prosody-Aware Generative Spoken Language Modeling
• Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored.
• This paper by Kharitonov et al. from Facebook AI in 2021 builds upon Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) which addresses the generative aspects of speech pre-training, by replacing text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech.
• In this work, they present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
• They devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
• Github repo
##### Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
• This paper by Polyak et al. from Facebook AI in Interspeech 2021 proposes using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, they separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner.
• They analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, they evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation.
• Lastly, they demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, they can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
• Github repo
##### WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
• Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging.
• This paper by Chen et al. from Furu Wei’s group at Microsoft Research in JSTSP 2021 proposes WavLM, a new large-scale pre-trained model trained on 94k hour audio, to solve full stack downstream speech processing tasks.
• WavLM extends the HuBERT framework to masked speech prediction and denoising modeling, enabling the pre-trained models to perform well on both ASR and non-ASR tasks.
• WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech denoising.
• In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. THey also scale up the training dataset from 60k hours to 94k hours.
• WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks such as speaker verification, speech separation, and speaker diarization.
• In contrast to previous SSL models, WavLM is not only effective for the ASR task but also has the potential to become the next-generation backbone network for speaker-related tasks.
• Github repo with code and pre-trained models.
##### Recent Advances in End-to-End Automatic Speech Recognition
• The following paper summary has been contributed by Zhibo Zhang.
• This paper by Li from Microsoft in APSIPA Transactions on Signal and Information Processing in 2021 reviewed the influential frameworks in end-to-end automatic speech recognition systems, the major challenges as well as the solutions and advances in this field.
• The author firstly reviewed three popular methods in this domain, including CTC (Connectionist Temporal Classification) by Graves et al., AED (Attention-based Encoder-Decoder) by Cho et al., Bahdanau et al. as well as RNN-T (RNN Transducer) by Graves.
• The author then analyzed two major encoder architectures - LSTMs by Hochreiter and Schmidhuber and Transformers by Vaswani et al., along with their limitations and variations.
• The author also mentioned other training criteria including knowledge distillation by Hinton et al. and minimum word error rate.
• It is easier to build a multilingual model with end-to-end systems compared to hybrid systems.
• The paper covered several major challenges for end-to-end models:
• It is difficult to adapt the model to the test speaker because of the small amount of adaptation data. Approaches to solve this issue include utilizing regularization techniques, multi-task learning as well as multi-speaker text-to-speech.
• The performance would be worse when adapting the end-to-end model to a different content domain due to the lack of the speech-text data pairs in the new domain. Approaches to overcome this problem include:
• Fusing the end-to-end model with an extra language model where the language model was trained on the text data of the new domain.
• Training the end-to-end model on the new domain by synthesizing speech from the text of the new domain utilizing TTS (text-to-speech) technologies.
• Adopting the spliced data method by Zhao et al..
• Improving the capability of making use of the context is challenging for end-to-end models and the author mentioned a few existing solutions that address this issue including adding a context encoder.

#### 2022

##### Direct speech-to-speech translation with discrete units
• This paper by Lee et al. from Facebook AI in 2022 presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
• They tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech.
• When target text transcripts are available, they design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
• Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, S2ST’s performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of their system for translation between unwritten languages.
• Audio samples
##### Textless Speech Emotion Conversion using Discrete and Decomposed Representations
• Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity.
• This paper by Kreuk et al. from Facebook AI in 2021 casts the problem of emotion conversion as a spoken language translation task. They use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
• First, they modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units.
• Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows them to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc.
• They demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. They rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
• Github repo
##### Generative Spoken Dialogue Language Modeling
• This paper by Nguyen et al. from Facebook AI in 2022 introduces dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels.
• It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking.
• Github repo
##### textless-lib: a Library for Textless Spoken Language Processing
• Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources.
• This paper by Kharitonov et al. from Facebook AI in 2022 introduces textless-lib, a PyTorch-based library aimed to facilitate research in this research area. They describe the building blocks that the library provides and demonstrate its usability by discuss three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation.
• They believe that textless-lib substantially simplifies research the textless setting and will be handful not only for speech researchers but also for the NLP community at large.
• Github repo
##### Self-Supervised Speech Representation Learning: A Review
• Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available.
• Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech.
• This paper by Mohamed et al. from Facebook AI in 2022 reviews the current approaches in the field for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, they review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
• This paper by Huang et al. from Facebook AI and CMU in 2022 introuces Audie-MAE, a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Audio-MAE learns to reconstruct masked spectrogram patches from audio recordings and achieves state-of-the-art performance on six audio and speech classification tasks.
• Following the Transformer encoder-decoder design in MAE, Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
• The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. They find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands.
• They then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
• They draw the four interesting observations:
• A simple MAE approach works surprisingly well for audio spectrograms.
• It is possible to learn stronger representations with local self-attention in the decoder.
• They show that masking can be applied to both pre-training and fine-tuning, improving accuracy and reducing training computation. The optimal strategy depends on the nature of the data (audio, image, etc.) and the learning type (self-/supervised).
• The best performance can be achieved by pre-training and fine-tuning under the same modality, without reliance on cross-modality transfer learning.
• Github repo with code and models.
##### Robust Speech Recognition via Large-Scale Weak Supervision
• This paper from Radford et al. from OpenAI in 2022 proposes Whisper, a model trained to predict large amounts of transcripts of audio on the internet and studies its capabilities.
• Whisper suggests that scaling weakly supervised pretraining has been underappreciated so far in speech recognition research. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning.
• When compared to humans, the models approach their accuracy and robustness.
• What is important to note is that Whisper achieves stellar results without the need for self-supervision and self-training techniques that have been a mainstay of recent large-scale speech recognition work and demonstrates how training on a large and diverse supervised dataset and focusing on zero-shot transfer can significantly improve the robustness of a speech recognition system.
• Project page.
##### AudioGen: Textually Guided Audio Generation
• This paper by Kreuk et al. from FAIR and the Hebrew University of Jerusalem in 2022 proposes AudioGen, which tackles the problem of generating audio samples conditioned on descriptive text captions.
• AudioGen is an auto-regressive generative model that operates on a learnt discrete audio representation and generates audio samples conditioned on text inputs.
• The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating ‘objects’ can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models.
• Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, they propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. They curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points.
• For faster inference, they explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. They apply classifier-free guidance to improve adherence to text.
• Comparing to the evaluated baselines, AudioGen outperforms over both objective and subjective metrics. Finally, they explore the ability of the proposed method to generate audio continuation conditionally and unconditionally.
• Audio samples.
##### AudioLM: a Language Modeling Approach to Audio Generation
• The following summary has been contributed by Zhibo Zhang.
• This paper by Borsos et al. from Google Research in 2022 proposes a generative language model approach to synthesize audios that are consistent and are of high quality.
• The method proposed contains the following stages:
• The tokenization stage that maps the single channel audio sequence into acoustic tokens and semantic tokens. Specifically, the SoundStream codec (Zeghidour et al., 2021) is adopted to produce acoustic tokens and the w2v-BERT model (Chung et al., 2021) is used to produce semantic tokens using intermediate layer representations. The acoustic token representations and the semantic token representations are for ensuring high quality and long-term consistency of the generated audio accordingly.
• The hierarchical modeling stage that is composed of the following three steps, as indicated in the illustration figure by Borsos et al.:
• Autoregressive modeling on the semantic tokens. This step is for learning long-term temporal structure.
• Coarse acoustic modeling conditioned on the acoustic tokens from the previous time steps that are produced by the first $Q’$ SoundStream quantizers. This step is for capturing high-level acoustic properties.
• Fine acoustic modeling conditioned on both the coarse tokens of all time steps and the fine tokens (from the last $Q - Q’$ quantizers) of the previous time steps. This step is for better capturing fine acoustic details.
• At inference time, AudioLM can be used to:
• Generate audios with diverse context, various speakers and acoustic conditions when there are no conditional restrictions.
• Generate audios of the same content with various speaker identities when conditioned on given semantic tokens.
• Generate continuations of the audio given an acoustic prompt.
• Empirically, the authors trained the AudioLM components on the unlab-60k train split of the Libri-Light dataset. In order to validate the functionality of the semantic tokens and the acoustic tokens.
• The authors conducted acoustic generation experiments conditioned on semantic tokens. Automatic speech recognition was performed on the generated audio, and with a low Word Error Rate, this shows that the system captures the linguistic content mostly relying on the semantic tokens.
• The authors also conducted speaker classification on the generated audio. A low classification accuracy suggests that the semantic tokens lack information about speaker identities.

### Multimodal

#### 2016

##### “Why Should I Trust You?” Explaining the Predictions of Any Classifier
• Trust is crucial for effective human interaction with machine learning systems, and that explaining individual predictions is important in assessing trust.
• This paper by Ribeiro et al. from Guestrin’s lab in UWash in 2016 proposes LIME, a novel model-agnostic modular and extensible explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. They further introduced SP-LIME, a method to explain models by selectingrepresentative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem and providing a global view of the model to users.
• They demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects.
• Their explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, getting insights into predictions, and detecting why a classifier should not be trusted.
• LIME - Local Interpretable Model-Agnostic Explanations blog post.

#### 2017

##### A Unified Approach to Interpreting Model Predictions
• While various methods have recently been proposed to help users interpret the predictions of complex models, it is often unclear how these methods are related and when one method is preferable over another.
• This paper from Lundberg and Lee from UWash in NeurIPS 2017 seeks to address this problem and presents a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations).
• SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties.
• The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, they present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
• Github repo.
##### mixup: Beyond Empirical Risk Minimization
• Large deep neural networks are powerful, but exhibit undesirable behaviors such as memorization and sensitivity to adversarial examples.
• This paper by Zhang et al. from MIT and FAIR in ICLR 2018 proposes mixup, a regularizer, which trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples.
• Their experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands, and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures.
• They also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.

#### 2019

##### Representation Learning with Contrastive Predictive Coding
• While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence.
• This paper by Oord et al. from Google in 2019 proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which they call Contrastive Predictive Coding (CPC), a framework for extracting compact latent representations to encode predictions over future observations.
• The key insight of CPC is to learn such representations by predicting the future in latent space by using powerful autoregressive models.
• CPC uses a probabilistic contrastive loss based on NCE, which both the encoder and autoregressive model are trained to jointly optimize, which they call InfoNCE. InfoNCE induces the latent space to capture information that is maximally useful to predict future samples.
• CPC combines autoregressive modeling and noise-contrastive estimation with intuitions from predictive coding to learn abstract representations in an unsupervised fashion.
• It also makes the model tractable by using negative sampling. While most prior work has focused on evaluating representations for a particular modality, they demonstrate that CPC is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
• The figure below offers an overview of Contrastive Predictive Coding, the proposed representation learning approach. Although this figure shows audio as input, they use the same setup for images, text, and reinforcement learning.

• They tested these representations in a wide variety of domains: audio, images, natural language, and reinforcement learning and achieve strong or state-of-the-art performance when used as stand-alone features.
• The simplicity and low computational requirements to train the model, together with the encouraging results in challenging reinforcement learning domains when used in conjunction with the main loss are exciting developments towards useful unsupervised learning that applies universally to many more data modalities.

#### 2020

##### Modality Dropout for Improved Performance-driven Talking Faces
• This paper by Adbelaziz et al. from Apple in 2020 introduces the idea of Modality Dropout (MDO). The begin by describing a novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information.
• To ensure that the proposed model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training.
• Their trained model runs in real-time on resource limited hardware (e.g., a smart phone), it is user agnostic, and it is not dependent on a potentially error-prone transcription of the speech.
• They use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.

#### 2021

##### Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
• All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
• This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 proposes using data augmentation to compare two data-centric AI methods to improve performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data.
• Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.
##### Learning Transferable Visual Models From Natural Language Supervision
• This paper by Radford et al. from OpenAI introduces CLIP, a pre-training task which efficiently learns visual concepts from natural language supervision. CLIP uses vision and language encoders trained in isolation and uses a contrastive loss to bring similar image-text pairs closer, while pulling apart dissimilar pairs as a part of pretaining.
• CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
• CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. They then use this behavior to turn CLIP into a zero-shot classifier. They convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
• It can rival the generalization of ImageNet SoTA models (since it was pretained on 400M image and noisy text pairs) and is thus typically used for zero-shot image classification and zero-shot cross-modal searches.
• OpenAI article.
##### Zero-Shot Text-to-Image Generation
• Text-to-image generation (i.e., language-guided image generation) has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training.
• This paper by Ramesh et al. from OpenAI introduces DALL-E which offers a simple approach for text-to-image generation based on an autoregressive transformer which models the text and image tokens as a single stream of data. DALL-E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively.
• They find that sufficient data and scale can lead to improved generalization, both in terms of zero-shot performance relative to previous domain-specific approaches, and in terms of the range of capabilities that emerge from a single generative model. Their findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task.
• OpenAI article.
##### ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
• This paper by Kim et al. from NAVER AI and Kakao in 2021 introduces Vision-and-Language Transformer (ViLT) that seeks to improve performance on various joint vision-and-language downstream tasks using Vision-and-Language Pre-training (VLP).
• CLIP and Hugging Face’s VisionEncoderDecoder utilize image and language encoders learned/trained in isolation and aligning/gluing them using either (i) cross-entropy loss that utilizes cross-attention (in case of VisionEncoderDecoder), and (ii) contrastive loss (in case of CLIP). This is shown in the figure below from Prithvi Da which summarizes the aforementioned approaches.

• The downside of the above approach is poor image-text alignment, huge data appetite and longer training time. This approach is useful to create a downstream generative model to tackle applications such as cross-modal retrieval, say OCR or image captioning or content based image retrieval (CBIR) or even text2image (using DALL-E or CLIPDraw). However, there are derived/advanced multimodal tasks involving vision and language that are much more complicated in nature such as Natural Language for Visual Reasoning (NLVR), Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Visual Navigation, etc. than the aforementioned higher-order tasks. The diagram below from Prithvi Da summarizes the hierarchy of image-based tasks.

• In order to tackle derived tasks in a similar way, they need to train image and language data jointly (rather than in isolation) in a “mixed-modal” fashion with a combination of image level loss, language level loss, and alignment loss. This is the underlying idea behind VLP. The diagram below from Prithvi Da summarizes the two approaches of aligning/gluing the modalities together (with either cross-entropy loss or contrastive loss) independently-trained vision and language encoders vs. training both encoders jointly.

• Current approaches to VLP heavily rely on image feature extraction processes using convolutional visual embedding networks (e.g., Faster R-CNN and ResNets), which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). This is problematic in terms of both efficiency/speed, in that extracting input features requires much more computation than the multimodal interaction steps; and expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary.
• ViLT seeks to remedy the above two issues by presenting a minimal VLP model, which is monolithic in that the processing of visual inputs is drastically simplified to just the same convolution-free manner that they process textual inputs. In other words, the unique selling point of ViLT is that while most VLP models rely on object detectors, CNNs or transformers for feature extraction (for e.g., UNiTER, LXMERT and VisualBERT need Faster-RCNN for object detection), ViLT stands out of the crowd by removing the need for object detectors. ViLT accomplishes this by avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer projection and achieves similar results with reduced complexity, as shown in the diagram below:

• Self-supervision is accomplished using (i) Image Text Matching (ITM) loss and (ii) Masked Language Model (MLM) loss. ITM loss is an alignment loss that encompasses cross-modality interaction between image and text. ITM requires positive and negative pairs. For text, ViLT simply reuses Masked Language Model (MLM), used in BERT.
• ViLT is pre-trained on four datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions. They evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, they use VQAv2 and NLVR2; for retrieval, they use MSCOCO and Flickr30K (F30K).
• Finally, they show that ViLT is over 10x faster than previous VLP models, yet with competitive or better downstream task performance.
• The key takeaway in this paper is that VLP needs to focus more on the multi-modality interactions aspect inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders. ViLT-B/32 is a proof of concept that efficient VLP models free of convolution and region supervision can still be competent.
• Github repo with code and pre-trained weights; HuggingFace docs; ViLT tutorials/notebooks.
##### MLIM: Vision-and-language Model Pre-training With Masked Language and Image Modeling
• Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image-pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks for Masked ImageRegion Modeling (MIRM). Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models.
• This paper by Arici et al. from Amazon in 2021 presents Masked Language and Image Modeling (MLIM) for VLP. MLIM is pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: Masked Language Modeling (MLM) loss (as in BERT) for text, and image reconstruction (RECON) loss for image, coupled with Modality Aware Masking (MAM). MAM determines the masking probability and applies masking to both word and image embeddings. MLP is based on BERT predict the masked words from available words and image regions. They follow BERT for this task: two-layer MLP MLM head outputting logits over the vocabulary. MLM loss is negative log-likelihood for masked word. The RECON loss is an an average of pixel-wise sum of squared errors (SSE). Both image and word masking is realized by replacing an embedding with the embedding of [MASK]. This way transformer layers recognize [MASK]’s embedding as a special embedding that needs to be “filled in”, independent of the modality, by attending to other vectors in the layer inputs.
• Note that unlike other architectures (LXMERT, UNiTER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by the object detector, but a shallow CNN as an image embedder which is much more lightweight than deep models like ResNet and is designed to be masking friendly. MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
• MLIM uses no specific alignment loss, but instead proposes Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, they present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
• Since the the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-modality by employing Modality Dropout (MDO). MDO improves fine-tuning by randomly dropping one of the modalities. Similar to MAM, MDO operates in one of the three modes on a micro-batch: text-only, image-only, and image-text mode.
• The authors also tried using the ITM loss proposed in ViLT. However, RECON instead of ITM loss offers better PR AUC. Similarly, using the ITM loss together with MLM and RECON does not change the performance.
• The key takeways from this paper are that MLIM is a simplified VLP method using MLM and RECON losses and MAM. They simplify loss function design, propose a shallow CNN-based image embedder to avoid heavyweight object-detectors and present an image decoder to enable RECON loss. They believe VLP datasets (e.g. e-commerce datasets) are large enough to enable learning built-in image embedders during pre-training. While alignment-based loss functions are promising and help in learning contrastive features, finding good image-text pairs (especially negative pairs) becomes an issue and makes pre-training rely on pairing techniques. On the other hand finer-grained alignment objectives such as alignment and MIRM objectives do not have ground truth. Masked Image-Region Modeling (MIRM) relies on RoI features and classes predicted by the object detector. Furthermore MIRM tasks aim to “fill in” masked regions. However the proposed RECON task aims to reconstruct the whole image and is designed to get the best cross-modality interaction inside the transformer.

##### MURAL: Multimodal, Multi-task Retrieval Across Languages
• This paper by Jain and Yang from Google Research in EMNLP 2021 describes MURAL, a representation model for image–text matching that uses multitask learning applied to image–text pairs in combination with translation pairs covering 100+ languages.
• MURAL shows that training jointly using translation pairs helps overcome the scarcity of image–text pairs for many under-resourced languages and improves cross-modal performance.
• MURAL consistently outperforms prior state-of-the-art models in multilingual image-to-text and text-to-image retrieval.
• Additionally, when visualizing MURAL’s embeddings with LaBSE’s, it is interesting to observe hints of areal linguistics and contact linguistics in the text representations learned by using a multimodal model.
##### Perceiver: General Perception with Iterative Attention
• Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities.
• This paper by Jeagle et al. from DeepMind in ICML 2021 introduces the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
• The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio.
• The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.

#### 2022

##### DeepNet: Scaling Transformers to 1,000 Layers
• This paper by Wang et al. from Microsoft Research in 2022 introduces DeepNet, a new method that allows train extremely deep transformers with 1000L+ layers – order of magnitude improvements over existing efforts and with theoretical justification.
• DeepNet is fundamental, effective and simple. It can be used in any Transformer architecture (encoder, decoder, encoder-decoder) which covers almost all different tasks across AI areas (language, vision, speech, multimodal, and beyond). It is not only for 1000L+ Transformers, but also important and effective for training existing large models (e.g., [24, 100] layers). It combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making it a preferred alternative for any Transformers model training.
• At the core of DeepNet is a newly proposed normalization function (called DeepNorm) which modifies the residual connection in Transformers. DeepNorm has theoretical justification of bounding the model update by a constant which makes stable training possible in a principled way. They only need lines of code change to make it work in existing Transformer implementation.
• DeepNorm modifies the residual connection in the Transformer architecture by up-scaling it before performing layer normalization. It works alongside a dedicated initialization scheme based on Xavier initialization.
• These two tricks lead to greater stability during the training which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1000 layers.
• DeepNet’s 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points a in multilingual translation task with 7,482 translation directions.
• Github repo.
##### data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
• While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
• This paper by Baevski et al. from Facebook in 2022 helps get us closer to general self-supervised learning by presenting data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self distillation setup using a standard Transformer architecture.
• Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
• Today’s self-supervised learning research almost typically focuses on a single modality. As a result, researchers specializing in one modality often adopt a totally different strategy than those specializing in another. Researchers train algorithms to fill in blanks in sentences in the case of the text. On the other hand, speech models must learn an inventory of essential speech sounds, like forecasting missing sounds. In computer vision, models are frequently taught to assign comparable representations to a color image of a cow, and the same image flipped upside down, allowing them to correlate the two far more closely than they would with an unrelated image like a duck. data2vec symbolizes a new paradigm of holistic self-supervised learning, in which further research enhances several rather than just one modality.
• For each modality, algorithms anticipate distinct units: pixels or visual tokens for images, words for the text, and learned sound inventories for voice. Because a collection of pixels differs significantly from an audio waveform or a passage of text, algorithm creation has been related to a particular modality. This means that algorithms in each modality continue to work differently. Data2vec makes this easier by teaching models to anticipate their own representations of the incoming data, regardless of mode. Instead of predicting visual tokens, phrases, or sounds, a single algorithm may work with completely different sorts of input by focusing on these representations — the layers of a neural network. This eliminates the learning task’s reliance on modality-specific targets. It also doesn’t use contrastive learning or reconstructed input examples.
• It was necessary to define a robust normalization of the features for the job that would be trustworthy in different modalities to directly predict representations. The method starts by computing target representations from an image, a piece of text, or a voice utterance using a teacher network. After that, a portion of the input was masked and repeated with a student network, which predicts the teacher’s latent representations. Even though it only has a partial view of the data, the student model must predict accurate input data. The instructor network is identical to the student network, except with somewhat out-of-date weights.
• The method was tested on the primary ImageNet computer vision benchmark, and it outperformed existing processes for a variety of model sizes. It surpassed wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised voice algorithms. It was put through its paces on the popular GLUE benchmark suite for text, and it came out on par with RoBERTa, a reimplementation of BERT.
• Key takeaways:
• data2vec is a self-supervised algorithm that works for multiple modalities outperforming the previous best single-purpose algorithms for computer vision and speech and generating competitive scores on NLP tasks.
• The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input.
• Method:
• data2vec is trained by predicting the model representations of the full input data given a partial view of the input
• They first encode a masked version of the training sample (model in student mode) and then construct training targets by encoding the unmasked version of the input sample with the same model but when parameterized as an exponentially moving average of the model weights (model in teacher mode)
• The target representations encode all of the information in the training sample and the learning task is for the student to predict these representations given a partial view of the input.
• Modality encoding:
• The model architecture used is the standard Transformer architecture with a modality-specific encoding of the input data borrowed from prior work:
• For computer vision, they have used the ViT-strategy of encoding an image as a sequence of patches, each spanning 16x16 pixels, input to a linear transformation.
• Speech data is encoded using a multi-layer 1-D convolutional neural network that maps 16 kHz waveform to 50 Hz representations.
• Text is pre-processed to obtain sub-word units, which are then embedded in distributional space via learned embedding vectors.
• Ablations (layer-averaged targets):
• They have used targets which are based on averaging multiple layers from the teacher network.

##### Hierarchical Text-Conditional Image Generation with CLIP Latents
• In January 2021, OpenAI introduced DALL-E. A year later, their newest system, DALL-E 2, generates more realistic and accurate images with 4x greater resolution, better caption matching and photorealism.
• Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
• This paper by Ramesh et al. from OpenAI in 2022 proposes DALL-E 2 leverages these representations for image generation, they propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a “unCLIP” decoder that generates an image conditioned on the image embedding.
• They show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
• Their decoder, which is conditioned on image representations, can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation.
• They use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples
• OpenAI article.
##### AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
• Recently, large pre-trained models have significantly improved the performance of various Natural LanguageProcessing (NLP) tasks but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast evolving models, considering serving performance, and optimizing for multiple objectives.
• This paper by Zhang et al. from Google in 2022 solve these problems, they propose AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. They use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on target hardware. The experiments on TPUv4i show the finding of seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
• By running downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistillBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperform BERT_BASE(109M), DistillBERT(67M), TinyBERT(67M), and MobileBERT(25.3M) regarding the average GLUE score. By evaluating on SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, which reduces parameters by more than 62% while maintaining higher accuracy than DistillBERT, TinyBERT, and NAS-BERT.
##### A Generalist Agent
• This paper by Reed et al. from DeepMind in 2022 proposes Gato, a single generalist agent beyond the realm of text outputs, inspired by progress in large-scale language modeling.
• Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
• The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data from different tasks and modalities, it is serialized into a flat sequence of tokens. In this representation, Gato can be trained and sampled from akin to a standard large-scale language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.
• Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196.
• Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. The authors envision that in the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.
• DeepMind page.
##### Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
• Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity/quality and text relevancy (i.e., adherence to text of generated images), several pivotal gaps remain unanswered, limiting applicability and quality.
• This paper by Gafni et al. from Meta AI in 2022 proposes a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case.
• While some methods propose image editing techniques, progress is not often directed towards enabling new forms of human creativity and experiences. They attempt to progress text-to-image generation towards a more interactive experience, where people can perceive more control over the generated outputs, thus enabling real-world applications such as storytelling.
• In addition to improving the general image quality, they focus on improving key image aspects that are significant in human perception, such as faces and salient objects, resulting in higher favorability of their method in human evaluations and objective metrics.
• Their model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512 × 512 pixels, significantly improving visual quality. Through scene controllability, they introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story they wrote.
##### i-Code: An Integrative and Composable Multimodal Learning Framework
• Human intelligence is multimodal; they integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities.
• This paper by Yang et al. from Microsoft in 2022 presents i-Code, a self-supervised pretraining framework which jointly learns representations for vision, language and speech into a unified, shared and general-purpose vector representation.
• In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning.
• They show that pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to. i-Code can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space.
• Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
• The figure below from the paper shows the overall model architecture of i-Code. Shown on the right is the attention and feed-forward operation in a fusion network layer with (a) merge-attention layers and (b) co-attention layers. To facilitate more effective cross-modality understanding and design the best fusion architecture, they explore two variations of the traditional attention mechanism: mechanisms that merge and cross the attention scores of different modalities, namely merge-attention (based on self-attention) and co-attention (based on self- and cross-attention) respectively. Note that for simplicity, only the residual connection of the language modality is drawn, but all three modalities use residual connections.

##### VL-BEIT: Generative Vision-Language Pretraining
• This paper by Bao et al. from Furu Wei’s research group at Microsoft Research introduces a vision-language foundation model called VL-BEIT, a simple and effective approach to pretraining a bidirectional multimodal Transformer encoder for both vision-language and vision tasks learned by generative pretraining. Their minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer.
• VL-BEIT solely employs generative pretraining tasks, including masked language modeling on texts, masked image modeling on images, and masked vision-language modeling on image-text pairs. VL-BEIT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training which renders it conceptually simple and empirically effective.
• Experimental results show that VL-BEIT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification, and semantic segmentation.
• Github repo.
##### FLAVA: A Foundational Language And Vision Alignment Model
• This paper by Singh et al. from Meta AI Research in CVPR 2022 presents FLAVA, a foundational vision and language alignment model that performs well on all three target modalities: 1) vision, 2) language, and 3) vision & language.
• State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a “foundation”, that targets all modalities at once – a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks.
• FLAVA was trained on a corpus of publicly available datasets that is several orders of magnitude smaller than similar recent models, but still obtained better or competitive performance. FLAVA paves the way forward towards generalized but open models that perform well on a wide variety of multimodal tasks.
• FLAVA demonstrates impressive performance on a wide range of 35 tasks spanning these target modalities.
##### Flamingo: a Visual Language Model for Few-Shot Learning
• In recent years, large-scale pre-training followed by task-specific fine-tuning has emerged as a standard approach, but the fine-tuning step still requires a lot of samples. In other words, building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
• This paper by Alayrac et al. from DeepMind in 2022 introduces Flamingo, a family of Visual Language Models (VLM) which seek to train a multi-modal model (i.e., with the ability to understand different types of input – visual, audio, text etc.) in a few-shot learning approach (which refers to the ability to learn a new task with just a few samples for training).
• Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs.
• The key ideas behind Flamingo are:
• Interleave cross-attention layers with language-only self-attention layers (frozen).
• Perceiver-based architecture that transforms the input sequence data (videos) into a fixed number of visual tokens.
• Large-scale (web) multi-modal data by scraping webpages which has inter-leaved text and images.
• Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities.
• They perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering.
• For tasks lying anywhere on this spectrum, they demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.

##### Stable and Latent Diffusion Model
• The following blog post summary has been contributed by Zhibo Zhang.
• This blog post from Hugging Face describes stable diffusion, a latent representation model developed by CompVis, Stability AI and LAION.
• According to the blog, the stable diffusion model takes in a text description as input, where the text encoder from the CLIP model is used to generate a representation for the text input.
• A latent image representation of size 64 * 64 is initialized based on the Gaussian distribution. A UNet (conditioned on the text representation) works together with a scheduler algorithm to denoise the latent representation. Generally, 50 denoising iterations are sufficient to generate images of high quality. After the denoising process, the decoder of a variational autoencoder is responsible for reconstructing the latent representation back into the image of size $$512 \times 512$$.
##### DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
• The following summary has been contributed by Zhibo Zhang.
• This paper by Ruiz et al. from Google Research and Boston University in 2022 introduces DreamBooth, which generates subjects with diverse contexts through text-to-image diffusion model fine-tuning.
• Specifically, this work defines a new problem setting: recontextualize the specified subject while ensuring that the key visual features of the original subject are preserved.
• In order to achieve this goal, the authors adopted the pre-trained Imagen model (Saharia et al.) and fine-tune it using around 3 to 5 images of a chosen subject as follows:
• The fine-tuning of the low-resolution part of the model: The image generation process is conditioned on the text which is composed of the class noun and a rare token identifier for the subject. The objective function contains two parts: 1. The reconstruction loss to ensure that the generated images are similar to the input images. 2. The class-specific prior preservation loss to ensure that the generated images have diversity.
• The fine-tuning of the super-resolution part of the model: Only the reconstruction loss is used. This step is to ensure the preservation of fine-grained details of the subjects in output images.
• The authors discussed a few application scenarios of the DreamBooth framework including recontextualization, art renditions, expression manipulation, novel view synthesis, accessorization as well as property modification and displayed some example images for each application.
• The authors also performed ablation studies validating that:
• It is necessary to use the correct class noun in the input text.
• The prior preservation encourages diversity in the generated images.
• Using low-level noise when fine-tuning the super-resolution component improves the quality of the generated images.
##### UniT: Multimodal Multitask Learning with a Unified Transform
• This paper by Hu And Singh from Facebook AI in 2021 proposes UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
• Based on the transformer encoder-decoder architecture, UniT encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task.
• Compared to previous efforts on multi-task learning with transformers, they share the same model parameters across all tasks instead of separately finetuning task-specific models and handle a much higher variety of tasks across different domains.
• In their experiments, they learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters.
• Code.
##### Perceiver IO: A General Architecture for Structured Inputs & Outputs
• A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs.
• This paper by Jeagle et al. from DeepMind in ICLR 2022 proposes Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs.
• Perceiver IO augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II.
• As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
##### Foundation Transformers
• The following summary has been contributed by Zhibo Zhang.
• Transformers are widely adopted across various input modalities such as speech, text and images. However, Transformers for different input modalities are generally designed to have distinct implementations such that the best performance can be achieved in each domain.
• Foundation Transformers by Wang et al. from Microsoft in 2022 proposes the MAGNETO architecture, a general purpose transformer that can achieve stable task performance under various input modalities.
• A key component is the introduction of Sub-LN (Sub-LayerNorm). As shown in the illustration figure by Wang et al., there are two layer normalization operations in both the multi-head attention module and the feed-forward network module accordingly. Specifically, for the multi-head attention module, compared to Pre-LN, Sub-LN introduces one more layer normalization operation following the multi-head self-attention component. For the feedforward network module, compared to Pre-LN, Sub-LN introduces one more layer normalization operation following the ReLU activation function.
• With theoretical support, the authors showed the best initialization and weight scaling approaches for the encoder-only / decoder-only architecture and the encoder-decoder architecture.
• Empirically, the authors validated the effectiveness of MAGNETO in domains with different input modalities including language, vision, speech and vision-language:
• For the language domain, the authors conducted experiments on tasks including causal language modeling, masked language modeling as well as neural machine translation. On average, MAGNETO performed better than the comparison methodologies.
• For the vision domain, the authors compared MAGNETO with the Vision Transformer (Dosovitskiy et al., 2021) with Pre-LN on both ImageNet (and its variants) image classification and ADE20k semantic segmentation tasks. MAGNETO outperformed Pre-LN in terms of top-1 accuracy for classification and in terms of mIoU score for semantic segmentation.
• For the speech recognition task, MAGNETO achieved lower Word Error Rates compared to pre-LN.
• In addition, the MAGNETO module also outperformed pre-LN on two vision-language tasks.

### RecSys

#### 2015

##### Collaborative Deep Learning for Recommender Systems
• Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse.
• This paper by Wang et al. from HKU addresses this problem, by generalizing recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

#### 2016

##### Wide & Deep Learning for Recommender Systems
• Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. However, memorization and generalization are both important for recommender systems. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.
• This paper by Cheng et al. from Google in 2016 introduced Wide & Deep learning – jointly trained wide linear models and deep neural networks – to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low dimensional embeddings. Wide & Deep learning framework to combine the strengths of both types of models. In other words, the fusion of wide and deep models combines the strengths of memorization and generalization, and provides us with better recommendation systems. The two models are trained jointly with the same loss function.
• They productionized and evaluated the system on Google Play Store, a massive-scale commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.

#### 2017

##### DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
• Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering.
• This paper by Guo et al. from Harbin Institute of Technology and Huawei in 2017 proposes DeepFM, an end-to-end learning model that emphasizes both low- and high-order feature interactions. DeepFM is a Factorization-Machine (FM) based Neural Network for CTR prediction, to overcome the shortcomings of the state-of-the-art models and to achieve better performance. DeepFM trains a deep component and an FM component jointly and models low-order feature interactions through FM and models high-order feature interactions through the DNN. Unlike Google’s Wide & Deep Model, DeepFM can be trained end-to-end with a shared input to its “wide” and “deep” parts, with no need of feature engineering besides raw features.
• DeepFM gains performance improvement from these advantages: 1) it does not need any pre-training; 2) it learns both high- and low-order feature interactions; 3) it introduces a sharing strategy of feature embedding to avoid feature engineering.
• DeepFM thus combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.
• Extensive experiments were conducted to demonstrate the effectiveness and efficiency of DeepFM over the existing models for CTR prediction, on two real-world datasets (Criteo dataset and a commercial App Store dataset) to compare the effectiveness and efficiency of DeepFM and the state-of-the-art models. Their experiment results demonstrate that 1) DeepFM outperforms the state-ofthe-art models in terms of AUC and Logloss on both datasets; 2) The efficiency of DeepFM is comparable to the most efficient deep model in the state-of-the-art.

#### 2019

##### Deep Learning Recommendation Model for Personalization and Recommendation Systems
• With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood.
• This paper by Naumov et al. from Facebook in 2019 proposes a state-of-the-art deep learning recommendation model (DLRM). They open-source its implementation in both PyTorch and Caffe2 frameworks.
• The DLRM model handles continuous (dense) and categorical (sparse) features that describe users and products. DLRM exercises a wide range of hardware and system components, such as memory capacity and bandwidth, as well as communication and compute resources as shown in the figure below.

• Furthermore, they design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers.
• Compared to other DL-based approaches to recommendation, DLRM differs in two ways. First, it computes the feature interactions explicitly while limiting the order of interaction to pairwise interactions. Second, DLRM treats each embedded feature vector (corresponding to categorical features) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit that should yield different cross terms. These design choices help reduce computational/memory cost while maintaining competitive accuracy.
• They compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation, system co-design, and benchmarking.

### Core ML

#### 1991

##### What Every Computer Scientist Should Know About Floating-Point Arithmetic
• This gem by Goldberg et al. from Oracle in the 1991 issue of ACM Computing Surveys helps demystify your errors about computer arithmetic and enables you to write more careful code.

#### 2001

##### Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers
• Accurate, well-calibrated estimates of class membership probabilities are needed in many supervised learning applications, in particular when a cost-sensitive decision must be made about examples with example-dependent costs.
• This paper by Zadrozny and Elkan from UCSD in 2001 presents histogram binning, a simple but commonly-used calibration concept for obtaining calibrated probability estimates from decision tree and naive Bayesian classifiers.
• Using the large and challenging KDD’98 contest dataset as a testbed, they report the results of a detailed experimental comparison of ten methods, according to four evaluation measures.
• They conclude that binning succeeds in significantly improving naive Bayesian probability estimates, while for improving decision tree probability estimates, they recommend smoothing by $$m$$-estimation and a new variant of pruning that they call curtaitment.

#### 2002

##### Transforming classifier scores into accurate multiclass probability estimates
• Class membership probability estimates are important for many applications of data mining in which classification outputs are combined with other sources of information for decision-making, such as example-dependent misclassification costs, the outputs of other classifiers, or domain knowledge. Previous calibration methods apply only to two-class problems.
• This paper by Zadrozny and Elkan from UCSD in 2002 proposes isotonic regression, which helps obtain accurate probability estimates for multiclass problems by combining calibrated binary probability estimates.
• They also propose a new method for obtaining calibrated two-class probability estimates that can be applied to any classifier that produces a ranking of examples.
• Using naive Bayes and support vector machine classifiers, they give experimental results from a variety of two-class and multiclass domains, including direct marketing, text categorization and digit recognition.
##### Dimensionality Reduction by Learning an Invariant Mapping
• This paper by Hadsell et al. from LeCun’s lab in CVPR 2006 first introduced the concept of a contrastive loss.
• Contrastive loss is a distance-based loss as opposed to more conventional error-prediction losses. This loss is used to learn embeddings in which two “similar” points have a low Euclidean distance and two “dissimilar” points have a large Euclidean distance.
• Two samples are either similar or dissimilar. This binary similarity can be determined using several approaches:
• In this work, the $$N$$ closest neighbors of a sample in input space (e.g. pixel space) are considered similar; all others are considered dissimilar. (This approach yields a smooth latent space; e.g. the latent vectors for two similar views of an object are close)
• To the group of similar samples to a sample, we can add transformed versions of the sample (e.g. using data augmentation). This allows the latent space to be invariant to one or more transformations.
• We can use a manually obtained label determining if two samples are similar. (For example, we could use the class label. However, there can be cases where two samples from the same class are relatively dissimilar, or where two samples from different classes are relatively similar. Using classes alone does not encourage a smooth latent space.)
• Formally, if we consider $\vec{X}$ as the input data and $G_W(\vec{X})$ the output of a neural network, the interpoint distance is given by,
$D_W\left(\vec{X}_1, \vec{X}_2\right)=\left\|G_W\left(\vec{X}_1\right)-G_W\left(\vec{X}_2\right)\right\|_2$
• The contrastive loss is simply,

\begin{aligned} \mathcal{L}(W) &=\sum_{i=1}^P L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) \\ L\left(W,\left(Y, \vec{X}_1, \vec{X}_2\right)^i\right) &=(1-Y) L_S\left(D_W^i\right)+Y L_D\left(D_W^i\right) \end{aligned}
• where $Y=0$ when $X_1$ and $X_2$ are similar and $Y=1$ otherwise, and $L_S$ is a loss for similar points and $L_D$ is a loss for dissimilar points.
• More formally, the contrastive loss is given by,

\begin{aligned} &L\left(W, Y, \vec{X}_1, \vec{X}_2\right)= \\ &\quad(1-Y) \frac{1}{2}\left(D_W\right)^2+(Y) \frac{1}{2}\left\{\max \left(0, m-D_W\right)\right\}^2 \end{aligned}
• where $$m$$ is a predefined margin.
• The gradient is given by the simple equations:

$\begin{gathered} \frac{\partial L_S}{\partial W}=D_W \frac{\partial D_W}{\partial W} \\ \frac{\partial L_D}{\partial W}=-\left(m-D_W\right) \frac{\partial D_W}{\partial W} \end{gathered}$
• Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images. During training, an image pair is fed into the model with their ground truth relationship: equals 1 if the two images are similar and 0 otherwise. The loss function for a single pair is:

$y d^2+(1-y) \max (\operatorname{margin}-d, 0)^2$
• where $$d$$ is the Euclidean distance between the two image features (suppose their features are $$f_1$$ and $$f_2$$): $$d=\left \mid f_1-f_2\right \mid_2$$. The $$margin$$ term is used to “tighten” the constraint: if two images in a pair are dissimilar, then their distance should be at least $$margin$$, or a loss will be incurred.
• Shown below are the results from the paper which are quite convincing:

• Note that while this is one of the earliest of the contrastive losses, this is not the only one. For instance, the contrastive loss used in SimCLR is quite different.

#### 2007

##### What Every Programmer Should Know About Memory
• This must-read paper by Drepper from Red Hat in 2007 offers a detailed treatment on how system memory works.
##### Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
• The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.
• This paper by Kohavi et al. from Microsoft in KDD 2007 provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Their experience indicates that significant learning and return-oninvestment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO).
• They provide several examples of controlled experiments with surprising results. They review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). They focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction.
• They describe common architectures for experimentation systems and analyze their advantages and disadvantages. They evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements.
• Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on their extensive practical experience with multiple systems and organizations, they share key lessons that will help practitioners in running trustworthy controlled experiments.

#### 2011

##### SMOTE: Synthetic Minority Over-sampling Technique
• This paper by Chawla et al. from University of South Florida introduces an approach to the construction of classifiers from imbalanced datasets.
• A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class.
• This paper shows that a combination of the proposed method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes.
• Their method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier.
• The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

#### 2012

##### Improving neural networks by preventing co-adaptation of feature detectors
• This paper by Hinton et al. in 2012 introduced Dropout as a way to avoid overfitting.
• When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This overfitting is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.
• Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
• Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
##### Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
• Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher’s experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale — thousands of experiments now — has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory.
• This paper by Kohavi et al. from Microsoft in KDD 2012 presents the authors’ learnings as they happened: puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain. Each of these took multiple-person weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments.
• At Microsoft’s Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, thus getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.
• The topics they cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.

#### 2014

##### Dropout: A Simple Way to Prevent Neural Networks from Overfitting
• This paper by Srivastava et al. from Hinton’s lab in JMLR 2014 introduced Dropout, which (just like Batchnorm) is now part of the standard recipe for regularizing deep neural nets.

#### 2015

##### ADAM: A Method for Stochastic Optimization
• This paper by Kingma and Ba in ICLR 2015 introduces Adam (derived from adaptive moment estimation), an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.
• It is a fusion of RMSProp with momentum and involves calculating the exponentially weighted moving average of the first moment and second moment (which are gated by the hyper parameters $$\beta_1$$ and $$\beta_2$$ respectively).
• The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. They also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, they discuss AdaMax, a variant of Adam based on the infinity norm.
##### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
• This paper by Ioffe and Szegedy from Google in ICML 2015 introduced BatchNorm, which is now commonly implemented to accelerate training of deep neural nets.
• Also, check out this in-depth article on BatchNorm here.

#### 2016

##### XGBoost: A Scalable Tree Boosting System
• This paper by Chen and Guestrin from UWash in 2016 proposes eXtreme Gradient Boost (XGBoost), a scalable end-to-end tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems.
• They propose a novel sparsity aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate tree learning.
• Their experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well.
• By combining these insights, XGBoost is able to solve real-world scale problems using far fewer resources than existing systems..
##### Layer Normalization
• Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks.
• This paper by Ba et al. from Hinton’s lab in 2016 transposes batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, they also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity.
• Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step.
• Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, they show that layer normalization can substantially reduce the training time compared with previously published techniques.

#### 2017

##### Decoupled Weight Decay Regularization
• L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam.
• This paper by Loshchilov and Hutter from University of Freiburg in ICLR 2019 proposes Adam with decoupled weight decay (AdamW), a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), they identify and expose the inequivalence of L2 regularization and weight decay for Adam.
• They provide empirical evidence that AdamW proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam’s generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). They empirically show that AdamW yields substantially better generalization performance than the common implementation of Adam with L2 regularization. They also proposed to use warm restarts for Adam to improve performance.
• Their results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. While they focus their experimental analysis on Adam, they believe that similar results also hold for other adaptive gradient methods, such as AdaGrad (Duchi et al., 2011) and AMSGrad (Reddi et al., 2018).
• AdamW has been implemented in TensorFlow and PyTorch.
• Github repo.
##### On Calibration of Modern Neural Networks
• Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced.
• This paper by Guo et al. from Cornell University in ICML 2017 proposes temperature scaling. They begin by discovering that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, they observe that model capacity (in terms of depth, width), weight decay (regularization), and Batch Normalization are important factors affect calibration while improving accuracy.
• They evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets.
• They suggest that simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling – a single-parameter variant of Platt Scaling – is the simplest, fastest, and most straightforward of the methods at calibrating predictions.
##### Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers
• For optimal decision making under variable class distributions and misclassification costs a classifier needs to produce well-calibrated estimates of the posterior probability. Isotonic calibration is a powerful non-parametric method that is however prone to overfitting on smaller datasets; hence a parametric method based on the logistic curve is commonly used.
• This paper by Kull et al. from University of Bristol and Universidade Federal de Pernambuco demonstrates that while logistic calibration is designed for normally distributed per-class scores, many classifiers including Naive Bayes and Adaboost suffer from a particular distortion where these score distributions are heavily skewed. In such cases logistic calibration can easily yield probability estimates that are worse than the original scores. Moreover, the logistic curve family does not include the identity function, and hence logistic calibration can easily uncalibrate a perfectly calibrated classifier.
• The papers seeks to solve all these problems with a richer class of calibration maps based on the beta distribution. THey derive the method from first principles and show that fitting it is as easy as fitting a logistic curve.
• Extensive experiments show that beta calibration is superior to logistic calibration for Naive Bayes and Adaboost.
##### Understanding Black-box Predictions via Influence Functions
• The following paper summary has been contributed by Zhibo Zhang.
• This paper by Koh and Liang in ICML 2017 from Percy Liang’s group at Stanford introduces influence functions that originated from robust statistics to explain individual instance predictions.
• The method utilizes the inverse of the second-order derivative (Hessian matrix) to calculate an approximation of the empirical risk.
• Although the authors propose a few approximation methodologies to calculate the inverse Hessian matrix, the amount of computation involved in this calculation is a drawback of the work.
• Additionally, as discussed in the TracIn (Pruthi et al.) paper, the optimality condition for the approximation (with respect to the empirical risk) is hard to achieve in practice, especially for complicated deep neural networks.
• As shown in the experimental part, this work can be used to identify influential training data points for the model, and the authors showed that this method could be further extended to several use cases, including understanding model behaviors as well as the influence of adversarial examples, detecting the mismatch between training distribution and test distribution, and identifying mislabelled data points.
• GitHub repo.

#### 2018

##### Model Cards for Model Reporting
• Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, they recommend that released models be accompanied by documentation detailing their performance characteristics.
• This paper by Mitchell et al. from Google and UofT proposes a framework that they call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
• While they focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, they provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. They propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works.
##### Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
• The following paper summary has been contributed by Zhibo Zhang.
• Many existing works on explainability focus on feature attribution, which attributes an importance score to each individual input feature. However, the individual features themselves do not necessarily have semantic meanings.
• This paper by Kim et al. from Google in ICML 2018 introduced concept-based explanations using Concept Activation Vectors (CAVs) in neural networks to capture the importance of human-friendly high-level concepts.
• This methodology adopts two sets of input examples - one set that contains instances with the concept of interest, another set that contains instances without the concept of interest. The class activation vector is defined to be the vector that is orthogonal to the linear classifier that separates the intermediate representations of the two sets of data instances. The sensitivity of a particular data class (for e.g., the zebra class, as in the paper) with respect to the concept in question (e.g., the ‘striped’ concept) can then be calculated using a directional derivative.
• The drawback of this approach is that a linear classifier needs to be trained separately for each concept through the set of examples collected, which implies incurring extra time in collecting representative data instances and training the classifier.
• The authors showed several use cases that adopted TCAV (Testing with Concept Activation Vectors) to better understand the learned model and predictions, including sorting images by similarity with respect to a concept of interest. The authors also conducted quantitative sanity checks through adding captions to the image and tuning the probability of noise in the captions, which showed that the concepts captured by TCAV closely matches what neural network focuses on to make predictions.
• GitHub repo.
##### Representer Point Selection for Explaining Deep Neural Networks
• The following paper summary has been contributed by Zhibo Zhang.
• This paper by Yeh et al. from CMU introduces a method for selecting representer points for any given instance prediction. Relying on the representer theorem, the pre-activation value of the individual data instance can be decomposed into a linear combination of the training points’ activations. The weight corresponds to either positive contributions (if the weight is positive) or negative contributions (if the weight is negative) towards the prediction of the data instance in question.
• Through experiments, the authors of the paper showed that this method can be used to efficiently detect and fix mislabelled training data points. It outperformed influence functions by 2% on test accuracy score with the same amount of training data (by fixing the mislabelled ones detected in those data) on the CIFAR-10 dataset. In addition, the authors showed that Representer Point Selection is capable of picking out more representative positive and negative examples for given data instances compared to influence functions from a qualitative perspective. Thus, this method can also be used by machine learning experts to understand misclassified examples.
• Furthermore, compared to influence functions, Representer Point Selection is much faster in practice.
• GitHub repo.
##### Mixed Precision Training
• Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increases.
• This paper by Narang et al. from Baidu Research and Nvidia in ICLR 2018 introduces a technique to train deep neural networks using half precision floating point numbers. In their technique, weights, activations and gradients are stored in IEEE half-precision format. Half-precision floating numbers have limited numerical range compared to single-precision numbers.
• They propose two techniques to handle this loss of information. Firstly, they recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Secondly, they propose scaling the loss appropriately to handle the loss of information with half-precision gradients.
• They demonstrate that the latter approach works for a wide variety of large scale models including convolution neural networks, recurrent neural networks, and generative adversarial networks with more than 100 million parameters trained on large datasets. For certain models with a large number of small gradient values, this loss/gradient scaling method helps them converge to the same accuracy as FP32 baseline models.
• Mixed precision training is an important technique that allows us to reduce the memory consumption as well as time spent in memory and arithmetic operations of deep neural networks. They demonstrate that many different deep learning models can be trained using this technique with no loss in accuracy without any hyper-parameter tuning. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. For half-precision optimized hardware, they can also expect a significant computation speedup using half-precision hardware units.
• DNN operations benchmarked with DeepBench on Volta GPU see 2-6x speedups compared to FP32 implementations if they are limited by memory or arithmetic bandwidth. Speedups are lower when operations are latency-limited.

#### 2019

##### Similarity of Neural Network Representations Revisited
• Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. Measuring similarity between the representations learned by neural networks is an ill-defined problem, since it is not entirely clear what aspects of the representation a similarity. index should focus on. Previous work has suggested that there is little similarity between intermediate layers of neural networks trained from different random initializations.
• This paper by Kornblith et al. from Hinton’s lab at UofT in ICML 2019 examines methods for comparing neural network representations based on canonical correlation analysis (CCA).
• They show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points.
• They introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA.
• CKA is a method for comparing representations of neural networks, and show that it consistently identifies correspondences between layers, not only in the same network trained from different initializations, but across entirely different architectures, whereas other methods do not. They also provide a unified framework for understanding the space of similarity indexes, as well as an empirical framework for evaluation.
• CKA captures intuitive notions of similarity, i.e. that neural networks trained from different initializations should be similar to each other. However, it remains an open question whether there exist kernels beyond the linear and RBF kernels that would be better for analyzing neural network representations.
• Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.

#### 2020

##### Estimating Training Data Influence by Tracing Gradient Descent
• The following paper summary has been contributed by Zhibo Zhang.
• This paper by Garima et al. from Google in NeurIPS 2020 introduces a method called TracIn that computes the influence of a training example on a prediction made by the model by keeping track of the gradient information along the training process.
• Specifically, this method measures the changes of the loss on a given test point in question before utilizing a specific training data instance and after utilizing this instance. The authors provide a first-order approximation of this calculation and also extend the methodology to the mini-batch setting.
• The authors conducted various experimental validations. Quantitatively, when increasing the fraction of training data that is allowed to be checked, TracIn very consistently outperforms influence functions and Representer Point Selection in terms of the fraction of mislabelled examples detected on the CIFAR-10 (a maximum of around 20% outperformance) as well as the MNIST (a maximum of over 10% outperformance) datasets. Qualitatively, the authors showed that TracIn was able to effectively pick out proponents (examples whose influence scores are positive) and opponents (examples whose influence scores are negative) for given data instances on a text classification task as well as an image classification task.
• GitHub repo

### LEEP - Log Expected Empirical Prediction

• LEEP - Log Expected Empirical Prediction by Nguyen et al. from Amazon Web Services and Facebook AI in ICML 2020 proposes to measure the transferability from the source dataset to the target dataset by evaluating the log likelihood of the correct prediction on the target dataset (without requiring to re-train a model on the target dataset). The individual probability of the correct prediction on the target dataset is calculated through a predictive distribution based on two conditional probabilities:
1. The probability of the dummy label based on the categorical distribution of the trained model (trained on the source dataset) evaluated on the input of the target dataset.
2. The conditional density of the target dataset’s label given the dummy label from the previous step. The predictive distribution is then evaluated through integrating over all possible dummy labels.
• More on LEEP in the comparative analysis between LEEP and OTDD below.

### OTDD - Optimal Transport Dataset Distance

• OTDD - Optimal Transport Dataset Distance by Alvarez-Melis et al. from Microsoft Research in NeurIPS 2020 proposes to measure distances between datasets through optimal transport as an estimation for transferability. Ideally, smaller distance indicates better transferability.
• Compared to LEEP, OTDD does not require training a model on the source dataset. It only needs the feature-label pairs of the two datasets. Specifically, the distance measure is composed of two parts:
1. The distance between feature vectors of the two datasets.
2. The distance between the labels of the two datasets, where each label is represented by the distribution of the associated feature vectors.
• However, the drawback of the OTDD approach is obvious. Wasserstein distance is known to be computationally expensive. Therefore, OTDD needs to rely on approximation algorithms. Although the authors propose that it is possible to use Gaussian distribution as the modeling choice for the feature vector distribution under each label so that the 2-Wasserstein distance can be calculated through an analytic form, the approximation of this approach is too coarse. In comparison, the LEEP approach only involves one iteration of trained model inference on the target dataset to acquire the dummy label distribution.
• In terms of experiments, both LEEP and OTDD validated the statistical correlation between their proposed transferability estimation approaches and the model performance on the target dataset on several transfer learning tasks. Specifically, the LEEP approach witnessed larger than 0.94 correlation coefficients between the LEEP score and the test accuracy (closer to 1 correlation coefficient indicates better transferability measurement) when transferring from the ImageNet dataset to the CIFAR-100 dataset and from the CIFAR-10 dataset to the CIFAR-100 dataset. The OTDD approach witnessed -0.85 correlation between the dataset distance and the relative drop in test error (closer to -1 correlation coefficient indicates better distance measure) when transferring from the MNIST dataset (with augmentations) to the USPS dataset. However, when not performing augmentations, the correlation when transferring among the MNIST dataset, its variations and the USPS dataset is only -0.59 for OTDD.
• Overall, neither LEEP and OTDD require re-training a model on the target dataset.

#### 2021

##### Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
• Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all have increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. Training and deploying models involves making either implicit or explicit decisions about efficiency.
• This paper by Menghani from Google Research in 2022 motivates the problem of efficiency in deep learning, followed by a thorough survey of the seminal work in core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware). They lay out a mental model for the readers to wrap their heads around the multiple focus areas of model efficiency and optimization, thereby offering the reader an opportunity to understand the state-of-the-art, apply these techniques in the modelling process, and/or use them as a starting point for exploration
• They also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. They believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support.
• Finally, they present a section of explicit and actionable insights supplemented by code, for a practitioner to use as a guide in this space. This section will hopefully give concrete and actionable takeaways, as well as tradeoffs to think about when optimizing a model for training and deployment.
##### Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
• This paper by Nguyen et al. from Google Research in ICLR 2021 performs a systematic study of the similarity between wide and deep networks from the same architectural family through the lens of their hidden representations and final outputs.
• As either width or depth increases relative to the size of the dataset, analysis of hidden representations reveals the emergence of a characteristic block structure that reflects the similarity of a dominant first principal component, propagated across many network hidden layers. Put simply, they establish a connection between this phenomenon and model over-parameterization.
• Comparisons across models demonstrate that those without the block structure show significant similarity between representations in corresponding layers, but those containing the block structure exhibit highly dissimilar representations.
• In other words, while the block structure is unique to each model, other learned features are shared across different initializations and architectures, particularly across relative depths of the network.
• These properties of the internal representations in turn translate to systematically different errors at the class and example levels for wide and deep models when they are evaluated on the same test set.

##### Using AntiPatterns to avoid MLOps Mistakes
• This paper by Muralidhar et al. from Virgina Tech and The Bank of New York Mellon in 2021 describes lessons learned from developing and deploying machine learning models at scale across the enterprise in a range of financial analytics applications. These lessons are presented in the form of antipatterns. Just as design patterns codify best software engineering practices, antipatterns provide a vocabulary to describe defective practices and methodologies.
• They catalog and document numerous antipatterns in financial ML operations (MLOps). Some antipatterns are due to technical errors, while others are due to not having sufficient knowledge of the surrounding context in which ML results are used. By providing a common vocabulary to discuss these situations, our intent is that antipatterns will support better documentation of issues, rapid communication between stakeholders, and faster resolution of problems. In addition to cataloging antipatterns, they describe solutions, best practices, and future directions toward MLOps maturity.
• Their recommendations for operationalizing lessons learnt in a production financial ML setting are:
• Use AntiPatterns presented here to document a model management process to avoid costly but routine mistakes in model development, deployment, and approval.
• Use assertions to track data quality across the enterprise. This is crucial since ML models can be so dependent on faulty or noisy data and suitable checks and balances can ensure a safe operating environment for ML algorithms.
• Document data lineage along with transformations to support creation of ‘audit trails’ so models can be situated back in time and in specific data slices for re-training or re-tuning.
• Use ensembles to maintain a palette of models including remedial and compensatory pipelines in the event of errors.
• Track model histories through the lifecycle of an application.
• Ensure human-in-the-loop operational capability at multiple levels.
• Overall, the model development and management pipeline in typical organizations supports four classes of stakeholders:
• The data steward (who holds custody of datasets and sets performance standards),
• The model developer (an ML person who designs algorithms),
• The model engineer (who places models in production and tracks performance), and
• The model certification authority (group of professionals who ensure compliance with standards and risk levels).
• Bringing such multiple stakeholder groups together ensures a structured process where benefits and risks of ML models are well documented and understood at all stages of development and deployment

#### 2022

##### Pathways: Asynchronous Distributed Dataflow for ML
• This paper by Barham et al. from Google in MLSys 2022 presents the design of Pathways, a new large scale orchestration layer for accelerators. Pathways is explicitly designed to enable exploration of new systems and ML research ideas, while matching state-of-the-art multi-controller performance on current ML models which are single-tenant SPMD.
• Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. Pathways upends the execution model of JAX programs, pulling user code back into a single-controller model, and interposing a centralized resource management and scheduling framework between client and accelerators. The single-controller programming model allows users simple access to much richer computation patterns. The resource management and scheduling layer permits the reintroduction of cluster management policies including multi-tenant sharing, virtualization and elasticity, all tailored to the requirements of ML workloads and accelerators.
• Their micro-benchmarks show interleaving of concurrent client workloads, and efficient pipelined execution, convincingly demonstrating that the system mechanisms they have built are fast and flexible, and form a solid basis for research into novel policies to make use of them. They demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
• They have shown that careful system design and engineering lets them “get the best of both worlds”, matching performance on today’s ML models while delivering the features needed to write the models of tomorrow.
##### PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
• Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets.
• This paper by Leng et al. from in ICLR 2022 proposes a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions, motivated by how functions can be approximated via Taylor expansion. Under polynomial expansion, focal loss is a horizontal shift of the polynomial coefficients compared to the cross-entropy loss. Motivated by this new insight, they explore an alternative dimension, i.e., vertically modify the polynomial coefficients.
• PolyLoss allows flexible ways of changing the loss function shape by adjusting the polynomial coefficients depending on the targeting tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases.
• Extensive experimental results show that the optimal choice within the PolyLoss is indeed dependent on the task and dataset.
• By simply adjusting the coefficient of the leading polynomial coefficient with just one extra hyperparameter and adding one line of code, the Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.
• The following table from the paper shows the magnitude by which PolyLoss outperforms cross-entropy and focal loss on various models and tasks:

##### Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
• Despite their wide adoption, the underlying training and memorization dynamics of very large language models is not well understood. They empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process.
• This paper by Tirumala et al. from Meta AI Research in 2022 study the properties of memorization dynamics over language model training and demonstrate that larger models memorize faster.
• They also measure the properties of forgetting curves and surprisingly find that forgetting reaches a baseline, which again increases with the model scale. Combined with memorization analyses that expose the unintuitive behavior of language models, they hope to motivate considering memorization as a critical metric when increasing language model scale.
• They implicitly focus on information that is sensitive if outputted verbatim (phone numbers, SSNs, addresses, medical diagnoses, etc.), rather than capturing all aspects of privacy (for e.g., synonyms).
• It is also known that text data used for training language models contain certain biases and stereotypes; therefore, their work has similar implications for how long language models can train before they definitively memorize these biases from training data.
• They measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings.
• Surprisingly, they show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process.
• They also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; they hypothesize and provide empirical evidence that nouns and numbers act as a unique identifier for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.
• Their work highlights the importance of analyzing memorization dynamics as they scale up language models, instead of only reporting cross entropy. Cross-entropy loss and memorization capture different behavior — for example, in many of our memory degradation experiments, even though memorization approaches a baseline, they observe that perplexity still increases. This implies that the model is becoming unconfident about the exact predictions, which they can only conclude because they inspect memorization dynamics along with the loss. Similarly, there are multiple instances where they uncover interesting behavior because they focus on memorization dynamics, rather than focusing only on just the cross-entropy loss.
##### Federated Learning with Buffered Asynchronous Aggregation
• Scalability and privacy are two critical concerns for cross-device federated learning (FL) systems.
• This paper by Nguyen et al. from Meta AI in AISTATS 2022 identifies that synchronous FL - synchronized aggregation of client updates in FL - cannot scale efficiently beyond a few hundred clients training in parallel, and leads to diminishing returns in model performance and training speed, analogous to large-batch training. To address the scalability issue, they propose FedBuff, an asynchronous FL training scheme with buffered aggregation, which offers an asynchronous aggregation of client updates in FL (i.e., asynchronous FL). Compared to SyncFL proposals, FedBuff scales to large values of concurrency.
• However, aggregating individual client updates is incompatible with secure aggregation, which could result in an undesirable level of privacy for the system. To address these concerns, they propose a novel buffered asynchronous aggregation method, FedBuff, that is agnostic to the choice of optimizer, and combines the best properties of synchronous and asynchronous FL. Compared to AsyncFL proposals, FedBuff is more private as it is compatible with SecAgg and differential privacy.
• They empirically demonstrate that FedBuff is 3.3x more efficient than synchronous FL and up to 2.5x more efficient than asynchronous FL, while being compatible with privacy-preserving technologies such as secure aggregation and differential privacy.
• They provide theoretical convergence guarantees in a smooth non-convex setting. Finally, they show that under differentially private training, FedBuff can outperform FedAvgM at low privacy settings and achieve the same utility for higher privacy settings.
##### Applied Federated Learning: Architectural Design for Robust and Efficient Learning in Privacy Aware Settings
• The classical machine learning paradigm requires the aggregation of user data in a central location where machine learning practitioners can preprocess data, calculate features, tune models and evaluate performance. The advantage of this approach includes leveraging high performance hardware (such as GPUs) and the ability of machine learning practitioners to do in depth data analysis to improve model performance.
• However, these advantages may come at a cost to data privacy. User data is collected, aggregated, and stored on centralized servers for model development. Centralization of data poses risks, including a heightened risk of internal and external security incidents as well as accidental data misuse. Federated learning with differential privacy is designed to avoid the server-side centralization pitfall by bringing the ML learning step to users’ devices.
• Learning is done in a federated manner where each mobile device runs a training loop on a local copy of a model. Updates from on-device models are sent to the server via encrypted communication and through differential privacy to improve the global model. In this paradigm, users’ personal data remains on their devices. Surprisingly, model training in this manner comes at a fairly minimal degradation in model performance.
• This paper by Stojkovic from Meta in 2022 presents an architecture to address several challenges unique to productionizing federated machine learning with differential privacy owing to its distributed nature, heterogeneous compute environments and lack of data visibility. These challenges include label balancing, slow release cycles, low device participation rate, privacy-preserving system logging, model metric calculation and feature normalization.
• This paper concludes with results demonstrating the effectiveness of the proposed architecture.
• While this architecture is capable of successfully training and potentially deploying production federated learning models, there are several challenges left to future work. Specifically, developer speed remains one of the largest barriers to scaling production-grade federated machine learning. Current iterations of model development are several orders of magnitude slower when compared to similar sized undertakings within a centralized environment.
##### Operationalizing Machine Learning: An Interview Study
• Organizations rely on machine learning engineers (MLEs) to operationalize ML, i.e., deploy and maintain ML pipelines in production. The process of operationalizing ML, or MLOps, consists of a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production. When considered together, these responsibilities seem staggering – how does anyone do MLOps, what are the unaddressed challenges, and what are the implications for tool builders?
• This paper by Shankar et al. from UC Berkeley in 2022 presented results from semi-structured ethnographic interviews with 18 MLEs working spanning different organizations and applications to understand their workflow, best practices, and challenges – including chatbots, autonomous vehicles, and finance. Our interviews expose three variables that govern success for a production ML deployment: high velocity, validating as early as possible, and maintaining multiple versions of models for minimal production downtime.
• They summarize common practices for successful ML experimentation, deployment, and sustaining production performance. Finally, they discuss MLOps pain points and anti-patterns discovered in our interviews to inspire new MLOps tooling and research ideas.
##### A/B Testing Intuition Busters
• A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas.
• This paper by Kohavi et al. in KDD ‘22 goes over common misunderstandings in online controlled experiments.
• While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often badly so.
• They describe these misunderstandings, the “intuition” behind them, and to explain and bust that intuition with solid statistical reasoning. They provide recommendations that experimentation platform designers can implement to make it harder for experimenters to make these intuitive mistakes.
##### Effect of scale on catastrophic forgetting in neural networks
• Catastrophic forgetting describes the phenomenon that the performance of neural networks degrades on the previous tasks once trained on new tasks.
• This paper by Ramasesh et al. from Google Research in ICLR 2022 empirically validated the effect of neural network pre-training on catastrophic forgetting.
• The key conclusion from the paper is that pre-training large image classification models can help mitigate forgetting in sequential tasks.
• Suppose there are two sequential tasks, denoted as Task A and Task B, which could be split CIFAR-10 or CIFAR-100 image classification tasks, according to the paper. The authors also studied a few other datasets beyond CIFAR. Ideally, a model that is robust to forgetting should observe as good performance as possible on Task B without sacrificing the performance on Task A.
• In order to validate the effect of pre-training on overcoming catastrophic forgetting, the authors adopted ResNet models pre-trained on the ImageNet21k dataset, sequentially fine-tuned them on Task A and Task B. Compared to the trained-from-scratch counterparts, the pre-trained models witnessed an overall trend of better accuracies on Task B for given accuracies of Task A.
• In order to validate the effect of model size on overcoming catastrophic forgetting, the authors used Vision Transformers of various sizes and ResNets of various sizes. Both under the case of Vision Transformers and under the case of ResNets, when maintaining a given performance on Task A, larger models exhibited an overall trend of better achievable performance on Task B.
• In addition, the authors empirically showed that the pre-trained model provided more orthogonal representations among distinct classes compared to its trained-from-scratch counterpart, which can explain the fact that the pre-trained models are more robust to forgetting.

### RL

#### 2022

##### Transdreamer: Reinforcement Learning With Transformer World Models
• The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so.
• This paper from Chen et al. from Rutgers and KAIST in 2022 proposes a TransDreamer, a transformer-based MBRL agent. They first introduce the Transformer State-Space Model (TSSM), the first transformer-based stochastic world model that leverages a transformer for dynamics predictions. Then, they share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent.
• TransDreamer shows comparable performance with Dreamer on DMC and Atari tasks that do not require long-term memory. However, when the proposed model is applied to Hidden Order Discovery involving both 2D visual RL and 3D first-person visual RL, which require long-range memory access for memory-based reasoning (i.e, long-term complex memory interactions), the proposed model outperforms Dreamer in these complex tasks.
• They also show that image generation and reward prediction of TSSM is better than Dreamer qualitatively and quantitatively.

## Selected Papers / Good-to-know

### Vision

#### 2015

##### Learning Deep Features for Discriminative Localization
• This paper by Zhou et al. from MIT CSAIL in 2015 is an explainable-AI method that seeks to answer what vision models “see” in images. They propose Class Activation Maps (CAM) which is a nifty visualization technique originally introduced for CNNs where the predicted class score is mapped back to the previous convolutional layer to generate the CAM. The CAM highlights the class-specific discriminative regions used by CNN to identify the category or class.
• They revisit the global average pooling layer proposed earlier, and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. This enables classification-trained CNNs to learn to perform object localization, without using any bounding box annotations. While this technique was previously proposed as a means for regularizing training, they find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
• Furthermore they demonstrate that the CAM localization technique generalizes to other visual recognition tasks i.e., our technique produces generic localizable deep features that can aid other researchers in understanding the basis of discrimination used by CNNs for their tasks.
• Later, there were several variants of similar explainable-AI methods (such as GradCAM, Saliency Maps and Integrated Gradients) that were introduced.
• Despite the apparent simplicity of global average pooling, they are able to achieve 37.1% top-5 error for on weakly supervised object localization on the ILSVRC 2014 benchmark, demonstrating that our global average pooling CNNs can perform accurate object localization. Note that this is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach.
• They demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
• Unrelated to the paper but a similar approach for vision transformers was recently proposed. CNN uses pixel arrays, whereas ViT splits the images into patches, i.e., visual tokens. The visual transformer divides an image into fixed-size patches, correctly embeds each of them, and includes positional embedding as an input to the transformer encoder. So CAM will indicate what regions of the image the [CLS] token will use to discriminate between classes. Usage example.
• The figure below from Prithvi Da summarizes the approaches using ViT, but the same approach is applicable to other vision-based transformers such as DEiT, BEiT etc. as well.

#### 2016

##### Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
• This paper by Luo et al. from UofT in NeurIPS 2016 studied the characteristics of the receptive field of units in CNNs and introduced the concept of effective receptive field.

#### 2017

##### Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
• This paper by Carreira and Zisserman from Google in CVPR 2017 introduced a new two-stream Inflated 3D ConvNet (I3D) architecture that incorporated both optical flow and RGB paths by inflating filters and pooling kernels of very deep image classification ConvNets from 2D to 3D, making it possible to learn seamless spatio-temporal feature extractors from video.
##### Densely Connected Convolutional Networks
• This paper by Huang et al. from Cornell, Tsinghua and FAIR in CVPR 2017 that skip-connected all layers with the main difference with ResNets being that they performed concatenation-based skip connections instead of addition-based skip connections (as in ResNet).
• The core idea behind DenseNet is feature reuse, which leads to very compact models. As a result it requires fewer parameters than other CNNs, as there are no repeated feature-maps.
• They work around two concerns:
• The feature maps have to be of the same size.
• The concatenation with all the previous feature maps may result in memory explosion.
• To address the first issue they propose two solutions:
• Use conv layers with appropriate padding that maintain spatial dimensions (as in InceptionNet) or
• Use dense skip connectivity only inside blocks called dense blocks.

#### 2018

##### Neural Discrete Representation Learning
• This paper by Oord et al. from DeepMind in NeurIPS 2018 proposed the Vector Quantised-Variational AutoEncoder (VQ-VAE), a simple yet powerful generative model that combine VAEs with vector quantisation (VQ) to obtain a discrete latent representations.
• VQ-VAE differs from VAEs in two key ways: (i) the encoder network outputs discrete, rather than continuous, codes, and (ii) the prior is learnt rather than static.
• In order to learn a discrete latent representation, they incorporate ideas from VQ. Using the VQ method allows the model to circumvent issues of “posterior collapse” – where the latents are ignored when they are paired with a powerful autoregressive decoder – typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
• They show that VQ-VAEs are capable of modelling very long term dependencies through their compressed discrete latent space which they have demonstrated by generating 128 x 128 colour images, sampling action conditional video sequences and finally using audio where even an unconditional model can generate surprisingly meaningful chunks of speech and doing speaker conversion. All these experiments demonstrated that the discrete latent space learnt by VQ-VAEs capture important features of the data in a completely unsupervised manner.
• Moreover, VQ-VAEs achieve likelihoods that are almost as good as their continuous latent variable counterparts on CIFAR10 data. They believe that this is the first discrete latent variable model that can successfully model long range sequences and fully unsupervisedly learn high-level speech descriptors that are closely related to phonemes.

#### 2019

##### EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
• This paper by Tan and Le from Google in ICML 2019 introduced EfficientNet which is all about engineering and scale. It proves that if you carefully design your architecture you can achieve top results with reasonable parameters. It’s incredible that EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152 with better accuracy!
• Ideas from the paper:
• With more layers (depth), one can capture richer and more complex features, but such models are hard to train (due to vanishing gradients).
• Wider networks are much easier to train. They tend to be able to capture more fine-grained features but saturate quickly.
• By training with higher resolution images, CNNs are able to capture more fine-grained details. Again, the accuracy gain diminishes for quite high resolutions.
• Instead of finding the best architecture, the authors proposed to start with a relatively small baseline model and gradually scale up network depth (more layers), width (more channels per layer), resolution (input image) simultaneously using a technique called compound scaling that they propose.

#### 2020

##### Taming Transformers for High-Resolution Image Synthesis
• Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. They demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
• This paper by Esser et al. from the Heidelberg Collaboratory for Image Processing in 2020 proposed VQGAN which addresses the fundamental challenges that previously confined transformers to low-resolution images. VQGAN shows how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images.
• VQGAN represents images as a composition of perceptually rich image constituents and thereby overcomes the infeasible quadratic complexity when modeling images directly in pixel space. Their approach uses a convolutional generator to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method introduces the efficiency of convolutional approaches to transformer based high resolution image synthesis.
• Modeling constituents with a CNN architecture and their compositions with a transformer architecture taps into the full potential of their complementary strengths and thereby allows VQGAN to represent the first results on high-resolution image synthesis with a transformer-based architecture.
• VQGAN is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image.
• VQGAN demonstrates the efficiency of convolutional inductive biases and the expressivity of transformers by performing semantically-guided synthesis of megapixel images and outperforming state-of-the-art convolutional approaches and autoregressive models on class-conditional ImageNet.
• Code and pretrained models can be found here.

##### Self-training with Noisy Student improves ImageNet classification
• This paper by Xie et al. from Google and CMU in CVPR 2020 introduced teacher-student training. The paper proposed an iterative semi-supervised method using 300M unlabeled images called “noisy student training” which can be described in 4 steps:
• Train a teacher model on labeled images.
• Use the teacher to generate labels on 300M unlabeled images (pseudo-labels).
• Train a student model on the combination of labeled images and pseudo labeled images.
• Iterate from step 1, by treating the student as a teacher. Re-infer the unlabeled data and train a new student from scratch.
##### Big Transfer (BiT): General Visual Representation Learning
• This paper by Kolesnikov et al. from Google in ECCV 2020 introduced BiT which is a scalable ResNet-based model for efficient image pre-training.
• They develop 3 BiT models (small, medium, and large) based on ResNet-152. For the large variation of BiT they used ResNet152x4, which means that each layer has 4 times more channels. They pretrained that model using far larger datasets than ImageNet. Specifically, the largest model was trained on the insanely large JFT dataset, which consists of 300M labeled images.
• The major contribution in the architecture is the choice of normalization layers – the authors replace batch normalization with group normalization and weight standardization.
##### Multi-modal Dense Video Captioning
• This paper by Iashin and Rahtu from Tampere University in CVPR Workshops 2020 introduced multi-modal dense video captioning.
##### Efficient Saliency Maps for Explainable AI
• This paper by Mundhenk et al. from Lawrence Livermore National Lab and UC Berkeley in 2020 describes an explainable AI saliency map method for use with deep CNNs that is much more efficient than popular fine-resolution gradient methods. It is also quantitatively similar or better in accuracy.
• Their technique works by measuring information at the end of each network scale which is then combined into a single saliency map. They describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. They visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods.
• Using their method instead of Guided Backprop, coarse-resolution class activation methods such as Grad-CAM and GradCAM++ seem to yield demonstrably superior results without sacrificing speed. This will make fine-resolution saliency methods feasible on resource limited platforms such as robots, cell phones, low-cost industrial devices, astronomy and satellite imagery.

#### 2021

##### Finetuning Pretrained Transformers into RNNs
• This paper by Kasai et al. from UWash, Microsoft, DeepMind, and Allen AI in 2021 presented an idea of converting pre-trained transformers into RNNs, lowering memory cost while retaining high accuracy.
• SyncedReview’s article.
##### VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
• This paper by Akbari et al. from Google, Columbia, and Cornell in 2021 explored learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Furthermore, they also study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities.
##### Self-supervised learning for fast and scalable time series hyper-parameter tuning.
• This paper by Zhang et al from Facebook in 2021 proposed a new self-supervised learning framework for model selection and hyperparameter tuning, which provides accurate forecasts with less computational time and resources.
##### Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More
• This paper by Daghaghi et al. from Rice University in MLSys 2021 presented a CPU algorithm using locality sensitive hashing that trains deep neural networks up to 15 times faster than top GPU trainers.
##### Emerging Properties in Self-Supervised Vision Transformers
• This paper by Caron et al. from Facebook in 2021 proposed DINO, a self-supervised vision transformer-based model that can discover and segment objects in an image or a video with absolutely no supervision and without being given a segmentation-targeted objective.
• DINO works by interpreting self-supervision as a special case of self-distillation, where no labels are used at all. DINO is trained as a student network by simply matching the output of a teacher network over different views of the same image. By discovering object parts and shared characteristics across images, DINO learns a feature space that organizes itself in an interpretable way, with similar categories landing near one another. This suggests that DINO managed to connect categories based on visual properties, a bit like humans do.
• TechCrunch’s article and Facebook AI article.
##### Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples
• This paper by Assran et al. from Facebook in 2021 proposed PAWS, which combines some of the ideas of semi-supervised learning with the more traditional supervised method, essentially giving the training a boost by letting it learn from both labeled and unlabeled data.
• PAWS is a method for semi-supervised learning that builds on the principles of self-supervised distance-metric learning. PAWS pre-trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled image are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and labeled representations is used to provide a weighting over class labels, which they interpret as a soft pseudo-label. By non-parametrically incorporating labeled samples in this way, PAWS extended the distance-metric loss used in self-supervised methods such as BYOL and SwAV to the semi-supervised setting.
##### Enhancing Photorealism Enhancement
• This paper by Richter et al. from Intel Labs in 2021 proposed an approach to enhancing the realism of synthetic images using a convolutional network that leverages intermediate representations produced by conventional rendering pipelines. The network is trained via a novel adversarial objective, which provides strong supervision at multiple perceptual levels.
• The authors analyzed scene layout distributions in commonly used datasets and find that they differ in important ways. They hypothesize that this is one of the causes of strong artifacts that can be observed in the results of many prior methods. To address this, they propose a new strategy for sampling image patches during training.
• Intel Lab’s article with sample A/B results and videos from the paper. Also, The Verge’s article on the idea.
##### FNet: Mixing Tokens with Fourier Transforms
• This paper by Lee-Thorp et al. from Google in 2021 proposed replacing the self-attention sublayers with simple linear transformations that “mix” input tokens to significantly speed up the transformer encoder with limited accuracy cost.
• More surprisingly, the team discovers that replacing the self-attention sublayer with a standard, unparameterized Fourier Transform achieves 92 percent of the accuracy of BERT on the GLUE benchmark, with training times that are seven times faster on GPUs and twice as fast on TPUs.
• SynedReview’s article on the idea.
##### Are Convolutional Neural Networks or Transformers more like human vision?
• This paper by Tuli et al. from Princeton University, DeepMind, and UC Berkeley explored the extent to which different vision models correlate with human vision from an error consistency point-of-view. They conclude that the recently proposed Vision Transformer (ViT) networks not only outperform CNNs on accuracy for image classification tasks, but also have higher shape bias and are largely more consistent with human errors.
##### RegNet: Self-Regulated Network for Image Classification
• The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradient flow through building blocks, the simple shortcut connection mechanism limits the ability of re-exploring new potentially complementary features due to the additive function.
• This paper by Xu et al. in 2021 from Harbin Institute of Technology, University of Electronic Science and Technology of China, Singapore Management University, and Sichuan University addresses this issue by proposing a regulator module as a memory mechanism to extract complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., Convolutional LSTMs or Convolutional GRUs), which are shown to be good at extracting spatio-temporal information. They named the new regulated networks as RegNet.
• The regulator module can be easily implemented and appended to any ResNet architecture. They also apply the regulator module for improving the squeeze-and-Excitation ResNet to show the generalization ability of our method. Experimental results on three image classification datasets have demonstrated the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.
##### Multiscale Vision Transformers
• This paper by Fan et al. from Facebook AI and UC Berkeley presents Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
• Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity/feature complexity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
• They evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms single-scale vision transformers for video and image recognition that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. They further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. In empirical evaluation, MViT shows a fundamental advantage over single-scale vision transformers for video and image recognition.
• Github repo.
##### Lossy Compression for Lossless Prediction
• The following paper summary has been contributed by Zhibo Zhang.
• This paper by Dubois et al. from Vector Institute, UofT, UBC and Facebook AI Research in NeurIPS 2021 introduces a neural compressor that saves a large amount of bit rate while preserving the downstream classification task performance.
• The authors calculated the bits needed to maintain high downstream task performance.
• Inspired by the rate distortion theory (Shannon et al.), an effective compression should keep the mutual information between the input and its latent representation low while maintaining the task utility. Based on this principle, the authors proposed two algorithms, VIC - Variational Invariant Compressor (similar to the variational autoencoder (Kingma et al.) and Bottleneck InfoNCE (BINCE) on top of the vanilla contrastive learning framework with InfoNCE loss (Oord et al.). The loss functions for both algorithms, which include minimizing the entropy of the latent representation variable, aim at removing unrelated information from the input data while reserving task-relevant information.
• The authors conducted controlled experiments on the STL10 dataset. The Variational Invariant Compressor method achieved huge compression gains compared to the PNG format (269 and 175 times compression gains when using the reconstruction and the latent representation to do the downstream task prediction accordingly) while leading to a huge drop in test accuracy (25.1 when using the reconstructed input for the downstream task). In comparison, the Bottleneck InfoNCE algorithm was able to achieve 121 times compression gains compared to the PNG format while observing no drop in test accuracy.
• The authors also applied the Bottleneck InfoNCE compressor on top of the pre-trained CLIP model (Radford et al.) and performed experiments on eight different datasets. The entropy bottleneck together with the CLIP model brought a much better bit-rate compared to the JPEG format across different datasets and caused only a very small drop in test accuracy compared to the vanilla CLIP model.

#### 2022

##### YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications
• For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios.
• This paper by Li et al. from Meituan Inc. pushes YOLOs’ limits to the next level, stepping forward with an unwavering mindset for industry application.
• Considering the diverse requirements for speed and accuracy in the real environment, they extensively examine the up-to-date object detection advancements either from industry or academia. Specifically, they heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization, and optimization methods. On top of this, they integrate their thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases.
• YOLOv6 has a series of models for various industrial scenarios, including N/T/S/M/L, which the architectures vary considering the model size for better accuracy-speed trade-off. And some Bag-of-freebies methods are introduced to further improve the performance, such as self-distillation and more training epochs. For industrial deployment, they adopt QAT with channel-wise distillation and graph optimization to pursue extreme performance.
• YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale~(YOLOv5-S, YOLOX-S, and PPYOLOE-S). Their quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L also achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors with a similar inference speed.
• Github repo.

### NLP

#### 2015

##### Effective Approaches to Attention-based Neural Machine Translation
• This paper by Luong et al. from Manning’s lab in EMNLP 2015 described a few more attention models that offer improvements and simplifications compared to Bahdanau attention.
• They describe a few “global attention” models, the distinction between them being the way the attention scores are calculated.

#### 2018

##### Generating Wikipedia by Summarizing Long Sequences
• This paper by Liu et al. from Google Brain in ICLR 2018 shows that generating English Wikipedia articles can be approached as a multi-document summarization problem with a large, parallel dataset, and demonstrated a two-stage extractive-abstractive framework for carrying it out. They perform coarse extraction by using extractive summarization to identify salient information in the first stage and a neural decoder-only sequence transduction model for the abstractive stage, capable of handling very long input-output examples.
• For the abstractive model, they introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction, allowing them to condition on many reference documents and to generate fluent and coherent multi-sentence paragraphs and even whole Wikipedia articles.
• When given reference documents, they show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.

#### 2019

##### Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
• While BERT and RoBERTa have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS), they requires that both sentences are fed into the network, which causes a massive computational overhead. Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.
• This paper by Reimers and Gurevych from Technische Universitat Darmstad in 2019 presented Sentence-BERT (SBERT) a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
• They showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. In fact, the performance for seven STS tasks was below the performance of average GloVe embeddings.
• SBERT fine-tunes BERT in a siamese/triplet network architecture. They evaluated the quality on various common benchmarks, where it could achieve a significant improvement over state-of-the-art sentence embeddings methods. Replacing BERT with RoBERTa did not yield a significant improvement in their experiments.
• They evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods due to it being computationally efficient. On a GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks which are computationally not feasible to be modeled with BERT such as clustering of 10,000 sentences with hierarchical clustering (BERT needs 65 hours, while SBERT needs 5 seconds).

#### 2020

##### Efficient Transformers: A Survey
• Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency.
• This paper by Tay et al. from Google in 2020 characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
##### Towards a Human-like Open-Domain Chatbot
• This paper by Adiwardana et al. from Google in 2020 presented Meena, which is an end-to-end, neural conversational model that learns to respond sensibly to a given conversational context. The training objective is to minimize perplexity, the uncertainty of predicting the next token (in this context, the next word in a conversation).

#### 2021

##### Pretrained Transformers As Universal Computation Engines
• This paper by Lu et al. from UC Berkeley, FAIR, and Google Brain in 2021 investigated the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning – in particular, without finetuning of the self-attention and feedforward layers of the residual blocks and apply this model to numerical computation, vision, and protein fold prediction.
• In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, the authors showed that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks. In particular, the authors found that such pretraining enables FPT to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
• BAIR’s article; VentureBeat’s article; Yannic Kilcher’s video.
##### SimCSE: Simple Contrastive Learning of Sentence Embeddings
• This paper by Gao et al. from Princeton University and Tsinghua University in 2021 presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings on semantic textual similarity tasks.
• They first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. They find that dropout acts as minimal data augmentation and removing it leads to a representation collapse.
• Next, they propose a supervised approach utilizing NLI datasets, which incorporates annotated pairs from natural language inference datasets into their contrastive learning framework, by using “entailment” pairs as positives and “contradiction” pairs as hard negatives.
• They evaluate SimCSE on standard semantic textual similarity (STS) tasks, and their unsupervised and supervised models using BERTbase achieve an average of 76.3% and 81.6% Spearman’s correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. They also show both theoretically and empirically justify the inner workings of their approach by analyzing alignment and uniformity of SimCSE and demonstrating that their contrastive learning objective regularizes pre-trained embeddings’ anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
• The key takeaway is that their contrastive objective, especially the unsupervised one, may have a broader application in NLP. It provides a new perspective on data augmentation with text input, and can be extended to other continuous representations and integrated in language model pre-training.
##### DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
• Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant.
• This paper by Giorgi et al. from UofT in 2021 present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Similar to SimCSE, DeCLUTR learns high quality sentence embeddings in a self-supervised fashion, the quality of which are equal to or better than the ones obtained from a supervised setting.
• Inspired by recent advances in deep metric learning (DML), they design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, their approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Their experiments suggest that the quality of the learned embeddings scale with both the number of trainable parameters and the amount of unlabelled training data.
• They demonstrated the effectiveness of their objective by evaluating the learned embeddings on the SentEval benchmark, which contains a total of 28 tasks designed to evaluate the transferability and linguistic properties of sentence representations.
• Their experiments suggest that the learned embeddings’ quality can be further improved by increasing the model and train set size. Together, their results demonstrate the effectiveness and feasibility of replacing hand-labelled data with carefully designed self-supervised objectives for learning universal sentence embeddings.
• Github repo with code and pretrained models.

#### 2022

##### A Causal Lens for Controllable Text Generation
• This paper by Hu and Li from UCSD and Amazon introduces released a paper describing a novel approach to conditional text generation that leverages causal inference principles to mitigate the effects of spurious correlations.
• Conditional text modeling is hard. Natural language documents tend to contain large amounts of complex unstructured information, most of which is implicit.
• Controllable text generation concerns two fundamental tasks of wide applications, namely generating text of given attributes (i.e., attribute-conditional generation), and minimally editing existing text to possess desired attributes (i.e., text attribute transfer). Historically, problems of attribute-conditional text generation and attribute transfer were perceived as two independent tasks and approached individually and developed different conditional models which, however, are prone to producing biased text (e.g., various gender stereotypes). The authors propose a unifying causal framework to formulate controllable text generation from a principled causal perspective which models the two tasks with a unified framework for generation and transfer, based on structural causal models (SCMs).
• A direct advantage of the causal formulation is the use of rich causality tools to mitigate generation biases and improve control. They treat the two tasks as interventional and counterfactual causal inference based on a structural causal model, respectively. They propose to model attribute-conditional text generation as intervention, using Daniel Pearl’s $$do$$ operator. Hence, the attribute-conditional distribution becomes $$P(x\|do(a))$$ rather than purely association-based $$P(x\|a)$$, where: $$x$$ is a text, $$a$$ is an attribute (an intervention). Two more variables are used in the paper: $$z$$, a multidimensional latent (unobserved) confounder and $$c$$, a $$z$$’s observed proxy.
• Text attribute transfer is modeled as a conterfactual prediction, trying to answer the question: “what the text would have been if the attribute had been different?”
• Training consists of four objectives: VAE objective to learn the causal model and three counterfactual objectives.
• They apply the framework to the challenging practical setting where confounding factors (that induce spurious correlations) are observable only on a small subset (1%-5%) of training data with confounding labels for $$c$$.
• Results show that the proposed model achieves significantly better results than conventional conditional models in terms of control accuracy and reduced bias. This is true for both types of tasks: attribute-conditional generation and attribute transfer.
##### SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples
• This paper by Wang et al. from National University of Defense Technology, SenseTime and The University of Hong Kong in 2022 released a paper proposing a new contrastive sentence embedding framework called SNCSE.
• Application of contrastive learning techniques to sentence embeddings has been proved to be a great way to improve their semantic and classification properties. For a sentence, current models utilize diverse data augmentation methods to generate positive samples, while consider other independent sentences as negative samples. Then they adopt InfoNCE loss to pull the embeddings of positive pairs gathered, and push those of negative pairs scattered.
• Although these models have made great progress on sentence embedding, the authors argue that contrastive losses are not sensitive enough to distinguish and decouple textual and semantic similarity. This leads to the methods deploying traditional contrastive losses to overestimate the semantic similarity of any pairs with similar textual regardless of the actual semantic difference between them. This is because positive pairs in unsupervised contrastive learning come with similar and even the same textual meaning through data augmentation.
• Let’s take a negation. Adding a simple “not” to a sentence does not change its textual properties much, but can drastically change its semantic properties. The authors argue that traditional contrastive loss leads to feature supression, making models fail to decouple textual and semantic aspects of a sentence. To address this issue, the authors propose contrastive learning for unsupervised sentence embedding with soft negative samples (SNCSE) - samples with different semantic content (hence “negative”) and very high textual similarity (hence “soft”).
• Moreover, the authors propose an additional loss component - bidirectional margin loss (BML) - to model semantic differences between positive and soft negative samples, while retaining InfoNCE as a loss for regular positive-negative pairs. BML helps introduce soft negative examples into the traditional contrastive learning framework.
• To obtain these soft negative examples, the authors construct soft negative samples as negations of positive examples. They use a rule-based system for this purpose.
• SNCSE achieves state-of-the-art performance on semantic textual similarity (STS) task with average Spearman’s correlation coefficient of 78.97% on BERTbase and 79.23% on RoBERTabase, an improvement compared to other contrastive methods (e.g. SimCSE). Finally, they adopt rank-based error analysis method to detect the weakness of SNCSE.
• Github repo.
##### LaMDA: Language Models for Dialog Applications
• This paper by Cheng et al. from Google Brain in 2022 is an attempt to propose safe, grounded, and high-quality dialog models for open-ended applications.
• Language models are becoming more capable than ever before and are helpful in a variety of tasks — translating one language into another, summarizing a long document into a brief highlight, or answering information-seeking questions. Among these, open-domain dialog, where a model needs to be able to converse about any topic, is probably one of the most difficult, with a wide range of potential applications and open challenges. In addition to producing responses that humans judge as sensible, interesting, and specific to the context, dialog models should adhere to Responsible AI practices, and avoid making factual statements that are not supported by external information sources.
• Defining objectives and metrics is critical to guide training dialog models. LaMDA has three key objectives — Quality, Safety, and Groundedness — each of which they measure using carefully designed metrics as follows.
• Quality: They decompose Quality into three dimensions, Sensibleness, Specificity, and Interestingness (SSI), which are evaluated by human raters. Sensibleness refers to whether the model produces responses that make sense in the dialog context (e.g., no common sense mistakes, no absurd responses, and no contradictions with earlier responses). Specificity is measured by judging whether the system’s response is specific to the preceding dialog context, and not a generic response that could apply to most contexts (e.g., “ok” or “I don’t know”). Finally, Interestingness measures whether the model produces responses that are also insightful, unexpected or witty, and are therefore more likely to create better dialog.
• Safety: Safety is essential for responsible AI. Their Safety metric is composed of an illustrative set of safety objectives that captures the behavior that the model should exhibit in a dialog. These objectives attempt to constrain the model’s output to avoid any unintended results that create risks of harm for the user, and to avoid reinforcing unfair bias. For example, these objectives train the model to avoid producing outputs that contain violent or gory content, promote slurs or hateful stereotypes towards groups of people, or contain profanity. Their research towards developing a practical Safety metric represents very early work, and there is still a great deal of progress for us to make in this area.
• Groundedness: The current generation of language models often generate statements that seem plausible, but actually contradict facts established in known external sources. This motivates their study of groundedness in LaMDA. Groundedness is defined as the percentage of responses with claims about the external world that can be supported by authoritative external sources, as a share of all responses containing claims about the external world. A related metric, Informativeness, is defined as the percentage of responses with information about the external world that can be supported by known sources, as a share of all responses. Therefore, casual responses that do not carry any real world information (e.g., “That’s a great idea”), affect Informativeness but not Groundedness. While grounding LaMDA generated responses in known sources does not in itself guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source.
• With the objectives and metrics defined, they describe LaMDA’s two-stage training: pre-training and fine-tuning. In the fine-tuning stage, they train LaMDA to perform a mix of generative tasks to generate natural-language responses to given contexts, and classification tasks on whether a response is safe and high-quality, resulting in a single multi-task model that can do both. The LaMDA generator is trained to predict the next token on a dialog dataset restricted to back-and-forth dialog between two authors, while the LaMDA classifiers are trained to predict the Safety and Quality (SSI) ratings for the response in context using annotated data. During a dialog, the LaMDA generator first generates several candidate responses given the current multi-turn dialog context, and the LaMDA classifiers predict the SSI and Safety scores for every response candidate. Candidate responses with low Safety scores are first filtered out. Remaining candidates are re-ranked by their SSI scores, and the top result is selected as the response.
• They observe that LaMDA significantly outperforms the pre-trained model in every dimension and across all model sizes.
##### Causal Inference Principles for Reasoning about Commonsense Causality
• Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly, and is potentially susceptible to confounding co-occurrences.
• This paper by Zhang et al. from UPenn in 2022 articulates CCR from a completely new perspective using classical causal principles. Their contributions include (i) a novel commonsense causality framework; (ii) mitigating confounding co-occurrences by matching temporal propensities; (iii) a modular pipeline for zeroshot CCR with demonstrated effectiveness.
• They propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities on various datasets.
##### RescoreBERT: Discriminative Speech Recognition Rescoring with BERT
• Second-pass rescoring is an important component in automatic speech recognition (ASR) systems that is used to improve the outputs from a first-pass decoder by implementing a lattice rescoring or n-best re-ranking.
• While pretraining with a masked language model (MLM) objective has received great success in various natural language understanding (NLU) tasks, it has not gained traction as a rescoring model for ASR. Specifically, training a bidirectional model like BERT on a discriminative objective such as minimum WER (MWER) has not been explored.
• This paper by Xu et al. from Amazon Alexa AI in ICASSP 2022 proposes a method to train a BERT rescoring model with discriminative objective functions. They show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR.
• Specifically, they propose a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill knowledge from a pretrained model. They further propose an alternative discriminative loss.
• RescoreBERT reduces WER by 6.6%/3.4% relative on the LibriSpeech clean/other test sets over a BERT baseline without discriminative objective. They also evaluate RescoreBERT on an internal dataset from a conversational agent and find that it reduces both latency and WER (by 4%/8.3%/7.1% relative) over an LSTM rescoring model.
##### Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
• This paper by Smith et al. from Microsoft and Nvidia presents MT-NLG, a 530 billion parameter left-to-right, autoregressive, generative transformer-based language model that possesses strong in-context learning capabilities.
• Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models.
• They present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. They discuss the challenges in training neural networks at such scale and present the 3D parallelism methodology as well the hardware infrastructure used to efficiently train MT-NLG using DeepSpeed and Megatron.
• Next, they detail the training process, the design of the training corpus, and the data curation techniques, which is a key ingredient to the success of the model.
• MT-NLG achieves superior zero-/one- and few-shot learning performance on several NLP benchmarks, establishing new state-of-the-art results.
• They also analyze the social biases exhibited by MT-NLG and examine various factors that can affect in-context learning, bringing forth awareness of certain limitations of current generation of large language models.
• Microsoft blog article.
##### Extreme Compression for Pre-trained Transformers Made Simple and Efficient
• Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods.
• This paper by Wu et al. from Microsoft in 2022 derives a user-friendly celebrating recipe for extreme quantization, which allows them to achieve a larger model compression ratio and higher accuracy. They accomplish this by performing a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works.
• They carefully design and perform extensive experiments to investigate the contemporary existing extreme quantization methods for ultra-low bit precision quantization and find that they are significantly under-trained. To this end, they fine-tune pre-trained BERTbase models with various training budgets and learning rate search.
• They propose a simple yet effective compression pipeline for extreme compression, named XTC. XTC demonstrates that (1) they can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, e.g., the 6-layer TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
##### Memorizing Transformers
• The following paper summary has been contributed by Zhibo Zhang.
• When it comes to language modeling, Transformers are good at capturing the dependencies among input tokens within a context window given that they compare each pair of input tokens directly through the query-key matching process. However, many complicated tasks nowadays such as book reading require long-term dependencies that span across tens of thousands of tokens, which exceeds the size of the context window that existing Transformers can handle due to its quadratic complexity with respect to the number of input tokens.
• Memorizing Transformers by Wu et al. from Google in ICLR 2022 proposes an extension of Transformer that attends not only to the input tokens of the current context window, but also the ones from past context windows.
• As shown in the figure below, suppose context windows 1 and 2 are past context windows, and context window 3 is the current context window. The authors use a non-trainable memory to store the keys and values for the tokens from past context windows. However, the memory size will grow over time as the context window shifts. In order to avoid the model from attending to the full memory each time, an approximate kNN (k-Nearest-Neighbors) approach is adopted to select the past embedded tokens that match the most with each embedded token of the current context window.
• When predicting the next token, the authors propose a gating mechanism that performs a weighted sum between the attention outcome based on the current context window and the attention outcome based on the most relevant embedded tokens in memory.
• The authors validated the effect of external memory on five different language modeling tasks including English language books, long web articles, technical math papers, source code as well as formal theorems. It was observed that the Transformer with external memory can match the performance of a larger Transformer model without external memory in terms of perplexity scores. In addition, the authors observed that a larger external memory size would generally help the Transformer obtain lower perplexity scores.

##### Ask Me Anything: A simple strategy for prompting language models
• The following paper summary has been contributed by Zhibo Zhang.
• Prompting is a strategy that helps large language models transfer to new tasks under the natural language task specification. However, designing perfect prompts for a task requires a large amount of effort.
• This paper by Arora et al. from Chris Ré’s lab at Stanford University, Numbers Station and the University of Wisconsin-Madison proposes the Ask Me Anything Prompting (AMA) strategy that runs a collection of prompt chains, where each prompt chain is composed of question and answer prompts, as shown in the illustration figure below from the paper. The predictions of the individual prompt chains are then combined using weak supervision to produce the final prediction.
• In particular, the authors observed two empirical facts about effective prompt formats:
• Open-ended formats such as traditional QA (Question-Answering) are more effective than restrictive formats such as True or False selection.
• It is essential to map the answers of open-ended questions to specialized output categories of a given task.
• In order to evaluate the effectiveness of the AMA strategy, the authors applied Question-Answering prompt chains and Weak Supervision on top of the GPT-J-6B parameter model and compared it with the version without AMA as well as the GPT-3 175B model (with few in-context examples) on tasks spanning across natural language understanding, natural language inference, classification and question answering. The GPT-J-6B parameter model with AMA performed the best with the majority of large language models tested.
• Github repo.

### Speech

#### 2017

##### On Evaluating and Comparing Conversational Agents
• This paper by Venkatesh et al. from Amazon in 2017 proposes a comprehensive evaluation strategy using multiple metrics which correlate well with human judgement and are thus designed to reduce subjectivity, for non goal-oriented conversations. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. They show that these metrics can be used as a reasonable proxy for human judgment.
• They propose the following evaluation metrics:
• Conversational User Experience: Measure of the overall interaction experience. Conversations with a socialbot can be significantly different from those with humans because of expectations, behavior or sentiment, trust and visual cues.
• Engagement: To enable an open-ended, multi-turn conversation, engagement is critical. Engagement is a measure of interestingness in a conversation. Other models also term this as interestingness.
• Coherence: A coherent response indicates a comprehensible and relevant response to a user’s request. Other models also term this as specificity.
• Conversational Depth: Coherence is usually measured at turn level. However, in a multi-turn conversation, context may be carried over multiple turns. While evaluating conversational agents, it is important to detect the context and the depth of the conversations.
• Topical Diversity: A good conversational agent is capable of: (i) identifying the topics and keywords from a given utterance (ii) able to have conversations around the same topics and (iii) can share related concepts (iv) identification of appropriate intent.
• Domain Coverage: An agent which is able to interact on multiple domains can be considered more consistent with humans expectations.
• They provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout Amazon’s Alexa Prize competition.
• Till date, this study offers the largest setting for evaluating agents with millions of conversations and hundreds of thousands of ratings from users. They believe that this work is a step towards an automatic evaluation process for conversational AIs.

#### 2018

##### Attention-Based Models for Text-Dependent Speaker Verification
• This paper by Chowdhury et al. from Washington State and Google in 2018 proposes using attention-based models for a keyword-based text-dependent speaker verification (SV) system. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. Examples of such TD-SV phrases could be trigger keywords for voice assistants, such as “Hey Sir” or “Alexa” or “OK Google”. In this study, they focus on “OK Google” and “Hey Google”.
• A challenge in prior architectures is that silence and background noise are not being well captured. Even though the SV system runs on a short sub-second windows that are segmented by a keyword detector, the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built only using the frames corresponding to phonemes. To remedy this, they propose to use an attention layer as a soft mechanism to emphasize the most relevant elements of the input sequence.
• Their training dataset is a collection of anonymized user voice queries, which is a mixture of “OK Google” and “Hey Google”. It has around 150M utterances from around 630K speakers.
• Attention helps summarize relevant information that occurs through the entire length of an input sequence. This paper also experiments with different attention mechanisms apart from the basic attention: cross-layer attention, and divided-layer attention. For cross-layer attention, the scores and weights are not computed using the outputs of the last LSTM layer but the outputs of the second-to-last layer. However, the d-vector is still the weighted average of the last layer output.
• For divided-layer attention, they double the dimension of the last layer LSTM output, and equally divide its dimension into two parts. They use one part to build the d-vector, while using the other to learn the scores.
• From their experimental results, the best practices are to: (i) use a shared-parameter non-linear scoring function; (ii) use a divided-layer attention connection to the last layer output of the LSTM; and (iii) apply a sliding window maxpooling on the attention weights. After combining all these best practices, they improved the EER of the baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement.
##### Efficient Voice Trigger Detection for Low Resource Hardware
• This paper by Sigtia et al. from Apple in Interspeech 2018 describes the architecture of an always-on DNN-HMM system for on-device keyword spotting (KWS) in lowresource conditions, i.e., for battery-powered mobile devices.
• An always-available voice assistant needs a carefully designed voice keyword detector to satisfy the power and computational constraints of battery powered devices. They employ a multi-stage system that uses a low-power primary stage to decide when to run a more accurate (but more power-hungry) secondary detector. They describe a straightforward primary detector and explore variations that result in very useful reductions in computation (or increased accuracy for the same computation). By reducing the set of target labels from three to one per phone, and reducing the rate at which the acoustic model is operated, the compute rate can be reduced by a factor of six while maintaining the same accuracy.
• When the device is battery powered like the iPhone or the Apple Watch, it is imperative that the voice trigger detector consume as little power as possible while still maintaining sufficient accuracy. In recent iPhone designs, this is achieved by running a primary detector on a low-power processor that runs even when the main processor is asleep. This primary detector can decide to wake the main processor, where further checks are done (on the same waveform) before the main recognizer is applied and the identity of the speaker is confirmed. This paper focuses only on the primary detector which runs continuously on a low-power, low resource, always-on processor where computation and memory are the limiting factors.
• It has been demonstrated that LVCSR systems trained to predict whole phone labels (single label per phone) can achieve accuracies similar to conventional systems with 3 labels per phone. However, implementing this approach yields a significant loss in accuracy. The authors hypothesize the reason behind this as the need for a minimum duration constraint for each phone. To prove this hypothesis, they replicate each state in the trigger phrase HMM by a factor multiple while still using the whole phone DNN, which is equivalent to imposing a minimum duration on each of the labels. This yields similar accuracy as the baseline.
• An alternative way to impose longer minimum durations for each state: run the detector at a lower rate than 100 FPS. This results in longer intervals between predictions, which effectively increases the minimum duration of the HMM states. For on-device KWS, operating the detectors at a lower frame-rate is an attractive route for trying to limit the computation performed by the system.
• Their results demonstrate that for a voice trigger detection problem, it is not necessary to divide phone labels into 3 states for the beginning, middle and end of each phone. They achieve similar results to the baseline with a single label per phone and minimum duration constrains. This principle has been previously demonstrated for LVSCR with LSTM AMs, but their results demonstrate that the same holds true for DNN AMs with large input windows. As a practical consequence, they are able to run the detectors at frame-rates as low as 16.6 FPS without any loss in accuracy compared to the baseline. This represents a factor of 6 reduction in computation, which is significant when the system is deployed on low resource hardware. Alternatively, they can run a detector 6 times as large as the baseline without any extra computation.

#### 2020

##### Automatic Speaker Recognition with Limited Data
• This paper by Li et al. from UCLA, Tongji, and Amazon in WSDM 2020 proposes an adversarial few-shot learning-based speaker identification method that needs only a limited number of training instances.
• They employ metric learning-based few-shot learning to learn speaker acoustic representations using a support module and a query module, where the limited instances are comprehensively utilized to improve the identification performance. To that end, they first randomly sample a set of speakers from the training set as the start to construct the support module. For each speaker in the support module, they further randomly sample pieces of his/her audio instances and derive the corresponding MFCCs. These MFCCs are further fed into an embedding layer so they can use a fixed length vector to represent each audio instance. In the support module of the framework, for each speaker, they derive a representative embedding for each speaker, which summarizes the acoustic biometric of the aforementioned speaker. This is done using an attention layer to learn the importance weights using each audio embedding of a particular speaker.
• In the query module, they randomly select a piece of audio from a speaker, which is one of the speakers in the support module. They feed it into the embedding layer to derive the audio embedding.
• They then compare the distances between the query embedding and all the representative embeddings in the support module. The distances then are utilized to measure the relegation distribution over all speakers in the support module. The model is optimized by such iterative comparisons and reasoning between the support and query modules.
• Furthermore, adversarial learning is applied to further enhance the generalization and robustness for speaker identification with adversarial examples. The goal of employing adversarial training is to allow the identification system not only get optimized by the instances in the training data, but also be robust to unseen adversarial perturbations. To enhance the robustness, they enforce the model to perform consistently well even when the adversarial perturbations are presented. To achieve this goal, they further optimize the model to minimize the objective function with the perturbed parameters.
• The experiments are conducted on the LibriSpeech dataset. Experiments conducted on a publicly available large-scale dataset demonstrate that AFEASI significantly outperforms eleven baseline methods.
##### Speaker Identification for Household Scenarios with Self-attention and Adversarial Training
• This paper by Li et al. from Amazon, UCLA, and University of Note Dame in Interspeech 2020 proposes leveraging the self-attention mechanism to enhance long-span modeling of speaker characteristics since self-attention allows us to fully utilize dependencies over all frames in an utterance, resulting in informative global acoustic embedding representations. In contrast, CNNs by design are biased toward modeling features over nearby frames and frequencies, and RNNs are hard to train for retention of relevant information over long time intervals. These types of neutral networks thus potentially face problems capturing dependencies and characteristics expressed over long time spans within an utterance.
• Further, they utilize adversarial training as a tool to enhance the robustness and generalization of trained models, rather than as a defense against attacks.
• To learn the self-attentive utterance representations, the utterance spectrograms are fed as input to the self-attention layer to learn the transformed frame representations of speaker-relevant acoustic features, in two steps. First, they aim at mining correlations across frames in an utterance by having each transformed frame embedding be the weighted sum of frame embedding of itself and other related frames, where each weight gauges the similarity between one frame and another. Second, they aggregate the frame embeddings, including their correlational information, averaging it over the time dimension into one embedding vector and further, L2-normalizing into a fixed-length embedding vector that expresses the speaker-relevant information in the utterance. This yields a summarized global acoustic representation of an utterance.
• The experiments are conducted on the VCTK dataset which show that the proposed model significantly outperforms four state-of-the-art baselines in identifying both known and new speakers in terms of EER.
##### Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
• This paper by Higuchi et al. from Apple in 2020 proposes a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device. Therefore, having small memory and compute cost is crucial for a voice trigger detection system.
• Recently, singular value decomposition filters (SVDFs) has been used for end-to-end voice trigger detection. The SVDFs approximate a fully-connected layer with a low rank approximation, which reduces the number of model parameters. In this work, they propose S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection.
• An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. This is similar to the idea of depth-wise separable convolutions where $$K$$ filters (where $$K$$ is the number of channels in the input) are applied to each channel of the input (leading to a depth-wise convolution) yielding the same number of channels as the input, followed by a point-wise convolution which uses a $$1 \times 1 \times K$$ kernel leading to an output shape that has a single channel. Applying as many pointwise convolution filters as the desired number of output channels, yields the final output with much lesser multiplications than a standard convolution and fewer parameters than the baseline. As such, compared to a standard 2D CNN filter, the S1DCNN can be regarded as a factorization of a 2D CNN filter. An $$F \times K$$ filter of the 2D CNN layer is factorized into an F × 1 filter of the first 1D CNN layer and a $$1 \times K$$ filter of the second 1D CNN layer. This factorization reduces the number of parameters from $$O(F \times K)$$ to $$O(F + K)$$.
• They show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieve 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By increasing the length of the future context (which leads to longer time delays), the S1DCNN further improve the FRR up to 12.2% relative.
##### Optimize What Matters: Training DNN-HMM Keyword Spotting Model Using End Metric
• In DNN-HMM based KWS models, the DNN computes the observation probabilities and outputs a probability distribution over as many classes as the HMM states for each speech frame using a softmax layer. The DNN is typically trained to minimize the average (over all frames) cross-entropy loss between the predicted and the ground-truth distributions. The HMM decoder computes the word detection score using the observation, the state transition, and the prior probabilities. This training ignores the HMM transition and prior probabilities which are learned independently using training data statistics.
• Such an independently trained DNN model relies on the accuracy of the ground-truth phoneme labels as well as the HMM model. This model also assumes that the set of keyword states are optimal and each state is equally important for the keyword detection task. The DNN spends all of its capacity focusing equally on all of the states, without considering its impact on the final metric of the detection score, resulting in a loss-metric mismatch.
• This paper by Shrivastava et al. from Apple in 2021 seeks to address this loss-metric mismatch by training the DNN model by directly optimizing the keyword detection score instead of optimizing for the state probabilities.
• This end-metric based training uses only the start and the end of the keyword instead of requiring all of the speech frames to be annotated, leading to substantial savings in annotation cost. Their method changes only the training algorithm without changing any inference pipeline; therefore, there is no overhead in runtime memory or compute, since they only need to update the model parameters.
• They use a hinge loss which uses the detection score as the output which ignores the samples from optimization if their scores are beyond a margin.
• Further, they propose IOU-based sampling and design an optimization procedure that maximizes the detection score for a speech segment that “tightly” contains the keyword (positive samples) and minimize the detection score for speech that does not contain the keyword (negative samples). They also sample additional hard negatives that contain partial keywords because they do not want the model to trigger at partial phrases. To formalize the concept of “tightly” containing the keyword, they use the concept of intersection-over-union (IOU) borrowed from computer vision. They sample positive and negative windows from speech utterances such that the positive windows have high intersection-over-union (IOU) and negative windows have low IOU with the ground-truth keyword window.
• The proposed approach works significantly better (> 70% relative reduction in FRR) than the conventional DNN-HMM training and is more interpretable, accurate in localization, and data-efficient compared to the CNN-based end-to-end models.
##### MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
• This paper by Majumdar and Ginsburg from Nvidia in 2020 presents MatchboxNet - an end-to-end neural network for speech command recognition.
• MatchboxNet is a deep residual network composed from blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers. - MatchboxNet reaches state-of-the art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models.
• The small footprint of MatchboxNet makes it an attractive candidate for devices with limited computational resources.
• The model is highly scalable, so model accuracy can be improved with modest additional memory and compute.
• Finally, they show how intensive data augmentation using an auxiliary noise dataset improves robustness in the presence of background noise.

#### 2021

##### Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
• This paper by Garg et al. from Apple in 2021 presented a unified and hardware efficient architecture for two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can get falsely activated to audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices which are computationally expensive to obtain on device.
• They proposed a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features.
##### Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching
• This paper by Punjabi et al. from Amazon in 2021 proposes joint ASR-LID architectures based on RNN-Ts as an efficient, on-device-suitable alternative to conventional dynamic language switching solutions. Two primary joint modeling paradigms are explored: coupled training, where ASR and LID vocabularies share the RNN-T output space, and multi-task learning, where ASR and LID losses are modeled using dedicated parameters but minimized jointly.
• The corpus used for RNN-T training consists of in-house, far-field, de-identified voice-assistant recordings amounting to about 3.8k and 12.5k hours of spoken Hindi and Indian English data, respectively. The acoustic LID classifier (used for baseline LID and for providing language representations to RNN-T) is trained using 2k hours of balanced English-Hindi data.
• Experiments with Indian English and spoken Hindi show that: (a) code-switched utterances are inherently difficult to recognize and classify, (b) multi-task learning provides superior ASR performance whereas coupled training offers better LID accuracy, and (c) multi-task models with a dedicated LID feed-forward network offer the best performance overall.
• The proposed joint ASR-LID architectures are language agnostic and, in principle, can be scaled to more than two languages.
##### Robust Self-Supervised Audio-Visual Speech Recognition
• Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available.
• This paper by Shi et al. from FB Research in 2022 introduces a self-supervised AVSR framework based on Audio-Visual Hidden-unit BERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, to tackle the problem of audio-based automatic speech recognition (ASR) degrading significantly in noisy environments and being particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
• On the largest available AVSR benchmark dataset LRS3, AV-HuBERT approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
##### HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
• This paper by Hsu et al. from FB Research in 2021 introduces Hidden-unit BERT (HuBERT) which enables self-supervised speech representation learning approach for speech recognition, generation, and compression.
• It is based on the masked prediction problem of predicting K-means cluster assignments of masked segments of continuous input.
• On both the Librispeech 960 hours and the 60,000 hours Librilight pre-training setups, HuBERT matches or outperforms SOTA systems over all fine-tuning subsets of 10mins, 1h, 10h, 100h, and 960h. Furthermore, the learned representation quality improves dramatically with iteratively refining K- means cluster assignments using learned latent representations for a previous iteration. HuBERT scales well to a 1B transformer model showing a relative reduction in WER of up to 13% on the test-other subset.
##### Deep Spoken Keyword Spotting: An Overview
• This paper by Lopez-Espejo et al. from Aalborg University, UT Dallas and Oticon in 2021 conducts a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS.
##### BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization for a Variable Number of Speakers
• End-to-end neural diarization (EEND) with self-attention is one of the approaches that aim to model the joint speech activity of multiple speakers. It integrates voice activity and overlap detection with speaker tracking in end-to-end fashion. Moreover, it directly minimizes diarization errors and has demonstrated excellent diarization accuracy on two-speaker telephone conversations. However, EEND as originally formulated is limited to a fixed number of speakers because the output dimension of the neural network needs to be prespecified. Several methods have been proposed recently to overcome the limitations of EEND. One approach uses a speaker-wise chain rule to decode a speaker-specific speech activity iteratively conditioned on previously estimated speech activities. Another approach proposes an encoder/decoder-based attractor calculation. The embeddings of multiple speakers are accumulated over the time course of the audio input, and then disentangled one-by-one, for speaker identity assignment by speech frame. However, all these state-of-the-art EEND methods only work in an offline manner, which means that the complete recording must be available before diarization output is generated. This makes their application impractical for settings where potentially long multi-speaker recordings need to be processed incrementally (in streaming fashion).
• This paper by Han et al. from Amazon in 2021 proposes a novel method to perform EEND in a blockwise online fashion so that speaker identities are tracked with low latency soon after new audio arrives, without much degradation in accuracy compared to the offline system. They utilize the incremental Transformer encoder, where they attend to only its left contexts and ignore its right contexts, thus enabling blockwise online processing. Furthermore, the incremental Transformer encoder uses block-level recurrence in the hidden states to carry over information block by block, reducing computation time while attending to previous blocks. To their knowledge, ours is the first method that uses the incremental Transformer encoder with block-level recurrence to enable online speaker diarization.
• They present a novel online end-to-end neural diarization system, BWEDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. They propose two variants: For unlimited-latency BW-EDAEEND, which processes inputs in linear time, they show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, they show accuracy comparable to the offline clustering-based system.
##### Attentive Contextual Carryover For Multi-turn End-to-end Spoken Language Understanding
• This paper by Wei et al. from Amazon in ASRU 2021 proposes a novel E2E SLU approach where a multi-head gated attention mechanism is introduced to effectively incorporate the dialogue history in a multi-turn E2E SLU system.
• They propose a multi-head gated attention mechanism as a context combiner which combines the context encodings consisting of dialogue acts and previous utterances to create the final context vectors that are fed into the model. They explore different ways to combine the context encodings into the model: (i) averaged contextual carryover, (ii) attentive contextual carryover, and (iii) gated attentive contextual carryover. Gated attentive contextual carryover performed better than traditional multi-head attention and a simple average.
• The attention-based context can be integrated at different layers of a neural E2E SLU model such as the speech encoder stage and the ASR-NLU hidden interface, and the shared context ingestion which integrates context into both acoustic embeddings and the ASR-NLU interface. The shared context ingestion approach gave the biggest improvement compared to the other schemes.
• They built contextual E2E SLU models based on the Recurrent Neural Network Transducer (RNN-T) as well as the Transformer Transducer (T-T). E2E SLU models share an audio encoder network that encodes log-filterbank energy (LFBE) features, a prediction network that encodes a sequence of predicted wordpieces, a joint network that combines the encoder and the prediction network, and an NLU tagger that predicts intents and slots. The intent tagger contains two feedforward layers before projecting into the number of intents, and the slot tagger directly takes the output embeddings from the NLU tagger and projects them into the slot size. The audio encoder in the E2E T-T SLU and E2E RNN-T SLU are Transformer layers (with 4 attention heads) and LSTM layers, respectively.
• The models are trained and evaluated on an internal industrial voice assistant (IVA) dataset and a synthetic and publicly available multi-turn E2E SLU (Syn-Multi) dataset. They utilize SpecAugment to augment audio feature inputs.
• The proposed approach significantly improves E2E SLU accuracy on the internal industrial voice assistant and publicly available datasets compared to the non-contextual E2E SLU models.
##### SmallER: Scaling Neural Entity Resolution for Edge Devices
• This paper by McGowan et al. from Amazon in Interspeech 2021 introduces SmallER, a scalable neural entity resolution system capable of running directly on edge devices.
• SmallER addresses constraints imposed by the on-device setting such as bounded memory consumption for both model and catalog storage, limited compute resources, and related latency challenges introduced by those restrictions. Their model offers a small-footprint neural architecture capable of learning syntactic and semantic information simultaneously using distinct modules and is trained to handle multiple domains within one compact architecture (a.k.a., one model to rule them all domains!).
• They use compressed tries to reduce the space required to store catalogs on device. They also propose a novel implementation of spatial partitioning trees which at inference time strikes a balance between reducing runtime latency (by reducing the search space) and preserving recall relative to a full/exhaustive catalog search.
• They utilize Quantization Aware Training (QAT) to train SmallER. The final model consumes only 3MB of memory at inference time with classification accuracy surpassing that of previously established, domain-specific baseline models on live customer utterances. Furthermore, catalog entries are compressed overall by a factor of 2.5x.
• For the largest catalogs they consider (300 or more entries), their proxy metric for runtime latency is reduced by more than 90%.
##### Leveraging Multilingual Neural Language Models for On-Device Natural Language Understanding
• This paper by Tu et al. from Amazon in the 2021 Web Conference Workshop on Multilingual Search investigates learning multi-lingual/cross-lingual representations as an approach to increase the accuracy of on-device multilingual models without increasing their footprint relative to monolingual models, appropriate for deployment on edge devices.
• They show that cross-lingual representations can help improve NLU performance in both monolingual and multilingual settings. In particular, they show that the performance improvements for non-English monolingual NLU models are higher when they are seeded with cross-lingual representations, as compared to seeding with monolingual representations. Further, multilingual experiments suggest that the scarcer the available data-resources, the more beneficial it is to use cross-lingual representations.
##### Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
• All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
• This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 proposes using data augmentation to compare two data-centric AI methods to improve performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data.
• Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.
##### CLAR: Contrastive Learning of Auditory Representations
• The following paper summary has been contributed by Zhibo Zhang.
• The paper by AI-Tahan et al. from Western Ontario University in AISTATS 2021 proposes a new framework CLAR - Contrastive Learning of Auditory Transformations for learning auditory representation that involves a mixture of contrastive loss and supervised cross-entropy loss.
• This framework adopts two forms of input on top of the augmented audio data: the audio signal as well as the spectrograms of the according audio signal, trained by two different encoders.
• The authors tested eight different augmentation strategies that belong to two categories - frequency transformation and temporal transformation, and they empirically found out that adding more augmentation operations did not necessarily bring better accuracy scores using the ResNet-18 model.
• In addition, the authors compared the CLAR method with supervised learning as well as self-supervised learning on the Speech CommandS-10 dataset. CLAR indicated better performance when training with 100%, 20% and 10% labels on large epochs while worse performance (compared to self-supervised learning) when training with only 1% labels.
• However, the authors showed all the experimental results using only the ResNet-18 model, which is less convincing given that contrastive learning benefits more from larger models as pointed out in the SimCLR paper (Chen et al.). Thus, it would be interesting to see results on the ResNet-50 model. In addition, as part of the experiments, the authors compared the CLAR approach with the [supervised contrastive learning framework (Khosla et al.) when data is partially labeled. It would be useful to add information that describes how supervised contrastive learning was generalized to the semi-supervised setting given that the original methodology was designed for the fully supervised setting.
• Last but not least, as some potential future work, AutoAugment (Cubuk et al.) could be adopted to select the augmentation strategy combinations as well as their hyperparameters.

#### 2022

##### Adaptive Global-Local Context Fusion for Multi-Turn Spoken Language Understanding
• This paper by Tran et al. from Amazon in AAAI 2022 tackles the problem of multi-turn Spoken Language Understanding (SLU), where dialogue contexts are used to guide intent classification and slot filling. They propose a novel contextual SLU model for multi-turn intent classification and slot filling tasks that aims to selectively incorporate dialogue contexts, such as previous utterances and dialogue acts for multi-turn SLU.
• They introduce an adaptive global-local context fusion mechanism to selectively integrate dialogue contexts into their model. The local context fusion aligns each dialogue context using multi-head attention, while the global context fusion measures overall context contribution to intent classification and slot filling tasks.
• The models are trained and evaluated on the publicly-available Sim-R and Sim-M datasets and an internal in-house dataset.
• Experiments show that on two benchmark datasets, their model achieves absolute F1 score improvements of 2.73% and 2.57% for the slot filling task on Sim-R and Sim-M datasets, respectively.
• Ablation studies indicate that dialogue history contexts play a crucial role in improving SLU task in the multi-turn dialogue setting.

### Multimodal

#### 2017

##### Axiomatic Attribution for Deep Networks
• This paper by Sundararajan from Google in ICML 2017 studies the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works.
• They identify two fundamental axioms — Sensitivity and Implementation Invariance that attribution methods ought to satisfy. They show that they are not satisfied by most known attribution methods, which they consider to be a fundamental weakness of those methods.
• They use the axioms to guide the design of a new attribution method called Integrated Gradients.
• Their method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator.
• Since this method is multimodal, they apply this method to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
• Since integrated gradients add up to the final prediction score, the magnitudes can be use for accounting the contributions of each feature. For instance, for the molecule in the figure, atom-pairs that have a bond between them cumulatively contribute to 46% of the prediction score, while all other pairs cumulatively contribute to only −3%.

#### 2021

##### Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
• There exists a training-inference mismatch when learning these models in typical unsupervised training of controllable generative models. During training, the same sample is used as content input and style input, whereas during inference, content and style inputs are from different samples, i.e., the reference style sample contains a different content than the target content. The mismatch leads to incorrect content generation during inference.
• This paper by Chang et al. from Apple in 2021 presented a simple but effective technique to deal with the training-inference mismatch when controllable auto-regressive models are learned in an unsupervised manner. Further, to mitigate the training-inference mismatch, the paper proposed style equalization which takes unpaired samples as input during both training and inference. It transforms the style of sample B to that of A by estimating their style difference.
• Trained using tuples $$(x_i, c_i)$$ where $$x_i$$ is the style sample and $$c_i$$ is the content sample.
• If a generative model learns to utilize the content information in the style example, during inference the generative model will generate wrong content. This phenomenon is called content leakage.
• Instead of directly using sample B as style (in which case there is no ground truth), they jointly learn a style transformation function (using CNNs + Multihead attention), which estimates the style difference between A and B and transforms the style of sample B to the style of A. The generative model then takes content A and the transformation output (that contains the style of A) to reconstruct sample A. The proposed method enables us to use sample A as the ground truth while learning in the non-parallel setting. During inference given arbitrary content A and reference sample B, they turn off the style transformation (since by construction, the style difference is zero), and thus the output sample contains content A and style of B.
• The proposed method is general and can be applied to different sequence signals. They apply the proposed method on two signal domains, speech and online handwriting, and evaluate the performance carefully via quantitative evaluation (by computing content error rates) and conducting qualitative user studies. Their results show that the proposed method outperforms state-of-the-art methods, including those having access to additional style supervision like speaker labels. Both quantitative and qualitative results show that their model achieves near-real content accuracy and style reproduction.
• Note that for style equalization to be successful, $$M()$$ should not transfer any content-related information (e.g., copy the entire sequence) from $$x$$ but only its style information so that the decoder will utilize the transferred style and will rely on provided content input to generate the output. Therefore the design of $$M$$ is critical.
• They evaluate the proposed method on two multi-speaker speech datasets. VCTK dataset (Yamagishi et al., 2019) contains 110 speakers and 44 hours of speech, and LibriTTS dataset (Zen et al., 2019) contains 2,311 speakers and 555 hours of speech in the training set.

#### 2022

##### Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
• This paper by Shi et al. from FB Research in 2022 introduces Audio-Visual Hidden Unit BERT (AV-HuBERT) which exploits video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip movements and the produced sound.
• AV-HuBERT is a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. It learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
• On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using their audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%).
• Code and models are available here.

### Core ML

#### 2018

##### Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
• This article by Raschka from UW-Madison in 2018 reviews different techniques that can be used for model evaluation, model selection, and algorithm selection.
• Each technique is discussed and its pros and cons are weighed with supporting examples. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning.
• Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and $$k$$-fold cross-validation are reviewed, the bias-variance trade-off for choosing $$k$$ is discussed, and practical tips for the optimal choice of $$k$$ are given based on empirical evidence.
• Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed.
• Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.

#### 2020

##### Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
• Deep networks were recently suggested to face the odds between accuracy (on clean natural images) and robustness (on adversarially perturbed images). Such a dilemma is shown to be rooted in the inherently higher sample complexity and/or model capacity, for learning a high-accuracy and robust classifier. In view of that, give a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. The paper seeks to answer the question: is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins?
• This paper by Hu et al. from TAMU in ICLR 2020 studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a “sweet point” in co-optimizing model accuracy, robustness, and efficiency.
• Their proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows for each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which they present a systematical investigation.
• They show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models.

#### 2022

##### OmniXAI: A Library for Explainable AI
• This paper by Yang et al. from Salesforce Research presents Omni eXplainable AI (OmniXAI), an open-source Python library of eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) in practice.
• OmniXAI aims to be a one-stop comprehensive library that makes explainable AI easy for data scientists, ML researchers and practitioners who need explanation for various types of data, models and explanation methods at different stages of ML process (data exploration, feature engineering, model development, evaluation, and decision-making, etc).
• In particular, our library includes a rich family of explanation methods integrated in a unified interface, which supports multiple data types (tabular data, images, texts, time-series), multiple types of ML models (traditional ML in Scikit-learn and deep learning models in PyTorch/TensorFlow), and a range of diverse explanation methods including “model-specific” and “model-agnostic” ones (such as feature-attribution explanation, counterfactual explanation, gradient-based explanation, etc).
• For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications by only writing a few lines of codes, and also a GUI dashboard for visualization of different explanations for more insights about decisions.
• They present OmniXAI’s design principles, system architectures, and major functionalities, and also demonstrate several example use cases across different types of data, tasks, and models.