Papers List

  • A curated set of papers I’ve reviewed for my latest deep dive into AI/ML.

Seminal Papers

Vision

2010

Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
  • This paper by Gutmann and Hyvarinen in 2010 introduced the concept of negative sampling that forms the basis of contrastive learning. They proposed a new estimation principle, noise-contrastive estimation, which performs nonlinear logistic regression to discriminate between observed data and artificially generated noise.
  • They apply the method to the modeling of natural images and show that the method can successfully estimate a large-scale two-layer model and a Markov random field.

2012

ImageNet Classification with Deep Convolutional Neural Networks
  • The original AlexNet paper by Krizhevsky et al. from NeurIPS 2012 that started it all. This trail-blazer introduced Deep Learning to the world :)
3D Convolutional Neural Networks for Human Action Recognition
  • This paper by Ji et al. from Arizona State University in IEEE PAMI 2012 introduced 3D CNNs.

2013

Visualizing and Understanding Convolutional Networks
  • This legendary paper by Zeiler and Fergus from the Courant Institute, NYU in 2013 seeks to demystify two questions: why do CNNs perform so well on image classification, and how might they be improved?
  • They introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
  • They also perform an ablation study to discover the performance contribution from different model layers. This enables them to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
  • They show their ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the then-current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

2015

Very Deep Convolutional Networks for Large-Scale Image Recognition
  • This paper by Simonyan and Zisserman from DeepMind and Oxford in ICLR 2015 proposed the VGG architecture. They showed that a significant performance improvement can be achieved by pushing the depth to 16-19 weight layers, i.e., VGG-16 and VGG-19.
  • The main principle is that a stack of \(3 \times 3\) convolution filters is better than a single \(7 \times 7\) layer. Firstly, because the stack uses three non-linear activations (instead of one), which makes the function more discriminative. Secondly, the \(3 \times 3\) design decreases the number of parameters while covering the same \(7 \times 7\) receptive field – specifically, you need \(3 \times (3^2)C^2 = 27C^2\) weights, compared to a \(7 \times 7\) conv layer which would require \(1 \times (7^2)C^2 = 49C^2\) parameters (81% more).
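  • To make the parameter comparison concrete, here is a quick check, assuming \(C\) input and output channels for every layer (a simplification of the actual VGG configurations):

```python
# Weight count (biases ignored) for stacked conv layers with C input/output channels each.
def conv_params(kernel_size: int, channels: int, num_layers: int) -> int:
    return num_layers * (kernel_size ** 2) * channels * channels

C = 256
stack_3x3 = conv_params(kernel_size=3, channels=C, num_layers=3)   # 27 * C^2
single_7x7 = conv_params(kernel_size=7, channels=C, num_layers=1)  # 49 * C^2

print(stack_3x3, single_7x7)                # 1769472 3211264
print(f"{single_7x7 / stack_3x3 - 1:.0%}")  # ~81% more parameters for the single 7x7 layer
```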
Going Deeper with Convolutions
  • This paper by Szegedy et al. from Google in CVPR 2015 introduced the Inception (also known as GoogLeNet or InceptionNet) architecture which achieved state of the art results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014.
  • Ideas from the paper:
    • Increasing the depth (number of layers) is not the only way to make a model bigger. What about increasing both the depth and width of the network while keeping computation at a manageable level? This time the inspiration comes from the human visual system, wherein information is processed at multiple scales and then aggregated locally. How do you achieve this without a memory explosion? The answer is \(1 \times 1\) convolutions! Their main purpose is channel dimensionality reduction: \(1 \times 1\) convolutions are used to shrink the number of channels before the computationally expensive \(3 \times 3\) and \(5 \times 5\) convolutions. Inception uses convolutions of different kernel sizes (\(5 \times 5\), \(3 \times 3\), \(1 \times 1\)) to capture details at multiple scales (see the sketch after this list).
    • To enable concatenation of features convolved with different kernels, they pad the output so it has the same size as the input. For single-stride convolutions without dilation, the padding \(p\) that keeps \(out = in\) (i.e., input and output have the same spatial dimensions) for kernel size \(k\) is \(p = (k-1)/2\), since \(out = in + 2p - k + 1\).
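  • Here is a minimal Inception-style block in PyTorch illustrating both ideas above (1×1 reductions before the expensive convolutions, and same-padding so the branches can be concatenated); the channel counts are illustrative rather than the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 branches plus pooling, with 1x1 convolutions used as
    channel reductions before the expensive convolutions. Padding p = (k - 1) // 2 keeps
    the spatial size unchanged so the branch outputs can be concatenated."""
    def __init__(self, in_ch, b1=64, b3_red=96, b3=128, b5_red=16, b5=32, pool_proj=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, b1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3_red, kernel_size=1),          # channel reduction
            nn.Conv2d(b3_red, b3, kernel_size=3, padding=1),  # p = (3 - 1) // 2 = 1
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5_red, kernel_size=1),          # channel reduction
            nn.Conv2d(b5_red, b5, kernel_size=5, padding=2),  # p = (5 - 1) // 2 = 2
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along the channel dimension

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]) = 64 + 128 + 32 + 32 channels
```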
FaceNet: A Unified Embedding for Face Recognition and Clustering
  • This paper by Schroff et al. from Google in 2015 proposes FaceNet, a system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
  • Their method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, they use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of their approach is much greater representational efficiency: they achieve state-of-the-art face recognition performance using only 128 bytes per face.
  • Previous face recognition approaches based on deep networks use a classification layer trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network. In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Their triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin.
  • Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning, they present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, they also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
  • The triplet loss minimizes the L2 distance between faces of the same identity and enforces a margin between the distances of faces of different identities, i.e., a relative distance constraint. Specifically, the triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Thus, the network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances (a minimal sketch of this loss is given at the end of this entry). Once this embedding has been produced, downstream tasks become straightforward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
  • On the widely used Labeled Faces in the Wild (LFW) dataset, their system achieves a new record accuracy of 99.63% (and 95.12% on the YouTube Faces DB), cutting the error rate in comparison to the best published results on both datasets by 30%.
  • They explore two different deep convolutional network architectures that have been recently used to great success in the computer vision community. The first architecture is based on the Zeiler & Fergus model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses which reduces the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
  • They also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison.
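  • A minimal sketch of the triplet loss described above, assuming the embeddings are already L2-normalized (as in the paper); the margin value here is illustrative:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L(a, p, n) = max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 distance to the positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2 distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()

# Toy example with 128-D embeddings normalized onto the unit hypersphere.
a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))
```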
Distilling the Knowledge in a Neural Network
  • This paper by Hinton et al. from Google in NeurIPS 2014 starts from the observation that a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
  • Caruana et al. have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and the authors develop this approach further using a different compression technique. They achieve some surprising results on MNIST and show that they can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. They also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel. This shows that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.
  • The results show that on MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is a version of the one used by Android voice search, they have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
  • For really big neural networks, it can be infeasible even to train a full ensemble, but they show that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster.
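  • A minimal sketch of the distillation objective: a temperature-softened KL term against the teacher's logits plus the usual cross-entropy on the hard labels (the temperature and weighting below are illustrative, not the paper's tuned values):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target term (scaled by T^2, as suggested in the paper) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(16, 10)
teacher = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
print(distillation_loss(student, teacher, labels))
```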

2016

Rethinking the Inception Architecture for Computer Vision
  • This paper by Szegedy et al. from Google in CVPR 2016 proposed Inception-v2 and Inception-v3, improving the Inception model based on the following principles:
    • Using the same principle as VGG, the authors factorized \(5 \times 5\) and \(7 \times 7\) (in Inception-v3) convolutions into two and three sequential \(3 \times 3\) convolutions respectively. This improves computational speed and uses far fewer parameters (see the sketch after this list).
    • Used spatially separable convolutions. Simply, a \(3 \times 3\) kernel is decomposed into two smaller ones: a \(1 \times 3\) and a \(3 \times 1\) kernel, which are applied sequentially.
    • Widened the Inception modules (more filters).
    • Distributed the computational budget in a balanced way between the depth and width of the network.
    • Added batch normalization.
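  • The two factorizations above can be written down directly; a hedged sketch comparing a \(5 \times 5\) convolution with two stacked \(3 \times 3\) convolutions, and a \(3 \times 3\) convolution with the spatially separable \(1 \times 3\) / \(3 \times 1\) pair (the channel count is illustrative):

```python
import torch.nn as nn

C = 64

# 5x5 factorized into two 3x3 convolutions (same 5x5 receptive field, fewer weights).
conv5x5 = nn.Conv2d(C, C, 5, padding=2)                       # 25 * C^2 weights
two_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, padding=1),
                        nn.Conv2d(C, C, 3, padding=1))        # 18 * C^2 weights

# 3x3 factorized into spatially separable 1x3 then 3x1 convolutions.
sep_3x3 = nn.Sequential(nn.Conv2d(C, C, (1, 3), padding=(0, 1)),
                        nn.Conv2d(C, C, (3, 1), padding=(1, 0)))  # 6 * C^2 weights

def weight_count(module):
    return sum(p.numel() for p in module.parameters() if p.dim() > 1)  # ignore biases

print(weight_count(conv5x5), weight_count(two_3x3), weight_count(sep_3x3))  # 102400 73728 24576
```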
Deep Residual Learning for Image Recognition
  • The ResNet paper by He et al. from Microsoft Research in CVPR 2016; one of the most cited papers across all of AI.
  • The issue of vanishing gradients when training a deep neural network was addressed with two tricks:
    • Batch normalization and,
    • Short skip connections
  • Instead of \(H(x) = F(x)\), the skip connection leads to \(H(x) = F(x) + x\), which implies that the model is learning the difference (i.e., residual), \(F(x) = H(x) - x\).
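  • A minimal residual-block sketch showing the \(H(x) = F(x) + x\) identity shortcut (a simplified basic block, not the exact bottleneck block used in ResNet-50 and deeper variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is two 3x3 conv + batch-norm layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # the skip connection: the block learns F(x) = H(x) - x

print(BasicResidualBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```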

2017

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
  • This paper by Szegedy et al. from Google in AAAI 2017 introduced two further versions of the Inception family – Inception-v4 and Inception-ResNet.
Photo-Realistic Single Image Super-Resolution using a GAN
  • This paper by Ledig et al. from Twitter in CVPR 2017 applied GANs to single image super-resolution (SISR), proposing SRGAN.

2018

From Recognition to Cognition: Visual Commonsense Reasoning
  • Visual understanding goes well beyond object recognition. With one glance at an image, humans can effortlessly imagine the world beyond the pixels: for instance, we can infer people’s actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today’s vision systems, requiring higher-order cognition and commonsense reasoning about the world.
  • This paper by Zellers et al. from UWash in CVPR 2019 formalizes this task as Visual Commonsense Reasoning (VCR). Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.
  • Next, they introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
  • To move towards cognition-level understanding, they present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and they provide analysis that suggests avenues for future work.
  • Website with models/datasets.
Focal Loss for Dense Object Detection
  • The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.
  • This paper by Lin et al. from Facebook AI Research in 2017 investigates why this is the case and introduces the focal loss. They discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross-entropy loss such that it down-weights the loss assigned to well-classified examples.
  • Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
  • Their novel focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of the loss, they design and train a simple dense detector called RetinaNet.
  • Their results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
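  • A minimal sketch of the focal loss for binary classification with logits; the \(\gamma = 2\) and \(\alpha = 0.25\) values are the paper's defaults, but the anchor-level normalization used in RetinaNet is omitted:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for binary targets in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # modulating factor decays for easy examples

logits = torch.randn(32)
targets = torch.randint(0, 2, (32,)).float()
print(focal_loss(logits, targets))
```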
Relational inductive biases, deep learning, and graph networks
  • Recent advances in AI, propelled by deep learning, have been transformative across many important domains. Despite this, a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.
  • This paper by Battaglia et al. (2018) from DeepMind/Google, MIT and the University of Edinburgh offers a great overview of the relational inductive biases of various neural net architectures, summarized in a table in the paper.

  • They argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and advocate for marrying complementary approaches which draw on ideas from human cognition, traditional computer science, standard engineering practice, and modern deep learning. Just as biology uses nature and nurture cooperatively, they reject the false choice between “hand-engineering” and “end-to-end” learning, and instead advocate for an approach which benefits from their complementary strengths.
  • They investigate how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them.
  • They explore flexible learning-based approaches which implement strong relational inductive biases to capitalize on explicitly structured representations and computations, and present a new building block for the AI toolkit – the graph neural networks (GNNs).
  • GNNs generalize and extend various approaches for neural networks that operate on graphs, and provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. GNNs are designed to promote building complex architectures using customizable graph-to-graph building blocks, and their relational inductive biases support relational reasoning, combinatorial generalization, and improved sample efficiency over other standard machine learning building blocks. This would help lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
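  • A minimal message-passing sketch in the spirit of the graph-network building block: each node state is updated from its aggregated neighbor messages (this is a generic GNN layer, not the paper's full edge/node/global formulation):

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """h_i' = MLP([h_i, sum_{j in N(i)} h_j]): a generic message-passing update."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, node_feats, edge_index):
        # edge_index: (2, E) tensor of (source, target) index pairs.
        src, dst = edge_index
        messages = torch.zeros_like(node_feats)
        messages.index_add_(0, dst, node_feats[src])          # sum neighbor features into each target
        return self.update(torch.cat([node_feats, messages], dim=1))

nodes = torch.randn(5, 16)                                    # 5 nodes, 16-D features
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])            # 4 directed edges
print(SimpleGNNLayer(16)(nodes, edges).shape)                 # torch.Size([5, 16])
```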

2019

Objects as Points
  • This paper by Zhou et al. from UT Austin in 2019 proposes CenterNet, a center point-based object detection approach, which is end-to-end differentiable, simpler, faster, and more accurate than other competitive bounding box based detectors.
  • CenterNet is an anchorless object detection architecture. As such, this structure has an important advantage in that it replaces the classical NMS (Non Maximum Suppression) step during post-processing. This mechanism enables faster inference.
  • Where most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each, which is wasteful, inefficient, and requires additional post-processing, CenterNet models an object as a single point — the center point of its bounding box. CenterNet object detector builds on successful keypoint estimation networks and uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, depth and extent, and pose in a single forward pass. The algorithm is simple, fast, accurate, and end-to-end differentiable without any NMS post-processing. The idea is general and has broad applications beyond simple two-dimensional detection.
  • They compare CenterNet with other state-of-the-art detectors on the COCO test-dev set. With multi-scale evaluation, CenterNet with Hourglass104 achieves an AP of 45.1%, outperforming all existing one-stage detectors. Sophisticated two-stage detectors are more accurate, but also slower.
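  • The NMS-free decoding mentioned above boils down to keeping local maxima of the predicted center heatmap; a hedged sketch of this peak-extraction step (a 3×3 max-pool keeps a location only if it is the maximum of its neighborhood; the heatmap here is a random placeholder):

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=10):
    """Keep local maxima of a (C, H, W) center heatmap and return the top-k scores and flat indices."""
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (pooled == heatmap).float()     # zero out non-maximum locations
    scores, idx = peaks.flatten().topk(k)
    return scores, idx

heatmap = torch.rand(1, 128, 128)                     # placeholder for the predicted center heatmap
scores, idx = heatmap_peaks(heatmap)
print(scores.shape, idx.shape)                        # torch.Size([10]) torch.Size([10])
```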
RandAugment: Practical automated data augmentation with a reduced search space
  • Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models.
  • Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images.
  • An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models.
  • This paper by Cubuk et al. from Google Brain in 2019 demonstrates that previous learned-augmentation methods suffer from systematic drawbacks. Namely, not tailoring the number of distortions and the distortion magnitude to the dataset size or the model size leads to sub-optimal performance. In previous work, scaling learned data augmentation to larger datasets and models has been a notable obstacle. For example, AutoAugment and Fast AutoAugment could only be optimized for small models on reduced subsets of data; population based augmentation was not reported for large-scale problems.
  • They propose RandAugment, a simple parameterization for targeting augmentation to particular model and dataset sizes, which seeks to remove both of the aforementioned obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes.
  • RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet without a separate search for data augmentation policies.
  • The proposed method scales quite well to datasets such as ImageNet and COCO while incurring minimal computational cost: it exposes only two hyperparameters (the number of transformations applied per image and a single global magnitude), yet yields notable predictive performance gains (see the sketch at the end of this entry).
  • On the ImageNet dataset they achieve 85.0% accuracy, a 0.6% increase over the previous state of the art and a 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO.
  • Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size.
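  • The whole method reduces to the two hyperparameters mentioned above: the number of transformations \(N\) applied per image and a single global magnitude \(M\). A hedged sketch of the selection loop with a toy set of NumPy operations (the real implementation uses a richer set of PIL-based transforms):

```python
import numpy as np

def identity(img, m): return img
def flip_lr(img, m): return img[:, ::-1]
def rotate90(img, m): return np.rot90(img, k=1 + int(3 * m / 10))       # magnitude -> number of quarter-turns
def brightness(img, m): return np.clip(img * (1.0 + 0.1 * m), 0, 255)

OPS = [identity, flip_lr, rotate90, brightness]   # toy stand-ins for the paper's transform set

def rand_augment(img, n=2, m=9, seed=0):
    """Apply N randomly chosen transformations, each parameterized by the single global magnitude M."""
    rng = np.random.default_rng(seed)
    for idx in rng.integers(0, len(OPS), size=n):
        img = OPS[idx](img, m)
    return img

img = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)
print(rand_augment(img, n=2, m=9).shape)          # (32, 32, 3)
```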

2020

Designing Network Design Spaces
  • This paper by Radosavovic et al. from FAIR in CVPR 2020 presents a new network design paradigm. Their goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, they design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level.
  • Their methodology explores the structural aspect of network design and arrives at a low-dimensional design space consisting of simple, regular networks that they call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function (see the sketch at the end of this entry).
  • They analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes.
  • Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
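  • A hedged sketch of my reading of the quantized linear width rule \(u_j = w_0 + w_a \cdot j\), with widths then snapped to \(w_0 \cdot w_m^{s_j}\) for integer \(s_j\) and rounded to multiples of 8; the constants below are illustrative, not a specific RegNet instance:

```python
import numpy as np

def regnet_widths(depth=13, w0=24, wa=36.0, wm=2.5):
    """Quantized linear widths: u_j = w0 + wa * j, snapped to the nearest w0 * wm^s_j."""
    j = np.arange(depth)
    u = w0 + wa * j                               # continuous linear widths
    s = np.round(np.log(u / w0) / np.log(wm))     # nearest integer power of wm for each width
    w = w0 * np.power(wm, s)                      # quantized widths
    return (np.round(w / 8) * 8).astype(int)      # round to multiples of 8, as is common practice

print(regnet_widths())                            # per-block widths; consecutive equal widths form a stage
```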
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  • In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
  • This paper by Dosovitskiy et al. from Google Brain in ICLR 2021 shows that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
  • When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the proposed Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
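  • The "image patches as tokens" idea can be written as a single strided convolution; a minimal sketch of the patch-embedding step (class token and position embeddings are omitted; the 16-pixel patch size and 768-dim embedding match the ViT-Base defaults):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping P x P patches and linearly project each to a D-dim token."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # one "patchify" convolution

    def forward(self, x):
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) with N = (H/P) * (W/P) tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 196, 768]): 14 x 14 = 196 patch tokens
```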

Training data-efficient image transformers & distillation through attention
  • Compared to CNNs, vision transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.
  • This paper by Touvron et al. from Facebook AI proposes DeiT, a competitive convolution-free transformer that does not require a very large amount of data to be trained, thanks to improved training and in particular a novel distillation procedure. DeiT is trained on ImageNet on a single computer in less than 3 days. Their reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.
  • They introduce a teacher-student strategy specific to transformers. Using distillation can hamper the performance of neural networks: the student model pursues two objectives that may diverge, learning from a labeled dataset (strong supervision) and learning from the teacher. To alleviate this, they introduce a distillation token, a learned vector that flows through the network along with the transformed image data. The distillation token cues the model for its distillation output, which can differ from its class output. This new distillation method is specific to Transformers and further improves the image classification performance.
  • It relies on a distillation token ensuring that the student learns from the teacher through attention. They show the interest of this token-based distillation, especially when using a ConvNet as a teacher. This leads them to report results competitive with CNNs both on ImageNet (where they obtain up to 85.2% top-1 accuracy) and when transferring to other tasks.
  • Facebook AI post.
  • Github repo.

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
  • This paper by Mildenhall et al. from UC Berkeley, Google and UCSD in ECCV 2020 introduces NeRF, a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
  • Their algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location \((x, y, z)\) and viewing direction \((\theta, \phi)\)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
  • They synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize their representation is a set of images with known camera poses. They describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
  • Project page with videos and code.
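  • A minimal NumPy sketch of the classic volume-rendering step along a single ray, which is the differentiable piece that makes the whole pipeline trainable (the densities and colors would come from the MLP; here they are random placeholders):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                                    # opacity of each segment
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))      # accumulated transmittance
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0)                             # expected color along the ray

n_samples = 64
sigmas = np.random.rand(n_samples)            # volume densities from the network (placeholder)
colors = np.random.rand(n_samples, 3)         # view-dependent RGB from the network (placeholder)
deltas = np.full(n_samples, 1.0 / n_samples)  # distances between adjacent samples along the ray
print(render_ray(sigmas, colors, deltas))     # one rendered RGB value
```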

2021

Do Vision Transformers See Like Convolutional Neural Networks?
  • Given the central role of convolutional neural networks in computer vision breakthroughs (leading to them being the de-facto model for visual data), it is remarkable that Transformer architectures (almost identical to those used in language) are capable of similar performance. For instance, recent work has shown that the Vision Transformer (ViT) model can achieve comparable or even superior performance on image classification tasks. This raises fundamental questions on whether these architectures work in the same way as CNNs: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations?
  • This paper by Raghu et al. from Google Brain in 2021 analyzes the internal representation structure of ViTs and CNNs on image classification benchmarks, and finds striking differences in the features and internal structures between the two architectures, such as ViT having more uniform representations across all layers. They explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information (“earlier global features”), and ViT residual connections, which offer representation propagation of features from lower to higher layers, while also revealing that some CNN properties, e.g. local information aggregation at lower layers, are important to ViTs, being learned from scratch at scale.
  • They also examine the potential for ViTs to be used beyond classification through a study of spatial localization, discovering that ViTs with CLS tokens successfully preserve input spatial information, which is promising for future uses in object detection.
  • Finally, they investigate the effect of scale for transfer learning, finding larger ViT models develop significantly stronger intermediate representations through larger pretraining datasets. These results are also very pertinent to understanding recent architectures for vision such as the MLP-Mixer.
BEiT: BERT Pre-Training of Image Transformers
  • This paper by Bao et al. from Microsoft Research in 2021 introduces a self-supervised pre-trained representation model called BEiT, which stands for Bidirectional Encoder representations from Image Transformers. Following BERT from the natural language processing area, they propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views during pre-training: image patches (such as 16x16 pixels), whose embeddings are computed as linear projections of flattened patches, and visual tokens (i.e., discrete tokens). Before pre-training, they learn a discrete variational autoencoder (dVAE) which acts as an “image tokenizer”, learnt via autoencoding-style reconstruction: the input image is tokenized into discrete visual tokens given by the latent codes of the dVAE (the image tokenizer from DALL-E; Ramesh et al., 2021) according to the learned vocabulary.
  • They show that the proposed method is critical to make BERT-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers. They also present the intriguing property of automatically acquired knowledge about semantic regions, without using any human-annotated data.
  • Similar to the masked language modeling pre-training task of BERT, BEiT randomly masks some image patches and feeds them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
  • After pre-training BEiT, they directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
  • Experimental results on image classification and semantic segmentation show that BEiT achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
  • Code and pretrained models are here.

2022

A ConvNet for the 2020s
  • This paper by FAIR and UC Berkeley seeks to refute the recent apparent superiority of Transformers by re-examining the design of ConvNets and testing their limitations. The proposed approach is based on gradually modifying a standard ResNet50, following design choices closely inspired by the Vision Transformer, to propose a new family of pure ConvNets called ConvNeXt, which can perform as well as a hierarchical vision Transformer on image classification, object detection, instance and semantic segmentation tasks.
  • The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.
  • However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions.
  • In this paper, the authors reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
  • The authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. They implement a series of design decisions starting with a ResNet50 trained with up-to-date techniques (extending the number of epochs, using AdamW optimizer, Stochastic Depth, Label Smoothing, and so on):
    • Macro Design: The authors considered two aspects of Swin Transformers’ macro design. The first is the number of blocks in each stage (stage compute ratio), which was adjusted from (4, 4, 6, 3) to (3, 3, 9, 3), following the Swin Transformer ratio of (1:1:3:1). The second is the stem cell configuration, which in the original ResNet50 consisted of 7×7 convolutions with stride 2 followed by a max-pooling layer. This was substituted by a more Transformer-like “patchify” layer which utilizes 4×4 non-overlapping convolutions with stride 4. These modifications improved the accuracy to 79.5%.
    • ResNeXt: In this part, the authors adopt two design choices of the popular ResNeXt: depthwise convolutions, which are interestingly similar to self-attention as they work on a per-channel basis, and a higher number of channels (from 64 to 96). These modifications improved the accuracy to 80.5%.
    • Inverted Bottleneck: An essential configuration of Transformers is the expansion-compression rate in the MLP block (the hidden dimension is 4 times higher than the input and output dimension). This feature was reproduced by adding the inverted bottleneck design used in ConvNets (where the input is expanded using \(1 \times 1\) convolutions and then shrunk through depthwise convolution and \(1 \times 1\) convolutions). This modification slightly improved the accuracy to 80.6%.
    • Large kernel sizes: The gold standard in ConvNets since the advent of VGG has been \(3 \times 3\) kernels. Small kernels lead to the famous local receptive field, which, compared to global self-attention, has a more limited area of focus. Although Swin Transformers reintroduced the concept of local attention, their window size has always been at least \(7 \times 7\). To explore larger kernels, the first step is to move the depthwise convolution up, before the \(1 \times 1\) layers, so that the expensive large-kernel operation sees fewer channels. This first modification resulted in a temporary degradation to 79.9%, but, after experimenting with different sizes, a \(7 \times 7\) window (larger values brought no further gains) allowed the authors to reach an accuracy of 80.6% again.
    • Micro Design: Finally, some micro design choices were added: GELU instead of ReLU, a single activation per block (the Transformer MLP block has just one activation between its two linear layers), fewer normalization layers, Batch Normalization substituted by Layer Normalization, and a separate downsampling layer (see the block sketch at the end of this entry).
    • These modifications improved the accuracy to 82.0% and defined the final model, named ConvNeXt.
  • A comparison of this architecture with the Swin Transformer and ResNet can be found in the paper.

  • Based entirely on convolutions, this model competed on par with Transformer-based architectures, achieving a top-1 accuracy of 87.8% on ImageNet classification. Equally excellent results were obtained in other tasks, such as object detection and segmentation on COCO and semantic segmentation on ADE20K.
  • The idea of modernizing ConvNets, adding all the concepts introduced over the past decade to a single model, is payback for convolutions, which have lately been sidelined in favor of transformers. The authors suggest that ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. They believe the architecture choice should meet the needs of the task at hand while striving for simplicity and efficiency.
  • Github repo.
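  • Putting the list of design choices together, a hedged sketch of a ConvNeXt-style block: depthwise 7×7 convolution, LayerNorm, an inverted-bottleneck MLP with a single GELU, and a residual connection (layer scale and stochastic depth are omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project, plus a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise convolution
        self.norm = nn.LayerNorm(dim)                 # normalizes over the channel dimension (channels-last)
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # 1x1 convs implemented as Linear on (B, H, W, C)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                             # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)        # to channels-last (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)       # back to channels-first

print(ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```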
Natural Language Descriptions of Deep Visual Features
  • Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respond to individual concept categories like colors, textures, and object classes. But these techniques are limited in scope, labeling only a small subset of neurons and behaviors in any network. Is a richer characterization of neuron-level computation possible?
  • This paper by Hernandez et al. from MIT, Northeastern and Allegheny College in 2022 proposes MILAN (mutual-information-guided linguistic annotation of neurons), which aims to generate open-ended, compositional, natural language descriptions of individual neurons in deep networks.
  • Given a neuron, MILAN generates a description by searching for a natural language string that maximizes pointwise mutual information with the image regions in which the neuron is active. These mutual information estimates are in turn produced by a pair of learned models trained on MILANNOTATIONS, a dataset of fine-grained image annotations released with this paper. MILAN produces fine-grained descriptions that capture categorical, relational, and logical structure in learned features. These descriptions obtain high agreement with human-generated feature descriptions across a diverse set of model architectures and tasks, and can aid in understanding and controlling learned models.
  • They highlight three applications of natural language neuron descriptions.
    • First, they use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
    • Second, they use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
    • Finally, they use MILAN for editing, improving robustness in an image classifier by deleting neurons sensitive to text features spuriously correlated with class labels.
  • MarkTechPost [link](https://www.marktechpost.com/2022/02/06/mit-researchers-introduce-a-machine-learning-technique-that-can-automatically-describe-the-roles-of-individual-neurons-in-a-neural-network-with-natural-language/).
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
  • Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object-centric features that perform on par with supervised features on most object-centric downstream tasks.
  • This paper by Goyal et al. in 2022 from FAIR asks whether, using this ability, we can learn any salient and more representative information present in the diverse, unbounded set of images from across the globe. To do so, they train models on billions of random images without any data pre-processing or prior assumptions about what the model should learn. This is a very large-scale experiment in which a RegNet architecture scaled to a dense 10 billion parameters (to avoid underfitting on a large data size) is pre-trained using the SwAV self-supervised method on a large collection of 1 billion randomly selected public images from Instagram with a diversity of gender, ethnicity, cultures, and locations (all outside the EU because of GDPR).
  • They achieve state-of-the-art results on a majority of 50 transfer tasks, including fairness, robustness to distribution shift, geographical diversity, fine-grained classification, image copy detection and many image classification datasets. The resulting model not only captures semantic information well, it also captures information about artistic style and learns salient information such as geo-locations and multilingual word embeddings based on visual content only.
  • The key takeaway is that large-scale self-supervised pre-training yields more robust, fair, less harmful, and less biased results than supervised models or models trained on object centric datasets such as ImageNet.
Block-NeRF: Scalable Large Scene Neural View Synthesis
  • This paper by Tancik et al. from UC Berkeley, Waymo and Google Research in 2022 presents Block-NeRF, a variant of Neural Radiance Fields (NeRFs) that can reconstruct large-scale environments.
  • They demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs that can be optimized independently. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.
  • At such a scale, the data collected will necessarily have transient objects and variations in appearance, which they account for by modifying the underlying NeRF architecture to make NeRF robust to data captured over months under different environmental conditions. They add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined.
  • They demonstrate the method’s efficacy by building an entire neighborhood in San Francisco from 2.8M images using a grid of Block-NeRFs, forming the largest neural scene representation to date.
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
  • Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation.
  • This paper from Bardes et al. from FAIR and NYU in ICLR 2022 introduces VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually.
  • VICReg offers a simple approach to self-supervised learning based on a triple objective: learning invariance to different views with an invariance term, avoiding collapse of the representations with a variance preservation term, and maximizing the information content of the representation with a covariance regularization term.
  • VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks, but is not subject to the same limitations as most other methods, particularly because it does not require the embedding branches to be identical or even similar. In addition, they show that incorporating their new variance term into other methods helps stabilize the training and leads to performance improvements.
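  • A hedged sketch of the three VICReg terms on a pair of embedding batches; the loss weights, hinge target, and epsilon follow the paper's defaults as I recall them, so treat them as illustrative:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Invariance (MSE) + variance hinge (per-dimension std >= 1) + covariance (off-diagonal) terms."""
    n, d = z_a.shape
    inv = F.mse_loss(z_a, z_b)                                        # invariance term

    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()     # variance preservation term

    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        return (cov.pow(2).sum() - cov.pow(2).diag().sum()) / d       # penalize off-diagonal entries only

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)                       # covariance (decorrelation) term
    return sim_w * inv + var_w * var + cov_w * cov

z_a, z_b = torch.randn(256, 128), torch.randn(256, 128)               # embeddings of two views of a batch
print(vicreg_loss(z_a, z_b))
```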
Masked Autoencoders Are Scalable Vision Learners
  • Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning. In this study, the authors observe on ImageNet and in transfer learning that an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits. Self-supervised learning in vision may thus now be embarking on a similar trajectory as in NLP.
  • This paper by He et al. from Facebook AI in 2022 shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
  • Their MAE approach is simple: they mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs.
  • First, they develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
  • Second, they note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. The word (or subword) analog for images is pixels. But decomposing the image into patches (like ViT) reduces the quadratic computation cost of transformers compared to operating at the pixel level. However, ViT and its derived models are infamous for their data appetite and/or training slowness. Instead of attempting to remove objects, they remove random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. They find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task (a minimal sketch of the masking step appears at the end of this entry). Coupling these two designs enables them to train large models efficiently and effectively: they accelerate training (by 3x or more) and improve accuracy.
  • As with any autoencoder, after pre-training the decoder is thrown away and the encoder is fine-tuned for downstream tasks.
  • Their scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model (ViTMAE) achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
  • Overall, they observe that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. They hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE.
  • HuggingFace docs
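  • A minimal sketch of the random 75% patch masking at the heart of MAE, using a shuffle-and-keep approach so the visible subset can be gathered in one shot (the encoder then only processes the roughly 25% visible tokens):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings. Returns the visible subset and the binary mask."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # one random score per patch
    ids_shuffle = noise.argsort(dim=1)                    # a random permutation of patch indices
    ids_keep = ids_shuffle[:, :n_keep]                    # keep the first ~25%
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                       # 0 = visible, 1 = masked (to be reconstructed)
    return visible, mask

tokens = torch.randn(2, 196, 768)                         # 14 x 14 patches from a 224x224 image
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))                     # torch.Size([2, 49, 768]) tensor([147., 147.])
```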

The Effects of Regularization and Data Augmentation are Class Dependent
  • Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model’s complexity. Current Deep Networks heavily rely on regularizers such as data augmentation (DA) or weight-decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters.
  • This paper by Balestriero et al. from Facebook AI in 2022 demonstrates that regularization techniques such as DA or weight decay increase the average test performance at the cost of significant performance drops on some specific classes. In other words, regularization produces a model with a reduced complexity that is unfair across classes. By focusing on maximizing aggregate performance statistics, the field has produced learning mechanisms that can be potentially harmful, especially in transfer learning tasks. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes, e.g., on ImageNet with a ResNet50, the “barn spider” classification test accuracy falls from 68% to 46% just by introducing random-crop DA during training. Even more surprising, such performance drops also appear when introducing uninformative regularization techniques such as weight decay.
  • These results demonstrate that the search for ever-increasing generalization performance – averaged over all classes and samples – has left us with models and regularizers that silently sacrifice performance on some classes. In fact, they also observe that varying the amount of regularization employed during pre-training on a specific dataset impacts the per-class performance of that pre-trained model on different downstream tasks, e.g., an ImageNet pre-trained ResNet50 deployed on iNaturalist sees its performance fall from 70% to 30% on a particular class when introducing random-crop DA during the ImageNet pre-training phase. These results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
  • Here’s an intuitive explanation:
    • Some types of data augmentation and weight decay help some categories but hurt others.
    • Categories largely identifiable by color or texture (e.g., a yellow bird, a textured mushroom) are unaffected by aggressive cropping, while categories identifiable by shape (e.g., a corkscrew) see a performance degradation with aggressive cropping that only keeps part of the object.
    • Conversely, color jitter does not affect shape- or texture-based categories (e.g., zebra), but does affect color-based categories (e.g., basketball).
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
  • Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. Moreover, many graphics problems rely on task specific data structures to exploit the sparsity or smoothness of the problem at hand.
  • This paper by Muller et al. from Nvidia in 2022 proposes Instant-NGP, which reduces this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations. Instant-NGP offers near-instant training of neural graphics primitives on a single GPU for multiple tasks.
  • To this end, a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. Multi-resolution hash encoding provides a practical learning-based alternative that automatically focuses on relevant detail, independent of task at hand. Its low overhead allows it to be used even in time-constrained settings like online training and inference.
  • For the gigapixel image task, they represent an image directly by a neural network. The SDF task learns a signed distance function in 3D space whose zero level-set represents a 2D surface. NeRF uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. Lastly, neural volume learns a denoised radiance and density field directly from a volumetric path tracer. In all tasks, their encoding and its efficient implementation provide clear benefits: instant training, high quality, and simplicity. The encoding is task-agnostic: they use the same implementation and hyperparameters across all tasks and only vary the hash table size, which trades off quality and performance (a rough sketch of the hash lookup appears at the end of this entry).
  • The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. In the context of neural network input encodings, it is a drop-in replacement, for example speeding up NeRF by several orders of magnitude and matching the performance of concurrent non-neural 3D reconstruction techniques.
  • They leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations.
  • While slow computational processes in any setting, from lightmap baking to the training of neural networks, can lead to frustrating workflows due to long iteration times, they achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080. They have demonstrated that single-GPU training times measured in seconds are within reach for many graphics applications, allowing neural approaches to be applied where previously they may have been discounted.
  • Github repo.
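  • A hedged sketch of the spatial-hash lookup at one resolution level, following my reading of the paper's XOR-of-primes hash; the trainable feature table and the trilinear interpolation of the eight surrounding corners are what the full method adds on top:

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)  # per-dimension primes as I recall them

def hash_grid_index(grid_coords, table_size=2**19):
    """Spatial hash of integer 3-D grid coordinates: XOR of coordinate * prime, modulo the table size."""
    coords = grid_coords.astype(np.uint64)
    h = coords[..., 0] * PRIMES[0]
    for i in range(1, coords.shape[-1]):
        h = np.bitwise_xor(h, coords[..., i] * PRIMES[i])
    return h % np.uint64(table_size)

# Index a trainable feature table (F features per entry) for the 8 corner coordinates of one voxel.
feature_table = np.random.randn(2**19, 2).astype(np.float32)    # T x F learnable features (placeholder)
corners = np.random.randint(0, 512, size=(8, 3))                 # 8 integer corners at one grid level
print(feature_table[hash_grid_index(corners)].shape)             # (8, 2)
```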

Text

2003

A Neural Probabilistic Language Model
  • This paper by Bengio et al. from the University of Montreal in 2003 revolutionized statistical language modeling by replacing “tables of conditional probabilities” with more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables.
  • The traditional technique of learning the joint probability function of sequences of words in a language was intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set.
  • They propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential/combinatorial number of semantically neighboring sentences, which forms the main reason for the spectacular improvements the proposed approach offers. The model learns simultaneously (i) a distributed representation for each word along with (ii) the probability function for word sequences, expressed in terms of these representations.
  • Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence.
  • They report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
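  • A minimal sketch of the model's core idea, assuming a toy vocabulary: each of the previous \(n-1\) words is mapped to a learned embedding, the embeddings are concatenated, and an MLP predicts a distribution over the next word (the paper's optional direct connections from the embeddings to the output are omitted):

```python
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    """Bengio-style LM: concat(embeddings of the n-1 context words) -> tanh hidden layer -> softmax."""
    def __init__(self, vocab_size=1000, context=3, emb_dim=60, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                  # (B, n-1) word indices
        e = self.emb(context_ids).flatten(1)         # concatenate the context embeddings
        return self.out(torch.tanh(self.hidden(e)))  # logits over the next word

lm = NeuralProbabilisticLM()
logits = lm(torch.randint(0, 1000, (4, 3)))
print(logits.shape)                                  # torch.Size([4, 1000])
```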

2010

Recurrent neural network based language model
  • This paper by Mikolov et al. from Khudanpur’s lab at JHU in Interspeech 2010 was the first to propose a recurrent neural network-based language model (RNN LM), with applications to speech recognition.
  • The results indicate that it is possible to obtain around a 50% reduction in perplexity (PPL) by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model. Speech recognition experiments show around an 18% reduction in word error rate on the Wall Street Journal task when comparing models trained on the same amount of data, around 5% on the much harder NIST RT05 task, and 12% even when the backoff model is trained on 5 times more data than the RNN model. For NIST RT05, they conclude that models trained on just 5.4M words of in-domain data can outperform big backoff models trained on hundreds of times more data.
  • They provide ample empirical evidence to suggest that connectionist language models are superior to standard n-gram techniques, with the exception of their high computational (training) cost. Recurrent neural networks significantly outperformed state-of-the-art backoff models in all of the experiments, most notably even in the case when backoff models were trained on much more data than the RNN LMs.
  • The paper seeks to break the myth that language modeling is just about counting n-grams, and that the only reasonable way to improve results is by acquiring more training data.

2013

Efficient Estimation of Word Representations in Vector Space
  • “You shall know a word by the company it keeps” — J. R. Firth.
  • This paper by Mikolov et al. from Google in 2013 proposes word2vec, which comprises two novel model architectures for computing continuous vector representations of words from very large data sets. They studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks involving word similarity, and the results are compared to the previously best performing techniques based on different types of neural networks. They observe large improvements in accuracy at much lower computational cost, i.e., it takes less than a day to learn high quality word vectors from a 1.6 billion word data set.
  • They observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.
  • Furthermore, they show that these vectors provide state-of-the-art performance on their test set for measuring syntactic and semantic word similarities.
  • Word2vec popularized the “King – Man + Woman = Queen” analogy.
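  • For a quick, hedged illustration of the analogy behavior using the gensim library (assuming gensim ≥ 4 is installed; the toy corpus below only makes the snippet runnable, and meaningful vectors require a large corpus):

```python
# Hedged sketch using gensim's word2vec (skip-gram); corpus and hyperparameters are illustrative.
from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "and", "a", "woman", "walk"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 selects skip-gram
# The well-known analogy query: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```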

2014

GloVe: Global Vectors for Word Representation
  • Word2vec relies only on local information of language. That is, the semantics learnt for a given word, is only affected by the surrounding words.
  • This paper by Pennington et al. from Stanford in EMNLP 2014 proposed Global Vectors (GloVe), an unsupervised learning algorithm which captures both global statistics and local statistics of a corpus, in order to train word vectors. Training is performed on aggregated global word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
  • Recently, considerable attention has been focused on the question of whether distributional word representations are best learned from count-based methods or from prediction-based methods. Currently, prediction-based models garner substantial support; for example, Baroni et al. (2014) argue that these models perform better across a range of tasks. They argue that the two classes of methods are not dramatically different at a fundamental level since they both probe the underlying co-occurrence statistics of the corpus, but the efficiency with which the count-based methods capture global statistics can be advantageous.
  • After Tomas Mikolov et al. released word2vec, there was a boom of papers about word vector representations. GloVe was one such proposal, which explained why such algorithms work and reformulated word2vec optimizations as a special kind of factorization for word co-occurrence matrices. Note that GloVe does not use neural networks while word2vec does.
  • They construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
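  • For reference, GloVe fits word vectors \(w_i\) and context vectors \(\tilde{w}_j\) (with biases) to the log co-occurrence counts via a weighted least-squares objective:

    \[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \]

  • Here \(X_{ij}\) is the number of times word \(j\) occurs in the context of word \(i\); the weighting function \(f\) caps the influence of very frequent co-occurrences (the paper uses \(\alpha = 3/4\) and \(x_{\max} = 100\)).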
Sequence to Sequence Learning with Neural Networks
  • This paper by Sutskever et al. from Google in 2014 introduced seq2seq encoder-decoder learning to map sequences to sequences, a task that simple Deep Neural Networks (DNNs) cannot be used to accomplish.
  • They present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Their method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. They show that a large deep LSTM with a limited vocabulary can outperform a standard statistical machine translation (SMT)-based system whose vocabulary is unlimited on a large-scale MT task. The success of their simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
  • Their main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When the LSTM was used to rerank the 1000 hypotheses produced by the aforementioned SMT system, the BLEU score increased to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice.
  • They also find that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
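  • A minimal sketch of the encoder-decoder setup (sizes are illustrative; the paper used deep 4-layer LSTMs with 1000 cells and reversed source sentences):

```python
# Minimal seq2seq sketch: an encoder LSTM compresses the source into a fixed-size state,
# and a decoder LSTM generates the target conditioned on that state. Sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))           # (h, c) summarizes the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)  # decode conditioned on that state
        return self.out(dec_out)                                 # per-position target logits

logits = Seq2Seq()(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 5)))
```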
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  • This paper by Cho et al. from Bengio’s lab in EMNLP 2014 introduced the seq2seq encoder-decoder model for neural machine translation. They propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN) that is together able to learn the mapping from a sequence of an arbitrary length to another sequence, possibly from a different set, of an arbitrary length. The encoder RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.
  • The proposed RNN Encoder–Decoder is able to either score a pair of sequences (in terms of a conditional probability) or generate a target sequence given a source sequence.
  • The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
  • Along with the new architecture, they propose a novel hidden unit that includes a reset gate and an update gate that adaptively control how much each hidden unit remembers or forgets while reading/generating a sequence. This unit later became known as the Gated Recurrent Unit (GRU); its gating equations are sketched after this entry.
  • They evaluated the proposed model on the task of statistical machine translation, where they used the RNN Encoder–Decoder to score each phrase pair in the phrase table. Qualitatively, they were able to show that the new model captures linguistic regularities in the phrase pairs well and that the RNN Encoder–Decoder is able to propose well-formed target phrases.
  • The scores from the RNN Encoder–Decoder were found to improve the overall translation performance in terms of BLEU scores. They also found that the contribution of the RNN Encoder–Decoder is rather orthogonal to the existing approach of using neural networks in the SMT system, so that performance can be improved further by using, for instance, the RNN Encoder–Decoder and a neural net language model together.
  • Qualitative analysis shows that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases at multiple levels, i.e., at the word level as well as the phrase level. This suggests that there may be more natural-language-related applications that could benefit from the proposed RNN Encoder–Decoder.
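  • The proposed gated hidden unit updates as follows (omitting bias terms, up to notational conventions):

    \[ r_t = \sigma(W_r x_t + U_r h_{t-1}), \quad z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad \tilde{h}_t = \tanh\big(W x_t + U (r_t \odot h_{t-1})\big), \quad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \]

  • The reset gate \(r_t\) decides how much of the previous state enters the candidate \(\tilde{h}_t\), and the update gate \(z_t\) interpolates between keeping the old state and adopting the candidate.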

2015

Neural Machine Translation by Jointly Learning to Align and Translate
  • This paper by Bahdanau et al. from Bengio’s lab in ICLR 2015 introduced the attention mechanism (borrowed from the field of information retrieval) within the context of NLP (commonly called Bahdanau attention or additive attention in the field).
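  • A minimal sketch of additive (Bahdanau-style) attention, where each source annotation is scored against the previous decoder state and the resulting weights form a context vector (dimensions are illustrative):

```python
# Minimal sketch of additive attention: score = v^T tanh(W s + U h), softmax over source positions.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim=512, enc_dim=512, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)   # projects the decoder state
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)   # projects each encoder annotation
        self.v = nn.Linear(attn_dim, 1, bias=False)         # scalar score per source position

    def forward(self, dec_state, enc_outputs):              # (B, dec_dim), (B, T, enc_dim)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) + self.U(enc_outputs)))
        alpha = torch.softmax(scores, dim=1)                 # alignment weights over the source
        context = (alpha * enc_outputs).sum(dim=1)           # (B, enc_dim) context vector
        return context, alpha.squeeze(-1)

ctx, weights = AdditiveAttention()(torch.randn(2, 512), torch.randn(2, 7, 512))
```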
Effective Approaches to Attention-based Neural Machine Translation
  • Neural Machine Translation by Jointly Learning to Align and Translate proposed an attention mechanism to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
  • This paper by Luong et al. in EMNLP 2015 from Manning’s group at Stanford explores useful architectures for attention-based NMT. It examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time (the alignment scoring functions they compare are summarized after this entry).
  • They demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, they achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout.
  • Their ensemble model using different attention architectures has established a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
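  • For reference, the alignment scoring functions compared in the paper, for the current target hidden state \(h_t\) and a source hidden state \(\bar{h}_s\), are:

    \[ \text{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{(dot)} \\ h_t^\top W_a \bar{h}_s & \text{(general)} \\ v_a^\top \tanh\big(W_a [h_t; \bar{h}_s]\big) & \text{(concat)} \end{cases} \]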

2016

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
  • Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT’s use in practical deployments and services, where both accuracy and speed are essential.
  • This paper by Wu et al. from Google in 2016 presents GNMT, Google’s Neural Machine Translation system, which attempts to address many of these issues. Their model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections.
  • To improve parallelism and therefore decrease training time, their attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, they employ low-precision arithmetic during inference computations.
  • To improve handling of rare words, they divide words into a limited set of common sub-word units (“wordpieces”) for both input and output (a toy segmentation sketch appears after this entry). This method provides a good balance between the flexibility of “character”-delimited models and the efficiency of “word”-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system.
  • Their beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence.
  • On the WMT’14 English-to-French and English-to-German benchmarks, GNMT achieves results competitive with the state of the art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google’s phrase-based production system.
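  • The sketch below is an illustrative greedy sub-word segmenter in the spirit of wordpieces; it is a simplification, not GNMT's exact wordpiece model (which is learned from data), and the "##" continuation marker and toy vocabulary are assumptions made for the example.

```python
# Minimal sketch of greedy longest-match sub-word segmentation (illustrative only).
def wordpiece_segment(word, vocab, unk="<unk>"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                      # no piece matched: emit an unknown marker
            return [unk]
        start = end
    return pieces

vocab = {"un", "##believ", "##able", "believ", "##e"}
print(wordpiece_segment("unbelievable", vocab))   # ['un', '##believ', '##able']
```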

2017

Attention Is All You Need
  • This paper by Vaswani et al. from Google in NeurIPS 2017 introduced Transformers (that are based on scaled dot-product multi-headed attention) which are prevalent in most NLP and CV areas today.
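  • The core operation is scaled dot-product attention, \(\text{Attention}(Q, K, V) = \text{softmax}\big(QK^\top / \sqrt{d_k}\big)V\); a minimal sketch follows (shapes are illustrative).

```python
# Minimal sketch of scaled dot-product attention.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # e.g., a causal mask for decoding
    weights = torch.softmax(scores, dim=-1)                    # attention distribution
    return weights @ v                                         # weighted sum of the values

out = scaled_dot_product_attention(torch.randn(2, 8, 10, 64),
                                   torch.randn(2, 8, 10, 64),
                                   torch.randn(2, 8, 10, 64))
```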

2018

Deep contextualized word representations
  • This paper by Peters et al. from Allen AI and UWash in NAACL 2018 introduced context-sensitive word embeddings using the LSTM-based Embeddings from Language Models (ELMo) architecture.
Improving Language Understanding by Generative Pre-Training
  • Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately.
  • This paper by Radford et al. from OpenAI in 2018 introduces a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning, and demonstrates that large gains on the aforementioned NLU tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • In contrast to previous approaches, they make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
  • By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets and thus outperforming discriminatively trained models that use architectures specifically crafted for each task. For instance, they achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
  • Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Their work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach.

2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • This paper by Devlin et al. from Google in ACL 2019 proposed BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language representation model which proposed pre-training bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is pre-trained using two unsupervised tasks: (i) masked language modeling (MLM) and, (ii) next sentence prediction (NSP).
    • MLM is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM (a minimal sketch of the masking procedure appears after this entry).
    • NSP is needed because many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, they pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.
  • Fine-tuning for the task at hand involves using an additional output layer, to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
  • BERT comes in two flavors: (i) BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; (ii) BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
  • BERT consumes a max of 512 input tokens. At its output, word embeddings for BERT (what is called BERT-base) have 768 dimensions.
  • BERT obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
  • BERT demonstrated that unsupervised pretraining is an integral part of many language understanding systems and enables even low-resource tasks to benefit from them.
  • Google Blog’s article that discusses using BERT for improving search relevance and ranking.
  • Also, Fabio Chiusano has put together a brief timeline of NLP models, from Bag of Words to the Transformer family.
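  • As a concrete illustration of the MLM corruption mentioned above: 15% of token positions are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A minimal sketch:

```python
# Minimal sketch of BERT-style masked language modeling corruption.
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

corrupted, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"],
                                vocab=["the", "cat", "sat", "on", "mat", "dog"])
```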

RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, while hyperparameter choices have significant impact on the final results.
  • This paper by Liu et al. from University of Washington and Facebook AI in 2019 carefully evaluates a number of design decisions when pretraining BERT models.
  • They present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. They find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. They find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.
  • Their improved pretraining procedure, which they call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results highlight the importance of previously overlooked design choices, and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
  • Note that RoBERTa uses only the masked language model objective (and does not train using the next sentence prediction objective), and achieves better results than the original BERT.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
  • This paper by Lewis et al. from Facebook AI in 2019 presented BART, a denoising autoencoder for pretraining sequence-to-sequence models that learns to map corrupted documents to the original. BART is trained by corrupting text with an arbitrary noising function, and learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.
  • They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token (a minimal sketch of this noising scheme appears after this entry).
  • Background: With BERT, random tokens are replaced with masks, and the document is encoded bidirectionally. Missing tokens are predicted independently, so BERT cannot easily be used for generation.

  • With GPT, tokens are predicted auto-regressively (generation of a new token is conditioned on the prior tokens), meaning GPT can be used for generation. However words can only condition on leftward context, so it cannot learn bidirectional interactions.

  • BART applies noising schemes to an input document and thus corrupts it by replacing spans of text with mask symbols. The corrupted document is encoded with a bidirectional model, and then the likelihood of the original document is calculated with an autoregressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and representations from the final hidden state of the decoder are used. The advantage of this scheme is that inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations.

  • BART is particularly effective when finetuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
  • BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining.
  • BART achieves similar performance to RoBERTa on discriminative tasks, while achieving new state-of-the-art results on a number of text generation tasks.
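  • A minimal sketch of the text-infilling noising scheme mentioned above (span lengths are drawn from a Poisson distribution with \(\lambda = 3\) in the paper; the masking ratio and the greedy span placement below are illustrative simplifications):

```python
# Minimal sketch of BART-style text infilling: whole spans collapse to a single mask token,
# and zero-length spans insert a mask. Uses numpy only for the Poisson sampler.
import numpy as np

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, lam=3, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)            # roughly how many tokens to corrupt
    while budget > 0 and len(tokens) > 1:
        span = int(rng.poisson(lam))
        start = int(rng.integers(0, len(tokens)))
        span = min(span, len(tokens) - start)
        tokens[start:start + span] = [mask_token]      # replace the span with one mask token
        budget -= max(span, 1)
    return tokens

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```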
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  • As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging.
  • This paper by Sanh et al. from Huggingface, presented at the Energy Efficient Machine Learning and Cognitive Computing workshop at NeurIPS 2019, introduced a language representation model, DistilBERT, which is a general-purpose pre-trained version of BERT. DistilBERT is 40% smaller, 60% faster, cheaper to pre-train, and retains 97% of BERT’s language understanding capabilities. It can be fine-tuned with good performance on a wide range of tasks much like its larger counterparts.
  • While most prior work investigated the use of distillation for building task-specific models, they leverage knowledge distillation during the pre-training phase and show that DistilBERT is a compelling option for edge applications.
  • To leverage the inductive biases learned by larger models during pretraining, they introduce a triple loss combining language modeling, distillation, and cosine-distance losses (a minimal sketch follows this entry).
  • The paper also charts the parameter counts of several recently released pretrained language models to put DistilBERT’s size in context.
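  • A minimal sketch of the triple loss mentioned above: soft-target distillation (KL divergence with a temperature), the usual masked-LM cross-entropy, and a cosine term aligning student and teacher hidden states. Shapes, the temperature, and the equal weighting are illustrative, not the paper's exact settings.

```python
# Minimal sketch of a DistilBERT-style triple loss (illustrative weights and temperature).
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels, student_hidden, teacher_hidden, T=2.0):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T                    # distillation on soft targets
    mlm = F.cross_entropy(student_logits, labels)                      # masked language modeling
    cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()  # hidden-state alignment
    return soft + mlm + cos

loss = triple_loss(torch.randn(8, 30522), torch.randn(8, 30522),
                   torch.randint(0, 30522, (8,)),
                   torch.randn(8, 768), torch.randn(8, 768))
```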

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  • Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
  • This paper by Dai et al. from CMU and Google Brain in 2019 proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence.
  • Transformer-XL consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
  • Transformer-XL not only enables capturing longer-term dependencies than RNNs and vanilla Transformers and resolves the context fragmentation problem, but also achieves substantial speedups during evaluation. As a result, Transformer-XL learns dependencies that are 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
  • They improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.
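  • A minimal sketch of the segment-level recurrence idea: hidden states from the previous segment are cached with gradients stopped and prepended as extra context for the current segment (tensor shapes and the memory length below are illustrative; the real model also uses relative positional encodings).

```python
# Minimal sketch of Transformer-XL-style segment-level recurrence (per layer).
import torch

def concat_memory(prev_hidden, curr_hidden, mem_len=128):
    # prev_hidden, curr_hidden: (seq_len, batch, d_model)
    memory = prev_hidden.detach()                       # stop gradients into the previous segment
    extended = torch.cat([memory, curr_hidden], dim=0)  # keys/values attend over memory + current
    new_memory = extended[-mem_len:]                    # keep the most recent mem_len states
    return extended, new_memory

prev = torch.randn(128, 2, 512)
curr = torch.randn(128, 2, 512, requires_grad=True)
extended, new_mem = concat_memory(prev, curr)
```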
XLNet: Generalized Autoregressive Pretraining for Language Understanding
  • With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
  • This paper by Yang et al. from CMU and Google in 2019 proposes XLNet considering BERT’s aforementioned pros and cons, and offers a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order (thereby proposing a new objective called Permutation Language Modeling), and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Put simply, XLNet is a generalized autoregressive pretraining method that uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoder methods.
  • Furthermore, the neural architecture of XLNet is developed to work seamlessly with the autoregressive objective, including integrating ideas from Transformer-XL, the state-of-the-art autoregressive model and the careful design of the two-stream attention mechanism. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
  • Github repo.
Adaptive Input Representations for Neural Language Modeling
  • This paper by Baevski and Auli from Facebook AI in 2019 introduces adaptive input representations by varying the size of input word embeddings for neural language modeling. Adaptive input embeddings can improve accuracy while drastically reducing the number of model parameters.
  • There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units.
  • They perform a systematic comparison of popular choices for a self-attentional architecture.
  • Their experiments show that models equipped with adaptive embeddings are more than twice as fast to train as the popular character-input CNN while having fewer parameters.
  • On the WIKITEXT-103 benchmark, they achieve a perplexity of 18.7, an improvement of 10.5 over the previously best published result, and on the BILLION WORD benchmark, they achieve a perplexity of 23.02.
Attention Interpretability Across NLP Tasks
  • This paper by Vashishth et al. from IISc and Google in 2019 seeks to empirically validate the hypothesis that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction.
  • Some works (Jain & Wallace, 2019; Vig & Belinkov, 2019) have demonstrated that attention weights are not interpretable, and altering them does not affect the model output, while several others have shown that attention captures several linguistic notions in the model. They extend the analysis of prior works to diverse NLP tasks and demonstrate that attention weights are interpretable and are correlated with feature importance measures. However, this holds only for cases when attention weights are essential for the model’s prediction and cannot simply be reduced to a gating unit. The paper takes a balanced approach – rather than a black-and-white one – drawing on previous literature that raised issues with the claim that “attentions are indicative of model predictions” and showing when attention is interpretable and when it is not.
  • The attention layer in a neural network model provides insights into the reasoning behind the model’s prediction, a process that such models are usually criticized for keeping opaque. Recently, seemingly contradictory viewpoints have emerged about the interpretability of attention weights. Amid such confusion arises the need to understand the attention mechanism more systematically. The paper attempts to fill this gap with a comprehensive explanation that justifies both kinds of observations (i.e., when attention is interpretable and when it is not). Through a series of experiments on diverse NLP tasks, they validate their observations and reinforce the claim of interpretability of attention through manual evaluation.
  • They find that in both single- and pair-sequence tasks, the attention weights in samples with original weights do make sense in general. However, in the former case, the attention mechanism learns to give higher weights to tokens relevant to both kinds of sentiment. They show that attention weights in single-sequence tasks do not provide a reason for the prediction, whereas in pairwise tasks attention does reflect the reasoning behind the model output.
  • Unrelated to the paper: To use attention visualization as a proxy for interpreting your predictions, use the BertViz library. The lib supports multiple views and supports a plethora of models (BERT, GPT-2, XLNet, RoBERTa, XLM, ALBERT, DistilBERT, BART etc.). The BertViz repo has some nice examples to get started.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • This paper by Selvaraju et al. from Parikh/Batra’s team at GATech in 2019 proposes a technique for producing ‘visual explanations’ for decisions from a large class of CNN-based models, making them more transparent and explainable.
  • Their approach – Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
  • Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training.
  • They combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
  • In the context of image classification models, their visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias.
  • For image captioning and VQA, their visualizations show that even non-attention based models learn to localize discriminative regions of input image.
  • They devise a way to identify important neurons through GradCAM and combine it with neuron names to provide textual explanations for model decisions.
  • Finally, they design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions.
  • Github repo; CloudCV demo.
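  • A minimal sketch of the Grad-CAM recipe: back-propagate the class score to the last convolutional feature maps, global-average-pool those gradients into per-channel weights, take a weighted sum of the feature maps, and apply a ReLU. The choice of VGG16 and of its last feature block is illustrative; in practice you would load pretrained weights and a real preprocessed image.

```python
# Minimal Grad-CAM sketch (random weights and a random image keep it self-contained;
# use a pretrained model and a real image in practice).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(weights=None).eval()
activations, gradients = {}, {}

layer = model.features[-1]                                   # last block of the conv feature extractor
layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

img = torch.randn(1, 3, 224, 224)                            # stand-in for a preprocessed image
score = model(img)[0].max()                                  # score of the top predicted class
score.backward()

weights = gradients["g"].mean(dim=(2, 3), keepdim=True)      # global-average-pool the gradients
cam = F.relu((weights * activations["a"]).sum(dim=1))        # weighted sum of feature maps + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=img.shape[-2:], mode="bilinear")  # upsample to image size
```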

2020

Language Models are Few-Shot Learners
  • Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.
  • This paper by Brown et al. from OpenAI in 2020 introduces Generative Pretrained Transformer (GPT)-3 and shows that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.
  • Specifically, they train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
  • GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
  • At the same time, they also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, they find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans.
  • They also present broader societal impacts of their findings and of GPT-3 in general.
Longformer: The Long-Document Transformer
  • Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
  • This paper by Beltagy et al. from Allen AI in 2020 seeks to address this limitation, by introducing the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
  • Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention.
  • Following prior work on long-sequence transformers, they evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8.
  • In contrast to most prior work, they also pretrain Longformer and finetune it on a variety of downstream tasks.
  • Their pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. They finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
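  • A minimal sketch of the attention pattern: each token attends within a local sliding window, and a few designated global tokens attend to (and are attended by) every position. The window size and global positions below are illustrative, and a dense boolean mask is only for exposition; the actual implementation uses custom banded kernels to get linear scaling.

```python
# Minimal sketch of a Longformer-style attention mask (True = attention allowed).
import torch

def longformer_mask(seq_len, window=2, global_positions=(0,)):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                  # local sliding-window attention
    for g in global_positions:
        mask[g, :] = True                      # a global token attends everywhere
        mask[:, g] = True                      # and every token attends to it
    return mask

print(longformer_mask(8).int())
```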
Big Bird: Transformers for Longer Sequences
  • The primary limitation of Transformer-based models is the quadratic complexity (mainly in terms of memory, but also computation) on the sequence length due to their full attention mechanism. BigBird by Zaheer et al. from Google, published in NeurIPS 2020, remedied this by proposing a sparse attention mechanism that reduces this quadratic complexity to linear.
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Although measuring held-out test-set accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Further, ML systems can run to completion without throwing any errors (indicating functional correctness) but can still produce incorrect outputs (indicating behavioral issues). Thus, it is important to test the behavioral aspects of your model to make sure it works as you expected.
  • This paper by Ribeiro et al. from Microsoft, UW and UCI in 2020 introduces CheckList, a model-agnostic and task-agnostic methodology for testing NLP models inspired by principles of behavioral testing in software engineering. CheckList tests individual capabilities of the model using three different test types.
  • Checklist includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. They illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models.
  • Tests created with CheckList can be applied to any model, making it easy to incorporate in current benchmarks or evaluation pipelines. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model that has “solved” existing benchmarks on three different tasks. They incorporated three distinct types of tests:
    • Minimum Functionality Test (MFT): A Minimum Functionality Test (MFT) uses simple examples to make sure the model can perform a specific task well. For example, we might want to test the performance of a sentiment model when dealing with negations.
    • Invariance Test: Besides testing the functionality of a model, we might also want to test if the model prediction stays the same when trivial parts of inputs are slightly perturbed. These tests are called Invariance Tests (IV).
    • Directional Expectation Test: In the Invariance Test, we expect the outputs after the perturbation to be the same. However, sometimes we might expect the output after perturbation to change. That is when Directional Expectation Tests come in handy.
  • In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
  • Github repo.
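  • A minimal sketch of an invariance test written without the checklist library (predict_sentiment is a hypothetical stand-in for the model under test): perturbing a person's name should not change a sentiment prediction.

```python
# Minimal CheckList-style invariance (INV) test; the "model" here is a toy stand-in.
def predict_sentiment(text):
    return "negative" if "terrible" in text else "positive"

def invariance_test(template, names):
    predictions = [predict_sentiment(template.format(name=name)) for name in names]
    passed = len(set(predictions)) == 1            # prediction should not depend on the name
    return passed, predictions

passed, preds = invariance_test("The flight with {name} was terrible.",
                                ["Anna", "Rahul", "Mei", "Kwame"])
print("INV test passed:", passed, preds)
```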
The Curious Case of Neural Text Degeneration
  • Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive.
  • This paper by Holtzman et al. from Choi’s lab at UWash in ICLR 2020 provided a deep analysis into the properties of the most common decoding methods for open-ended language generation. It reveals surprising distributional differences between human text and machine text.
  • In addition, they find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. They show that likelihood-maximizing decoding causes repetition and overly generic language usage, while sampling methods without truncation risk sampling from the low-confidence tail of a model’s predicted distribution. Their findings motivate Nucleus (or top-p) Sampling, a simple but effective method that captures the region of confidence of language models effectively to draw the best out of neural generation.
  • By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
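  • A minimal sketch of nucleus (top-p) sampling: sample only from the smallest set of tokens whose cumulative probability exceeds \(p\), renormalized (the vocabulary size and the value of \(p\) below are illustrative).

```python
# Minimal sketch of nucleus (top-p) sampling over a vector of next-token logits.
import torch

def nucleus_sample(logits, p=0.9):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int((torch.cumsum(sorted_probs, dim=-1) < p).sum()) + 1   # smallest set exceeding p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()      # renormalize the nucleus
    return sorted_idx[torch.multinomial(nucleus, 1)]                   # sample a token id

next_token = nucleus_sample(torch.randn(50))                           # toy vocabulary of 50 tokens
```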

2021

Towards Zero-Label Language Learning
  • This paper by Wang et al. from Google in 2021 explores “zero-label” learning in NLP, whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. They show that language models (LMs) are also few-shot generators or example creators (rather than just few-shot learners as in the GPT-3 paper) in that they can be used to generate high-quality synthetic data in a fully unsupervised manner. In other words, they propose that labelled-data generation is easy with prompting, that LMs are great few-shot data generators, and that classic fine-tuning greatly outperforms zero/few-shot prompting.
  • At the core of their framework is a novel approach for better leveraging powerful pretrained LMs. Specifically, inspired by the recent success of few-shot inference with GPT-3, they present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations.
  • Their method enables zero-label learning as they train task-specific models solely on the synthetic data, yet they achieve better or comparable results relative to strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, their approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
  • The paper illustrates a promising direction for future transfer learning research in NLP.
  • Key takeaways:
    • Old idea (from OpenAI’s GPT3 paper):
      • Treat LMs as few-shot learners.
      • Create prompts with <sample, label> pair(s).
      • Ask the model to infer the label for a new sample.
      • The emphasis is on the inference.
    • New idea (from Google’s zero-label paper):
      • Treat LMs as few-shot generators (rather than few-shot learners).
      • Create prompts with <sample, label> pair(s).
      • Ask the model to generate more for the same label.
      • The emphasis is on the labelled data generation (rather than inference).
    • Learnings:
      • Old idea created a new wave of prompt programming, i.e. no need for conventional task specific fine-tuning.
      • However, prompting can solve only lower-order tasks, e.g., classification or NLI. Even with lower-order tasks it is not practical because you cannot build a human-in-the-loop system to continually improve the model.
      • The new idea is about generating more data and going with conventional route.
      • This paper confirms all of the above by introducing UDG using LMs, even for complex higher-order tasks, and empirically shows that classical fine-tuning with more data works better.
  • A diagram by Prithvi Da summarizes the proposed approach.

Improving Language Models by Retrieving from Trillions of Tokens
  • This paper from Borgeaud et al. from DeepMind in 2021 proposes Retrieval-Enhanced Transformer (RETRO) which enhances auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. With a 2 trillion token database, RETRO obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters.
  • After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
  • On Wikitext103 and the Pile, RETRO outperforms previous models trained on large scale datasets. They also show that RETRO is competitive on retrieval-intensive downstream tasks such as question answering.
  • RETRO models are flexible and can be used without retrieval at evaluation and still achieve comparable performance to baseline models. Conversely, baseline pre-trained transformer models can be rapidly fine-tuned (“RETROfit with retrieval”) to obtain nearly the same performance as if trained from scratch.
  • They demonstrate at an unprecedented scale that improving semi-parametric language models through explicit memory can provide an approach orthogonal to, and more efficient than, raw parameter scaling as we seek to build more powerful language models.
  • Related: The Illustrated Retrieval Transformer by Jay Alammar.
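  • A toy sketch of the retrieval step: split a corpus into fixed-size chunks, embed them, and fetch the nearest-neighbour chunks for each query chunk. The hash-based "embedding" below is a stand-in for RETRO's frozen BERT retriever, and the chunk size and similarity search are illustrative simplifications (the real system uses an approximate nearest-neighbour index over trillions of tokens).

```python
# Toy sketch of chunked retrieval with a stand-in embedding function.
import numpy as np

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))   # deterministic toy embedding
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def build_index(corpus_tokens, chunk_size=64):
    chunks = [" ".join(corpus_tokens[i:i + chunk_size])
              for i in range(0, len(corpus_tokens), chunk_size)]
    return chunks, np.stack([embed(c) for c in chunks])

def retrieve(query_chunk, chunks, chunk_embs, k=2):
    scores = chunk_embs @ embed(query_chunk)                  # cosine similarity (unit vectors)
    return [chunks[i] for i in np.argsort(-scores)[:k]]       # top-k neighbouring chunks

chunks, embs = build_index(("the quick brown fox " * 200).split())
print(retrieve("quick brown fox jumps", chunks, embs, k=1))
```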
WebGPT: Browser-assisted question-answering with human feedback
  • This paper by Nakano et al. from OpenAI in 2021 proposes WebGPT, which is a fine-tuned version of GPT-3 to more accurately answer open-ended questions using a text-based web browser. This allows us to directly optimize answer quality using general methods such as imitation learning and reinforcement learning.
  • Their prototype copies how humans research answers to questions online: it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy.
  • By setting up the task so that it can be performed by humans, they are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
  • They train and evaluate their models on ELI5, a dataset of questions asked by Reddit users. Their best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of their human demonstrators, and 69% of the time to the highest-voted answer from Reddit. While their best model outperforms humans on ELI5, it still struggles with out-of-distribution questions.

2022

Formal Mathematics Statement Curriculum Learning
  • This paper by Polu et al. from OpenAI in 2022 proposes a neural theorem prover using GPT-f that can successfully solve a curriculum of increasingly difficult problems out of a set of formal statements of sufficiently varied difficulty, including many high-school Math Olympiad problems. The prover uses a language model to find proofs of formal statements.
  • They explore the use of expert iteration in the context of language modeling applied to formal mathematics. They show that at the same compute budget, expert iteration, by which they mean proof search interleaved with learning, dramatically outperforms proof search only. They also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.
  • Finally, by applying this expert iteration to a manually curated set of problem statements, they achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
  • Their results suggest that the lack of self-play in the formal mathematics setup can be effectively compensated for by automatically as well as manually curated sets of formal statements, which are much cheaper to formalize than full proofs. The statement curriculum learning methodology presented in this work can help accelerate progress in automated reasoning, especially if scaled with automated generation and curation of formal statements in the future.
  • OpenAI link.
Survey of Hallucination in Natural Language Generation
  • While natural language generation (NLG) has improved exponentially in recent years thanks to the development of deep learning technologies such as Transformer-based language models, large language model (LLM)-based NLG often produces false statements that are disconnected from reality because such models are not grounded in reality. Such generation includes hallucinated text, which makes the performance of text generation fail to meet users’ expectations in many real-world scenarios, owing to the lack of commonsense built from experiencing the real world.
  • This paper by Ji et al. from Pascale Fung’s group at Hong Kong University of Science and Technology in 2022 reviews studies in evaluation and mitigation methods of hallucinations that have been presented in various tasks.
  • They provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation.
Transformer Quality in Linear Time
  • This paper by Hua et al. from Cornell University and Google Brain in 2022 revisits the design choices in Transformers and proposes methods to address their weaknesses in handling long sequences, presenting FLASH, a novel efficient modification of the Transformer architecture. This is achieved by designing a performant layer (gated linear unit) and by combining it with an accelerator-efficient approximation strategy (mixed chunk attention).
  • Existing efficient attention methods often cause significant quality drop compared to full self-attention. At the same time they might be difficult to implement to fully leverage hardware accelerators. The authors introduce GAU (gated attention unit; a generalization of GLU - gated linear unit) that allows for better and more efficient approximation of multi-head attention than many other efficient attention methods by using a weaker single-head attention with minimal quality loss.
  • Next, complementary to this new layer, they propose mixed chunk attention - an efficient linear approximation method that combines the benefits from partial and linear attention mechanisms, which is accelerator-friendly and highly competitive in quality. The method works on chunks of tokens and leverages local (within chunk) and global (between chunks) attention spans.
  • The resulting model, named FLASH, when deployed on bidirectional and auto-regressive language modeling tasks, outperforms three baselines: vanilla Transformer, Performer and Combiner in terms of quality and efficiency. FLASH matches the quality (perplexity) of fully-augmented Transformers over both short (512) and long (8K) context lengths, while being substantially faster to train than the state-of-the-art - achieving training speedups of up to 4.9x on Wiki-40B and 12.1x on PG-19 for auto-regressive language modeling, and 4.8x on C4 for masked language modeling. The differences are particularly pronounced for larger context sizes (4096-8192).

DeepNet: Scaling Transformers to 1,000 Layers

  • This paper by Wang et al. from Microsoft Research in 2022 introduces DeepNet, a new method that allows training extremely deep Transformers with 1,000+ layers – an order of magnitude deeper than existing efforts – with theoretical justification.
  • DeepNet is fundamental, effective, and simple. It can be used in any Transformer architecture (encoder, decoder, encoder-decoder), which covers almost all tasks across AI areas (language, vision, speech, multimodal, and beyond). It is not only for 1,000+ layer Transformers, but is also important and effective for training existing large models (e.g., 24 to 100 layers). It combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making it a preferred alternative for training any Transformer model.
  • At the core of DeepNet is a newly proposed normalization function (called DeepNorm) which modifies the residual connection in Transformers. DeepNorm comes with a theoretical justification of bounding the model update by a constant, which makes stable training possible in a principled way. Only a few lines of code need to change to make it work in an existing Transformer implementation.
  • DeepNorm modifies the residual connection in the Transformer architecture by up-scaling it before performing layer normalization. It works alongside a dedicated initialization scheme based on Xavier initialization.
  • These two tricks lead to greater stability during the training which allows the authors to scale their modified Transformer architecture (DeepNet) up to 1000 layers.
  • DeepNet’s 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points in a multilingual machine translation task with 7,482 translation directions.
  • Github repo.
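  • A minimal sketch of a DeepNorm residual block, \(x_{l+1} = \mathrm{LN}(\alpha \cdot x_l + G_l(x_l))\); the paper derives the constant \(\alpha\) (and an initialization scale \(\beta\)) from the number of layers per architecture, so the value used below is illustrative only.

```python
# Minimal sketch of a DeepNorm residual block: up-scale the residual stream by alpha
# before layer normalization. alpha here is illustrative; see the paper for its derivation.
import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    def __init__(self, d_model, sublayer, alpha=1.0):
        super().__init__()
        self.sublayer = sublayer                  # e.g., self-attention or a feed-forward block
        self.norm = nn.LayerNorm(d_model)
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormBlock(512, nn.Linear(512, 512), alpha=2.0)
y = block(torch.randn(2, 10, 512))
```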
Chain of Thought Prompting Elicits Reasoning in Large Language Models
  • Although scaling up language model size has reliably improved performance on a range of NLP tasks, even the largest models currently struggle with certain reasoning tasks such as arithmetic reasoning, math word problems, symbolic manipulation, and commonsense reasoning.
  • This paper by Wei et al. from Google in 2022 explores the ability of language models to generate a coherent chain of thought – a series of short sentences that mimic the reasoning process a person might have when responding to a question.
  • They have explored chain of thought prompting as a simple and broadly applicable method for enhancing reasoning in language models. Through experiments on arithmetic, symbolic, and commonsense reasoning, they find that chain of thought processing is an emergent property of model scale that can be induced via prompting and can enable sufficiently large language models to better perform reasoning tasks that otherwise have flat scaling curves.
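  • An illustrative chain-of-thought prompt in the style of the paper's exemplars: the few-shot demonstration includes the worked reasoning, so the model is nudged to emit its own reasoning before the final answer (language_model.generate is a hypothetical stand-in for an LLM API call).

```python
# Illustrative chain-of-thought prompt; the demonstration includes intermediate reasoning steps.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3
tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many
apples do they have?
A:"""
# completion = language_model.generate(prompt)   # hypothetical call to a large language model
```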
PaLM: Scaling Language Modeling with Pathways
  • This paper by Chowdhery et al. from Google in 2022 introduces Pathways Language Model (PaLM), a single 540 billion parameter dense Transformer language model, trained on 780B tokens of high-quality, diverse text, that generalizes across domains and tasks while being highly efficient. PaLM pushes the boundaries of scale for few-shot language understanding and generation.
  • Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application.
  • To further their understanding of the impact of scale on few-shot learning, they trained a 540-billion parameter, densely activated Transformer language model, which they call the Pathways Language Model (PaLM). They trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. They demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
  • On a number of these tasks, PaLM 540B achieves breakthrough few-shot performance on language, reasoning, and code tasks, achieving state-of-the-art results on 28 out of the 29 most widely evaluated English NLP tasks when compared to the best finetuned per-task result from any previous large language model. Their evaluation suite consists of multi-step reasoning tasks, and comparisons to average human performance on the recently released BIG-bench benchmark.
  • Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference. Their few-shot results match or exceed the finetuned state of the art across a number of different arithmetic and commonsense reasoning tasks. The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting, where the model is explicitly prompted to generate a natural language logical inference chain before making its prediction. They present a number of intriguing examples where PaLM was able to write explicit logical inference chains to both explain jokes and answer complex questions about scenarios. On BIG-bench, a recently developed benchmark containing 150+ challenging new language tasks, PaLM 5-shot achieves higher performance than the average performance score of humans who were asked to complete the same tasks. Additional state-of-the-art performance is demonstrated on source code understanding/generation, multilingual NLP, and machine translation.
  • From these results, they draw a number of conclusions.
    • First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued. When they compare results from PaLM 540B to their own identically trained 62B and 8B model variants, improvements are typically log-linear. This alone suggests that the apex of the scaling curve has not yet been reached. However, a number of BIG-bench tasks showed discontinuous improvements from model scale: the improvements from 8B to 62B are very modest, but then increase steeply when scaling to 540B. This suggests that certain capabilities of language models only emerge when trained at sufficient scale, and there are additional capabilities that could emerge from future generations of models.
    • Second, the breakthrough performance on reasoning tasks has critical implications. It is obvious that a model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction. However, these results go far beyond that, demonstrating that prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves. In other words, the model’s generation (rather than just understanding) capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
  • Finally, although they achieved their goal of further pushing the boundaries of scale for few-shot language modeling, there are still many open questions about the ideal network architecture and training scheme for future generations of models. PaLM is only the first step in their vision towards establishing Pathways as the future of ML scaling at Google and beyond. To that end, they chose to demonstrate this scaling capability on a well-studied, well-established recipe: a dense, decoder-only, full-attention Transformer model, which is trained to perform autoregressive language modeling. However, their wider goal is to explore a diverse array of novel architectural choices and training schemes, and combine the most promising systems with the extreme scaling capabilities of Pathways.
  • They believe that PaLM lays a strong foundation for their ultimate goal of developing a large-scale, modularized system that will have broad generalization capabilities across multiple modalities.
  • They additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale.
  • Finally, they discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
  • Google AI blog.
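  • As a quick illustration of chain-of-thought prompting, a few-shot prompt simply includes worked reasoning before each final answer; the exemplar below is adapted from the chain-of-thought literature and is illustrative rather than taken from the PaLM paper.
```python
# Illustrative 1-shot chain-of-thought prompt: the exemplar shows its reasoning before the
# answer, nudging the model to generate an explicit inference chain before its prediction.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model imitates the step-by-step format."""
    return EXEMPLAR + f"Q: {question}\nA:"

print(build_cot_prompt("A juggler can juggle 16 balls. Half of the balls are golf balls. "
                       "How many golf balls are there?"))
```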

Speech

2006

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
  • Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited.
  • This paper by Graves et al. from Schmidhuber’s lab presents a novel method for temporal classification with RNNs to label unsegmented sequences directly, thereby solving both aforementioned problems. Their method fits naturally into the existing framework of neural network classifiers, and is derived from the same probabilistic principles. It obviates the need for pre-segmented data, and allows the network to be trained directly for sequence labelling.
  • An experiment on a real-world temporal classification problem with the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN without requiring any task-specific knowledge.
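  • As a quick illustration (not from the paper, which predates modern toolkits), the CTC objective is available off the shelf in PyTorch; the sketch below assumes a toy acoustic model emitting per-frame log-probabilities over characters plus a blank symbol, with purely illustrative dimensions.
```python
import torch
import torch.nn as nn

# Toy setup: T time steps, batch of N utterances, C output symbols (index 0 = CTC blank).
T, N, C = 50, 4, 28
acoustic_scores = torch.randn(T, N, C, requires_grad=True)   # stand-in for RNN outputs
log_probs = acoustic_scores.log_softmax(dim=-1)

# Unsegmented label sequences: no frame-level alignment is provided.
targets = torch.randint(low=1, high=C, size=(N, 10))          # 10 labels per utterance
input_lengths = torch.full((N,), T, dtype=torch.long)         # all frames are valid
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all alignments between the frame sequence and the label sequence.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow into the acoustic model; no pre-segmentation needed
```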

2010

Front-end factor analysis for speaker verification
  • This paper by Dehak et al. from JHU in IEEE/ACM Transactions on Audio, Speech, and Language Processing 2010 proposes a non-deep-learning method that uses Joint Factor Analysis (JFA) as a feature extractor to learn a low-dimensional speaker representation for speaker verification, which is also used to model session and channel effects/variabilities.
  • In this new space, a given speech utterance is represented by a new vector named total factors (called the identity-vector or the “i-vector”). The i-vector is thus a feature that represents the characteristics of the frame-level features’ distributive pattern. i-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not extracted when computing the i-vector). It’s extracted in a similar manner to the eigenvoice adaptation scheme or the JFA technique, but is extracted per sentence (or input speech sample).
  • Two speaker verification systems are proposed which use this new representation. The first system is a Support-Vector-Machine-based system that uses the cosine kernel to estimate the similarity between the input data. The second system directly uses the cosine similarity as the final decision score. In this scoring, they removed the SVM from the decision process. One important characteristic of this approach is that there is no speaker enrollment, unlike in other approaches like SVM and JFA, which makes the decision process faster and less complex.
  • They achieved an EER of 1.12% and MinDCF of 0.0094 using the cosine distance scoring on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation dataset. They also obtained 4% absolute EER improvement for both-gender trials on the 10sec-10sec condition compared to the classical joint factor analysis scoring.
  • Up until d-vectors, the state-of-the-art speaker verification systems were based on the concept of i-vectors (which use Probabilistic Linear Discriminant Analysis (PLDA) as a classifier to make the final decision).

2014

Towards End-To-End Speech Recognition with Recurrent Neural Networks
  • This paper by Graves and Jaitly at ICML 2014 (PMLR) presents a character-level speech recognition system that directly transcribes audio data to text using a recurrent neural network with minimal preprocessing, without requiring an intermediate phonetic representation.
  • The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and a modified Connectionist Temporal Classification (CTC) objective function that allows a direct optimization of the word error rate, even in the absence of a lexicon or language model. Further, they show how to integrate the network outputs with a language model during decoding.
  • The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7% and achieves state-of-the-art accuracy on the Wall Street Journal corpus for speaker independent recognition.
Deep neural networks for small footprint text-dependent speaker verification
  • This paper by Variani et al. from JHU, Google, and Biometric Recognition Group in 2014 investigates the use of deep neural networks (DNNs) to train speaker embeddings for a small footprint text-dependent speaker verification task. The DNN architecture is shown in the figure below.
  • During model training, the DNN takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and generates the one-hot speaker label (or the speaker probability) to classify speakers at the frame-level.
  • During speaker enrollment, the trained DNN is used to extract speaker-specific features/embeddings by averaging the activations from the last hidden layer (called deep-vectors or “d-vectors” for short), which is taken as the speaker model.
  • During speaker evaluation, a d-vector is extracted for each utterance and compared to the enrolled speaker model to make a verification decision by calculating the cosine distance between the test d-vector and the claimed speaker’s d-vector, similar to the i-vector framework. A verification decision is made by comparing the distance to a threshold.
  • Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the d-vectors are more robust to additive noise and outperform i-vectors at low False Rejection operating points. The combined (d+i)-vector system outperforms the i-vector system by 14% and 25% relative in equal error rate (EER) for clean and noisy conditions, respectively.
  • Note that unlike the i-vector framework, this doesn’t have any assumptions about the feature’s distribution (the i-vector framework assumes that the i-vector has a Gaussian distribution).
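  • The enrollment and verification flow described above can be sketched as follows; the embedding dimension, number of enrollment utterances, and decision threshold are illustrative rather than the paper’s settings.
```python
import numpy as np

def utterance_dvector(frame_activations):
    """Average (and L2-normalize) last-hidden-layer activations over frames."""
    d = frame_activations.mean(axis=0)
    return d / np.linalg.norm(d)

def enroll_speaker(enrollment_utterances):
    """Speaker model = average of the enrollment utterances' d-vectors."""
    model = np.mean([utterance_dvector(u) for u in enrollment_utterances], axis=0)
    return model / np.linalg.norm(model)

def verify(test_utterance, speaker_model, threshold=0.7):
    """Accept the identity claim if the cosine similarity exceeds a tuned threshold."""
    score = float(np.dot(utterance_dvector(test_utterance), speaker_model))
    return score >= threshold

# Toy usage with random "activations" of shape (num_frames, hidden_dim).
rng = np.random.default_rng(0)
enrollment = [rng.standard_normal((200, 256)) for _ in range(3)]
test = rng.standard_normal((180, 256))
print(verify(test, enroll_speaker(enrollment)))
```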

2015

Listen, Attend and Spell
  • This paper by Chan et al. from CMU and Google in 2015 presents Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.
  • LAS is based on the sequence-to-sequence framework, is trained end-to-end and has two main components: a listener (encoder) and a speller (decoder). The listener is a pyramidal RNN encoder that accepts filter bank spectra as inputs, transforms the input sequence into a high level feature representation and reduces the number of timesteps that the decoder has to attend to. The speller is an attention-based RNN decoder that attends to the high level features and spells out the transcript one character at a time.
  • The proposed system does not use the concepts of phonemes, nor does it rely on pronunciation dictionaries or HMMs. They bypass the conditional independence assumptions of CTC, and show how they can learn an implicit language model that can generate multiple spelling variants given the same acoustics. In other words, producing character sequences without making any independence assumptions between the characters is the key improvement of LAS over previous end-to-end CTC models.
  • To further improve the results, they used samples from the softmax classifier in the decoder as inputs to the next step prediction during training. Finally, they show how a language model trained on additional text can be used to rerank their top hypotheses.
  • On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
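  • A minimal sketch of the listener’s pyramidal time reduction (concatenating adjacent frames before each BLSTM layer); layer sizes are illustrative rather than the paper’s configuration.
```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenate pairs of adjacent frames, halving the time resolution."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, input_dim)
        b, t, d = x.shape
        x = x[:, : t - (t % 2)]                  # drop a trailing frame if time is odd
        x = x.reshape(b, t // 2, d * 2)          # stack adjacent frames -> half as many steps
        out, _ = self.blstm(x)
        return out                               # (batch, time // 2, 2 * hidden_dim)

# Three stacked pBLSTM layers reduce e.g. 1000 filter-bank frames to 125 attention targets.
feats = torch.randn(2, 1000, 40)
listener = nn.Sequential(PyramidalBLSTMLayer(40, 256),
                         PyramidalBLSTMLayer(512, 256),
                         PyramidalBLSTMLayer(512, 256))
print(listener(feats).shape)  # torch.Size([2, 125, 512])
```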

2017

CNN Architectures for Large-Scale Audio Classification
  • This paper by Hershey et al. from Google in ICASSP 2017 presents VGGish, applying various state-of-the-art image-classification CNN architectures to audio and showing that they are capable of excellent results on audio classification when compared to a simple fully connected network or earlier image classification architectures.
  • They examine a fully connected baseline alongside CNN architectures such as AlexNet, VGG, Inception, and ResNet. The input audio is divided into non-overlapping 960 ms frames, which are decomposed by applying the Fourier transform, resulting in a spectrogram. The spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed. This gives log-mel spectrogram patches that are passed as input to all classifiers (see the frontend sketch below). They explore the effects of training with different sized subsets of the 70M training videos (5.24 million hours) with 30,871 labels.
  • While their dataset contains video-level labels, they are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet. They find that a model for AED with embeddings learned from these classifiers does much better than raw features on the Audio Set AED classification task.
  • They find that derivatives of image classification networks do well on the audio classification task, that increasing the number of labels we train on provides some improved performance over subsets of labels, that performance of models improves as we increase training set size, and that a model using embeddings learned from the video-level task do much better than a baseline on the AudioSet classification task.
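  • A minimal sketch of this log-mel frontend, assuming librosa and VGGish-style frame/hop parameters (the paper’s exact settings may differ slightly).
```python
import numpy as np
import librosa

def log_mel_patches(audio, sr=16000, n_mels=64, patch_frames=96):
    """Compute non-overlapping log-mel spectrogram patches (~960 ms each) as classifier inputs."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms hop
        n_mels=n_mels)
    log_mel = np.log(mel + 1e-2)      # log-transform the magnitudes
    n_patches = log_mel.shape[1] // patch_frames
    return np.stack([log_mel[:, i * patch_frames:(i + 1) * patch_frames]
                     for i in range(n_patches)])

patches = log_mel_patches(np.random.randn(16000 * 10).astype(np.float32))
print(patches.shape)  # (num_patches, 64, 96)
```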

2018

X-Vectors: Robust DNN Embeddings for Speaker Recognition
  • This paper by Snyder et al. from JHU in ICASSP 2018 uses data augmentation to improve the performance of deep neural network (DNN) embeddings for speaker recognition.
  • The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings called x-vectors.
  • While prior studies have found that embeddings leverage large-scale training datasets better than i-vectors, it can be challenging to collect substantial quantities of labeled data for training. They use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness.
  • Their data augmentation strategy employs additive noises and reverberation. Reverberation involves convolving room impulse responses (RIR) with audio. They use the simulated RIRs described by Ko et al., and the reverberation itself is performed with the multicondition training tools in the Kaldi ASpIRE recipe. For additive noise, they use the MUSAN dataset, which consists of over 900 noises, 42 hours of music from various genres, and 60 hours of speech from twelve languages (a minimal sketch of these operations follows this entry).
  • A PLDA classifier is used in the x-vector framework to make the final decision, similar to i-vector systems.
  • The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese where they achieve superior performance on the evaluation datasets.
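  • A minimal sketch of the two augmentation operations (additive noise at a random SNR and reverberation via RIR convolution); the actual recipe uses MUSAN noises, simulated RIRs, and Kaldi tooling rather than the toy signals below.
```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise recording into speech at a target signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Simulate reverberation by convolving speech with a room impulse response."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = add_noise(clean, rng.standard_normal(16000), snr_db=rng.uniform(0, 15))
reverberant = add_reverb(clean, rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800))
```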
WaveGlow: A Flow-based Generative Network for Speech Synthesis
  • This paper by Prenger et al. from NVIDIA in 2018 proposes WaveGlow, a flow-based network capable of generating high quality speech from mel-spectrograms.
  • WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
  • Their PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
  • This paper by Shen et al. from Google in 2018 describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
  • Their model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech.
  • To validate their design choices, they present ablation studies of key components of their system and evaluate the impact of using mel-spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features.
  • They further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
  • PyTorch hub

2019

wav2vec: Unsupervised Pre-training for Speech Recognition
  • Reducing the need for manually annotated data is important for developing systems that understand non-English languages, particularly those with limited existing training sets of transcribed speech.
  • This paper by Schneider et al. from Facebook AI in 2019 introduces wav2vec, the first application of unsupervised pre-training to speech recognition using a fully convolutional model that learns representations of raw, unlabeled audio.
  • Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. They pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task.
  • Wav2vec trains models to learn the difference between original speech examples and modified versions, often repeating this task hundreds of times for each second of audio, and predicting the correct audio milliseconds into the future.
  • This self-supervised approach beats traditional ASR systems that rely solely on transcribed audio. Their experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Their approach achieves 2.43% WER on the nov92 test set, outperforming Deep Speech 2 (Amodei et al., 2016), the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
  • They show that more data for pre-training improves performance and that this approach not only improves resource-poor setups, but also settings where all WSJ training data is used.
  • Facebook AI article.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
  • This paper by Park et al. from Google in 2019 presents SpecAugment, a simple data augmentation method for speech recognition.
  • SpecAugment greatly improves the performance of ASR networks. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. They apply SpecAugment on Listen, Attend and Spell (LAS) networks for end-to-end speech recognition tasks.
  • They achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple handcrafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and they are able to gain performance by using bigger networks and training longer. On LibriSpeech, they achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5’00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
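  • A minimal NumPy sketch of the frequency- and time-masking policies (time warping is omitted for brevity); mask widths and counts are illustrative, whereas the paper defines per-dataset policies.
```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_mask_param=10,
                 num_time_masks=2, time_mask_param=40):
    """Mask random blocks of frequency channels and time steps of a (n_mels, T) log-mel spectrogram."""
    aug = log_mel.copy()
    n_mels, n_frames = aug.shape
    fill = aug.mean()                                  # replace masked regions with the mean value
    for _ in range(num_freq_masks):
        f = np.random.randint(0, freq_mask_param + 1)  # mask width in mel channels
        f0 = np.random.randint(0, max(1, n_mels - f))
        aug[f0:f0 + f, :] = fill
    for _ in range(num_time_masks):
        t = np.random.randint(0, time_mask_param + 1)  # mask width in frames
        t0 = np.random.randint(0, max(1, n_frames - t))
        aug[:, t0:t0 + t] = fill
    return aug

augmented = spec_augment(np.random.randn(80, 300))
```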

2020

Conformer: Convolution-augmented Transformer for Speech Recognition
  • Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.
  • This paper by Gulati et al. from Google in Interspeech 2020 achieves the best of both worlds by integrating components from both CNNs and Transformers for end-to-end speech recognition to model both local and global dependencies of an audio sequence in a parameter-efficient way.
  • To this end, they propose the convolution-augmented Transformer for speech recognition, named Conformer (a simplified sketch of the block structure follows this entry). Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, the Conformer model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other. They also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
  • They studied the importance of each component, and demonstrated that the inclusion of convolution modules is critical to the performance of the Conformer model.
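  • A simplified sketch of the macaron-style Conformer block; unlike the paper, it omits relative positional encodings and the gated linear unit in the convolution module.
```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Macaron Conformer block: 1/2 FFN -> self-attention -> conv module -> 1/2 FFN -> LayerNorm."""

    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=1),                                     # pointwise
            nn.Conv1d(d, d, kernel_size=kernel, padding=kernel // 2, groups=d), # depthwise
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, kernel_size=1))                                     # pointwise
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                                  # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)                         # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global, content-based interactions
        c = self.conv_norm(x).transpose(1, 2)              # (batch, d, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)               # local feature modeling
        x = x + 0.5 * self.ffn2(x)                         # second half-step feed-forward
        return self.final_norm(x)

out = ConformerBlockSketch()(torch.randn(2, 100, 256))
```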
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
  • This paper by Baevski et al. from Facebook AI in NeurIPS 2020 shows for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
  • Wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
  • Compared to wav2vec, wav2vec 2.0 learns basic speech units used to tackle a self-supervised task. The model is trained to predict the correct speech unit for masked parts of the audio, while at the same time learning what the speech units should be.
  • Experiments using all labeled data of LibriSpeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. With just 10 minutes of transcribed speech and 53K hours of unlabeled speech, wav2vec 2.0 enables speech recognition models at a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
  • This opens the door for speech recognition models in many more languages, dialects, and domains that previously required much more transcribed audio data to provide acceptable accuracy.
  • They have also developed a cross-lingual approach, dubbed XLSR, that can learn speech units common to several languages. This approach helps when we have even small amounts of unlabeled speech, since languages for which we have little data can benefit from languages for which more data is available.
  • Github repo; Facebook AI article.
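  • A didactic sketch of the masked contrastive (InfoNCE-style) objective described above, with distractors drawn from other masked time steps of the same utterance; the actual fairseq implementation additionally uses a codebook diversity loss and operates on batches.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, masked_idx, num_distractors=100, temperature=0.1):
    """Identify the true quantized latent for each masked time step among sampled distractors.

    context:    (T, D) Transformer outputs; quantized: (T, D) quantized latent targets;
    masked_idx: 1-D tensor of masked time indices.
    """
    losses = []
    for t in masked_idx.tolist():
        others = masked_idx[masked_idx != t]                                  # distractor positions
        distractors = others[torch.randint(len(others), (num_distractors,))]
        candidates = torch.cat([quantized[t:t + 1], quantized[distractors]])  # true target first
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

T, D = 200, 256
ctx, q = torch.randn(T, D, requires_grad=True), torch.randn(T, D)
mask = torch.randperm(T)[:120]                    # indices of masked time steps
contrastive_loss(ctx, q, mask).backward()
```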
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
  • Real-world audio recordings are often degraded by factors such as noise, reverberation, and equalization distortion.
  • This paper by Su et al. from Princeton and Adobe Research in 2020 introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio.
  • They use an end-to-end feed-forward WaveNet architecture, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain. HiFi-GAN relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
  • The proposed model generalizes well to new speakers, new speech content, and new environments. It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
  • Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models.
  • This paper by Kong et al. from Kakao Enterprise in NeurIPS 2020 proposes HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, they demonstrate that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality (see the reshaping sketch at the end of this entry).
  • HiFi-GAN outperforms the best performing publicly available models in terms of synthesis quality, even comparable to human level. Moreover, it shows a significant improvement in terms of synthesis speed. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that the proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.
  • They took inspiration from the characteristic of speech audio that consists of patterns with various periods and applied it to neural networks, and verified that the existence of the proposed discriminator greatly influences the quality of speech synthesis through the ablation study.
  • HiFi-GAN shows the ability to generalize to the mel-spectrogram inversion of unseen speakers and to synthesize speech audio comparable to human quality from noisy inputs in an end-to-end setting. In addition, their small footprint model demonstrates comparable sample quality to the best publicly available autoregressive counterpart, while generating samples an order of magnitude faster than real-time on CPU. This shows progress towards on-device natural speech synthesis, which requires low latency and a small memory footprint.
  • Finally, their experiments show that the generators of various configurations can be trained with the same discriminators and learning mechanism, which indicates the possibility of flexibly selecting a generator configuration according to the target specifications without the need for a time-consuming hyper-parameter search for the discriminators.
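  • A minimal sketch of the period-based reshaping used by the paper’s multi-period discriminator, which views the 1-D waveform as a 2-D array so that equally spaced (periodic) samples fall in the same column.
```python
import torch
import torch.nn.functional as F

def reshape_by_period(waveform, period):
    """Reshape a waveform of shape (batch, 1, T) into (batch, 1, T // period, period)."""
    b, c, t = waveform.shape
    if t % period != 0:
        waveform = F.pad(waveform, (0, period - t % period), mode="reflect")
        t = waveform.shape[-1]
    return waveform.view(b, c, t // period, period)

audio = torch.randn(4, 1, 22050)
views = {p: reshape_by_period(audio, p) for p in (2, 3, 5, 7, 11)}   # the paper's prime periods
print({p: tuple(v.shape) for p, v in views.items()})
```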
GAN-based Data Generation for Speech Emotion Recognition
  • This paper by Eskimez et al. from Microsoft in Interspeech 2020 proposes a GAN-based method to generate synthetic data in the form of speech emotion spectrograms, which can be used for training speech emotion recognition networks. Specifically, they investigate the usage of GANs for capturing the data manifold when the data is eyes-off, i.e., where networks can be trained using the data but the data cannot be copied from the clients.
  • They propose a CNN-based GAN with spectral normalization on both the generator and discriminator, both of which are pre-trained on large unlabeled speech corpora. They show that their method provides better speech emotion recognition performance than a strong baseline.
  • They proposed to use GANs for modeling imbalanced and highly skewed data among clients for future use, even after the original data is removed.
  • Furthermore, they show that even after the data on the client is lost, their model can generate similar data that can be used for model bootstrapping in the future. Although they evaluated their method for speech emotion recognition, it can be applied to other tasks.
Unsupervised Cross-lingual Representation Learning at Scale
  • This paper by Conneau et al. from Facebook AI in ACL 2020 shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.
  • They train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data.
  • Their model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER.
  • XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models.
  • They also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale.
  • Finally, they show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks.
  • Facebook AI post.

2021

Generative Spoken Language Modeling from Raw Audio
  • This paper by Lakhotia et al. from Facebook AI in 2021 introduces Generative Spoken Language Modeling which learns speech representations from CPC, Wav2Vec2.0, and HuBERT for synthesizing speech.
  • They introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. They set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), they find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
  • Facebook AI post.
Text-Free Prosody-Aware Generative Spoken Language Modeling
  • Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored.
  • This paper by Kharitonov et al. from Facebook AI in 2021 builds upon Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) which addresses the generative aspects of speech pre-training, by replacing text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech.
  • In this work, they present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms.
  • They devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.
  • Facebook AI post.
  • Github repo
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
  • This paper by Polyak et al. from Facebook AI in Interspeech 2021 proposes using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, they separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows speech to be synthesized in a controllable manner.
  • They analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, they evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings’ intelligibility, and overall quality using subjective human evaluation.
  • Lastly, they demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, they can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
  • Facebook AI post.
  • Github repo

2022

Direct speech-to-speech translation with discrete units
  • This paper by Lee et al. from Facebook AI in 2022 presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
  • They tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech.
  • When target text transcripts are available, they design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass.
  • Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, S2ST’s performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of their system for translation between unwritten languages.
  • Audio samples
Textless Speech Emotion Conversion using Discrete and Decomposed Representations
  • Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity.
  • This paper by Kreuk et al. from Facebook AI in 2021 casts the problem of emotion conversion as a spoken language translation task. They use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion.
  • First, they modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units.
  • Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows them to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc.
  • They demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. They rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
  • Facebook AI post
  • Github repo
Generative Spoken Dialogue Language Modeling
  • This paper by Nguyen et al. from Facebook AI in 2022 introduces dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels.
  • It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking.
  • Facebook AI post
  • Github repo
textless-lib: a Library for Textless Spoken Language Processing
  • Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources.
  • This paper by Kharitonov et al. from Facebook AI in 2022 introduces textless-lib, a PyTorch-based library aimed at facilitating research in this area. They describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation.
  • They believe that textless-lib substantially simplifies research in the textless setting and will be useful not only for speech researchers but also for the NLP community at large.
  • Facebook AI post
  • Github repo

Multimodal

2016

“Why Should I Trust You?” Explaining the Predictions of Any Classifier
  • Trust is crucial for effective human interaction with machine learning systems, and explaining individual predictions is important in assessing trust.
  • This paper by Ribeiro et al. from Guestrin’s lab at UWash in 2016 proposes LIME, a novel model-agnostic, modular, and extensible explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction. They further introduce SP-LIME, a method to explain models by selecting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem and providing a global view of the model to users.
  • They demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). The usefulness of explanations is shown via novel experiments, both simulated and with human subjects.
  • Their explanations empower users in various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, getting insights into predictions, and detecting why a classifier should not be trusted.
  • LIME - Local Interpretable Model-Agnostic Explanations blog post.
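  • A minimal usage sketch with the lime package on a scikit-learn tabular classifier; the dataset and model choices are illustrative.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification")

# Explain one prediction: LIME perturbs the instance and fits a sparse linear model locally.
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # top features and their local weights
```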

2017

A Unified Approach to Interpreting Model Predictions
  • While various methods have recently been proposed to help users interpret the predictions of complex models, it is often unclear how these methods are related and when one method is preferable over another.
  • This paper by Lundberg and Lee from UWash in NeurIPS 2017 seeks to address this problem and presents a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations).
  • SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions. SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties.
  • The new class unifies six existing methods, which is notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, they present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
  • Github repo.
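  • A minimal usage sketch with the shap package on a scikit-learn tree ensemble; the dataset and model choices are illustrative.
```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

# TreeExplainer computes Shapley values efficiently for tree-based models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Each prediction decomposes into a baseline (expected value) plus per-feature contributions.
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```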

2021

Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
  • All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
  • This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 compares two data-centric AI methods for improving performance on golden utterances: improving the annotation quality of existing training utterances, and augmenting the training data with varying amounts of synthetic data.
  • Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.
Learning Transferable Visual Models From Natural Language Supervision
  • This paper by Radford et al. from OpenAI introduces CLIP, a pre-training task which efficiently learns visual concepts from natural language supervision. CLIP uses separate vision and language encoders and a contrastive loss to bring similar image-text pairs closer while pulling apart dissimilar pairs as part of pretraining.
  • CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
  • CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in its dataset. This behavior can then be used to turn CLIP into a zero-shot classifier: all of a dataset’s classes are converted into captions such as “a photo of a dog”, and CLIP predicts the class of the caption it estimates best pairs with a given image (see the sketch below).
  • It can rival the generalization of ImageNet SoTA models (since it was pretrained on 400M image and noisy text pairs) and is thus typically used for zero-shot image classification and zero-shot cross-modal searches.
  • OpenAI article.
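  • A minimal zero-shot classification sketch, assuming the Hugging Face transformers CLIP wrappers and an arbitrary local image file.
```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # any image of your choosing
captions = [f"a photo of a {c}" for c in ["dog", "cat", "car"]]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)                               # encodes both modalities
probs = outputs.logits_per_image.softmax(dim=-1)        # similarities -> class probabilities
print(dict(zip(captions, probs[0].tolist())))
```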
Zero-Shot Text-to-Image Generation
  • Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training.
  • This paper by Ramesh et al. from OpenAI introduces DALL-E which offers a simple approach for text-to-image generation based on an autoregressive transformer which models the text and image tokens as a single stream of data. DALL-E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively.
  • They find that sufficient data and scale can lead to improved generalization, both in terms of zero-shot performance relative to previous domain-specific approaches, and in terms of the range of capabilities that emerge from a single generative model. Their findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task.
  • OpenAI article.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
  • This paper by Kim et al. from NAVER AI and Kakao in 2021 introduces Vision-and-Language Transformer (ViLT) that seeks to improve performance on various joint vision-and-language downstream tasks using Vision-and-Language Pre-training (VLP).
  • CLIP and Hugging Face’s VisionEncoderDecoder keep the image and language encoders as separate towers and align/glue them using either (i) a cross-entropy loss that utilizes cross-attention (in the case of VisionEncoderDecoder) or (ii) a contrastive loss (in the case of CLIP). This is shown in the figure below from Prithvi Da which summarizes the approaches.

  • The downside of the above approach is poor image-text alignment, a huge data appetite, and longer training time. This approach is useful for creating a downstream generative model to tackle applications such as cross-modal retrieval, say OCR, image captioning, content-based image retrieval (CBIR), or even text2image (using DALL-E or CLIPDraw). However, there are derived/advanced multimodal tasks involving vision and language, such as Natural Language for Visual Reasoning (NLVR), Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and Visual Navigation, that are much more complicated in nature than the aforementioned tasks. The diagram below from Prithvi Da summarizes the hierarchy of image-based tasks.

  • In order to tackle derived tasks in a similar way, we need to train on image and language data jointly (rather than in isolation) in a “mixed-modal” fashion with a combination of an image-level loss, a language-level loss, and an alignment loss. This is the underlying idea behind VLP. The diagram below from Prithvi Da compares the two approaches: aligning/gluing independently trained vision and language encoders (with either a cross-entropy loss or a contrastive loss) vs. training both encoders jointly.

  • Current approaches to VLP heavily rely on image feature extraction processes using convolutional visual embedding networks (e.g., Faster R-CNN and ResNets), which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). This is problematic in terms of both efficiency/speed, in that extracting input features requires much more computation than the multimodal interaction steps; and expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary.
  • ViLT seeks to remedy the above two issues by presenting a minimal VLP model, which is monolithic in that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. In other words, the unique selling point of ViLT is that while most VLP models rely on object detectors, CNNs or transformers for feature extraction (for e.g., UNiTER, LXMERT and VisualBERT need Faster-RCNN for object detection), ViLT stands out of the crowd by removing the need for object detectors. ViLT accomplishes this by avoiding heavyweight image encoders by directly embedding low-level pixel data with a single-layer projection and achieves similar results with reduced complexity, as shown in the diagram below:

  • Self-supervision is accomplished using (i) an Image Text Matching (ITM) loss and (ii) a Masked Language Modeling (MLM) loss. ITM is an alignment loss that encompasses cross-modality interaction between image and text, and it requires positive and negative pairs (see the sketch below). For text, ViLT simply reuses the Masked Language Modeling (MLM) objective used in BERT.
  • ViLT is pre-trained on four datasets: MSCOCO, Visual Genome, SBU Captions, and Google Conceptual Captions. They evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, they use VQAv2 and NLVR2; for retrieval, they use MSCOCO and Flickr30K (F30K).
  • Finally, they show that ViLT is over 10x faster than previous VLP models, yet with competitive or better downstream task performance.
  • The key takeaway in this paper is that VLP needs to focus more on the multi-modality interactions aspect inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders. ViLT-B/32 is a proof of concept that efficient VLP models free of convolution and region supervision can still be competent.
  • Github repo with code and pre-trained weights; HuggingFace docs; ViLT tutorials/notebooks.
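  • A didactic sketch of the ITM objective referenced above (the MLM part follows BERT); the pooled embedding, head, and labels below are stand-ins rather than ViLT’s actual implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def itm_loss(pooled_pair_embeddings, itm_head, is_matched):
    """Image-Text Matching: binary classification on the pooled multimodal embedding.

    Negative pairs are typically built by pairing an image with the text of another example.
    """
    logits = itm_head(pooled_pair_embeddings)             # (batch, 2)
    return F.cross_entropy(logits, is_matched)

batch, d = 8, 768
pooled = torch.randn(batch, d, requires_grad=True)        # stand-in for the transformer's pooled output
head = nn.Linear(d, 2)
labels = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])            # 1 = matched pair, 0 = mismatched
loss = itm_loss(pooled, head, labels)                      # added to the MLM loss during pre-training
loss.backward()
```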
MLIM: Vision-and-language Model Pre-training With Masked Language and Image Modeling
  • Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs. Current VLP approaches differ on (i) model architecture (especially image embedders), (ii) loss functions, and (iii) masking policies. Image embedders are either deep models like ResNet or linear projections that directly feed image pixels into the transformer. Typically, in addition to the Masked Language Modeling (MLM) loss, alignment-based objectives are used for cross-modality interaction, and RoI feature regression and classification tasks are used for Masked Image-Region Modeling (MIRM). Alignment-based objectives require pairings of image and text and heuristic objective functions. MIRM relies on object detectors. Masking policies either do not take advantage of multi-modality or are strictly coupled with alignments generated by other models.
  • This paper by Arici et al. from Amazon in 2021 presents Masked Language and Image Modeling (MLIM) for VLP. MLIM is pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: a Masked Language Modeling (MLM) loss (as in BERT) for text and an image reconstruction (RECON) loss for images, coupled with Modality Aware Masking (MAM). MAM determines the masking probability and applies masking to both word and image embeddings. The MLM task follows BERT: predict the masked words from the available words and image regions using a two-layer MLP MLM head that outputs logits over the vocabulary, with the MLM loss being the negative log-likelihood of the masked words. The RECON loss is an average of the pixel-wise sum of squared errors (SSE). Both image and word masking are realized by replacing an embedding with the embedding of [MASK]. This way, the transformer layers recognize [MASK]’s embedding as a special embedding that needs to be “filled in”, independent of the modality, by attending to the other vectors in the layer inputs.
  • Note that unlike other architectures (LXMERT, UNiTER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, MLIM uses a shallow CNN as an image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking friendly. The MLM + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
  • MLIM uses no specific alignment loss, but instead proposes Modality Aware Masking (MAM) to boost cross-modality interaction and take advantage of MLM and RECON losses that separately capture text and image reconstruction quality. Using MLM + RECON tasks coupled with MAM, they present a simplified VLP methodology and show that it has better downstream task performance on a proprietary e-commerce multi-modal dataset.
  • Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-modality by employing Modality Dropout (MDO). MDO improves fine-tuning by randomly dropping one of the modalities. Similar to MAM, MDO operates in one of three modes on a micro-batch: text-only, image-only, and image-text mode.
  • The authors also tried using the ITM loss proposed in ViLT. However, RECON instead of ITM loss offers better PR AUC. Similarly, using the ITM loss together with MLM and RECON does not change the performance.
  • The key takeaways from this paper are that MLIM is a simplified VLP method using MLM and RECON losses and MAM. They simplify loss function design, propose a shallow CNN-based image embedder to avoid heavyweight object detectors, and present an image decoder to enable the RECON loss. They believe VLP datasets (e.g., e-commerce datasets) are large enough to enable learning built-in image embedders during pre-training. While alignment-based loss functions are promising and help in learning contrastive features, finding good image-text pairs (especially negative pairs) becomes an issue and makes pre-training rely on pairing techniques. On the other hand, finer-grained objectives such as alignment and MIRM do not have ground truth: Masked Image-Region Modeling (MIRM) relies on RoI features and classes predicted by the object detector, and MIRM tasks aim to “fill in” masked regions. In contrast, the proposed RECON task aims to reconstruct the whole image and is designed to get the best cross-modality interaction inside the transformer.
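  • A didactic sketch of Modality Aware Masking with a shared [MASK] embedding and a pixel-wise SSE RECON loss; the masking probabilities and shapes are illustrative, not the paper’s schedule.
```python
import torch

def modality_aware_masking(text_emb, image_emb, mask_emb, p_text=0.15, p_image=0.5):
    """Pick a masking mode per (micro-)batch -- text-only, image-only, or both -- and replace
    the selected word/patch embeddings with a shared [MASK] embedding."""
    mode = int(torch.randint(0, 3, (1,)))                    # 0: text-only, 1: image-only, 2: both
    text_mask = torch.rand(text_emb.shape[:2]) < (p_text if mode != 1 else 0.0)
    image_mask = torch.rand(image_emb.shape[:2]) < (p_image if mode != 0 else 0.0)
    masked_text = torch.where(text_mask.unsqueeze(-1), mask_emb, text_emb)
    masked_image = torch.where(image_mask.unsqueeze(-1), mask_emb, image_emb)
    return masked_text, masked_image, text_mask, image_mask

def recon_loss(reconstructed, original):
    """Average over the batch of the pixel-wise sum of squared errors (SSE)."""
    return ((reconstructed - original) ** 2).flatten(1).sum(dim=1).mean()

d = 768
text, image = torch.randn(4, 32, d), torch.randn(4, 196, d)   # word and image-patch embeddings
masked_text, masked_image, _, _ = modality_aware_masking(text, image, torch.randn(d))
print(masked_text.shape, masked_image.shape)
print(recon_loss(torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)))
```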

2022

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
  • While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind.
  • This paper by Baevski et al. from Facebook in 2022 helps get us closer to general self-supervised learning by presenting data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self distillation setup using a standard Transformer architecture.
  • Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
  • Today’s self-supervised learning research almost always focuses on a single modality. As a result, researchers specializing in one modality often adopt a totally different strategy than those specializing in another. In the case of text, researchers train algorithms to fill in blanks in sentences. Speech models, on the other hand, must learn an inventory of basic speech sounds, for example by forecasting missing sounds. In computer vision, models are frequently taught to assign comparable representations to a color image of a cow and the same image flipped upside down, allowing them to correlate the two far more closely than they would with an unrelated image like a duck. data2vec represents a new paradigm of holistic self-supervised learning, in which further research can improve multiple modalities at once rather than just one.
  • For each modality, algorithms anticipate distinct units: pixels or visual tokens for images, words for text, and learned sound inventories for speech. Because a collection of pixels differs significantly from an audio waveform or a passage of text, algorithm development has been tied to a specific modality, meaning that algorithms in each modality continue to work differently. data2vec simplifies this by teaching models to predict their own representations of the input data, regardless of modality. By focusing on these representations (the layers of a neural network) instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This eliminates the learning task’s reliance on modality-specific targets. It also doesn’t use contrastive learning or reconstructed input examples.
  • To directly predict representations across modalities, it was necessary to define a robust normalization of the target features that would be reliable for different modalities. The method starts by computing target representations from an image, a piece of text, or a speech utterance using a teacher network. After that, a portion of the input is masked and passed through a student network, which predicts the teacher’s latent representations. Even though it only has a partial view of the data, the student model must predict accurate representations of the full input. The teacher network is identical to the student network, except with somewhat out-of-date weights (see the sketch below).
  • The method was tested on the primary ImageNet computer vision benchmark, where it outperformed existing approaches for a variety of model sizes. On speech, it surpassed wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised speech algorithms. For text, it was evaluated on the popular GLUE benchmark suite and came out on par with RoBERTa, a robustly optimized reimplementation of BERT.
  • Facebook AI link; Github; Marktechpost article.
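  • A didactic sketch of the teacher-student setup described above: an EMA teacher provides contextualized targets for masked positions and the student regresses them; the real model averages the top-K teacher layers and uses a Transformer rather than the linear stand-in below.
```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.999):
    """Teacher weights track the student as an exponential moving average."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(decay).add_(s_param, alpha=1 - decay)

def data2vec_step(student, teacher, inputs, mask):
    """Student sees a masked view and regresses the teacher's latent targets at masked positions."""
    with torch.no_grad():
        targets = teacher(inputs)                 # teacher encodes the full, unmasked view
    masked_inputs = inputs.clone()
    masked_inputs[mask] = 0.0                     # stand-in for a learned [MASK] embedding
    preds = student(masked_inputs)
    return F.smooth_l1_loss(preds[mask], targets[mask])

# Toy usage with a linear "encoder" standing in for a Transformer.
student = torch.nn.Linear(64, 64)
teacher = copy.deepcopy(student)
x = torch.randn(8, 100, 64)
mask = torch.rand(8, 100) < 0.15
data2vec_step(student, teacher, x, mask).backward()
ema_update(teacher, student)
```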

Hierarchical Text-Conditional Image Generation with CLIP Latents
  • In January 2021, OpenAI introduced DALL-E. A year later, their newest system, DALL-E 2, generates more realistic and accurate images with 4x greater resolution, better caption matching and photorealism.
  • Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style.
  • This paper by Ramesh et al. from OpenAI in 2022 proposes DALL-E 2, which leverages these representations for image generation via a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder (dubbed “unCLIP”) that generates an image conditioned on the image embedding.
  • They show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
  • Their decoder, which is conditioned on image representations, can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation.
  • They use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
  • OpenAI article.
AutoDistill: an End-to-End Framework to Explore and Distill Hardware-Efficient Language Models
  • Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted an increasing amount of interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives.
  • This paper by Zhang et al. from Google in 2022 addresses these problems by proposing AutoDistill, an end-to-end model distillation framework integrating model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. They use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target hardware. Experiments on TPUv4i find seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT.
  • On downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an 81.69 average score, which is higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) in terms of average GLUE score. On SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
A Generalist Agent
  • This paper by Reed et al. from DeepMind in 2022 proposes Gato, a single generalist agent beyond the realm of text outputs, inspired by progress in large-scale language modeling.
  • Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
  • The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data from different tasks and modalities, it is serialized into a flat sequence of tokens. In this representation, Gato can be trained and sampled from akin to a standard large-scale language model. Masking is used such that the loss function is applied only to target outputs, i.e. text and various actions. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.
  • Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196.
  • Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. The authors envision that in the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.
  • DeepMind page.
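  • To make the serialization concrete, here is a small, hedged sketch of how continuous observations and actions could be turned into discrete tokens and interleaved into one flat sequence with a loss mask over target outputs. The exact constants (mu-law parameters, 1024 bins, an offset placing continuous tokens after the text vocabulary) follow the paper’s description but should be treated as assumptions here.

```python
import numpy as np

def mu_law_encode(x, mu=100.0, m=256.0):
    """Squash continuous values toward [-1, 1] before uniform binning."""
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(x, n_bins=1024, vocab_offset=32_000):
    """Map continuous observations/actions to token ids in a range disjoint
    from the text vocabulary."""
    x = np.clip(mu_law_encode(x), -1.0, 1.0)
    bins = np.floor((x + 1.0) / 2.0 * (n_bins - 1)).astype(np.int64)
    return bins + vocab_offset

# One training example: interleave observation and action tokens into a flat
# sequence; only the action (target) positions contribute to the loss.
obs_tokens = tokenize_continuous(np.random.randn(8))   # e.g., proprioception
act_tokens = tokenize_continuous(np.random.randn(2))   # e.g., joint torques
sequence = np.concatenate([obs_tokens, act_tokens])
loss_mask = np.concatenate([np.zeros(8, dtype=bool), np.ones(2, dtype=bool)])
```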
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
  • Recent text-to-image generation methods provide a simple yet exciting conversion capability between the text and image domains. While these methods have incrementally improved the generated image fidelity/quality and text relevancy (i.e., the adherence of generated images to the text), several pivotal gaps remain open, limiting applicability and quality.
  • This paper by Gafni et al. from Meta AI in 2022 proposes a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case.
  • While some methods propose image editing techniques, progress is not often directed towards enabling new forms of human creativity and experiences. They attempt to progress text-to-image generation towards a more interactive experience, where people can perceive more control over the generated outputs, thus enabling real-world applications such as storytelling.
  • In addition to improving the general image quality, they focus on improving key image aspects that are significant in human perception, such as faces and salient objects, resulting in higher favorability of their method in human evaluations and objective metrics.
  • Their model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512 × 512 pixels, significantly improving visual quality. Through scene controllability, they introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story they wrote.
i-Code: An Integrative and Composable Multimodal Learning Framework
  • Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities.
  • This paper by Yang et al. from Microsoft in 2022 presents i-Code, a self-supervised pretraining framework which jointly learns representations for vision, language and speech into a unified, shared and general-purpose vector representation.
  • In this framework, data from each modality are first given to pretrained single-modality encoders. The encoder outputs are then integrated with a multimodal fusion network, which uses novel attention mechanisms and other architectural innovations to effectively combine information from the different modalities. The entire system is pretrained end-to-end with new objectives including (i) masked modality modeling and (ii) cross-modality contrastive learning.
  • They show that pretraining on dual-modality datasets can also yield competitive or even better performance than pretraining on videos, the data resource that previous three-modality models were restricted to. i-Code can dynamically process single, dual, and triple-modality data during training and inference, flexibly projecting different combinations of modalities into a single representation space.
  • Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11% and demonstrating the power of integrative multimodal pretraining.
  • The figure below from the paper shows the overall model architecture of i-Code. Shown on the right is the attention and feed-forward operation in a fusion network layer with (a) merge-attention layers and (b) co-attention layers. To facilitate more effective cross-modality understanding and design the best fusion architecture, they explore two variations of the traditional attention mechanism: mechanisms that merge and cross the attention scores of different modalities, namely merge-attention (based on self-attention) and co-attention (based on self- and cross-attention) respectively. Note that for simplicity, only the residual connection of the language modality is drawn, but all three modalities use residual connections.

RecSys

2015

Collaborative Deep Learning for Recommender Systems
  • Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse.
  • This paper by Wang et al. from HKU addresses this problem by generalizing recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and proposing a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

2016

Wide & Deep Learning for Recommender Systems
  • Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. However, memorization and generalization are both important for recommender systems. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank.
  • This paper by Cheng et al. from Google in 2016 introduced Wide & Deep learning – jointly trained wide linear models and deep neural networks – to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings. In other words, the fusion of wide and deep models combines the strengths of memorization and generalization and provides better recommendation systems. The two components are trained jointly with the same loss function.
  • They productionized and evaluated the system on Google Play Store, a massive-scale commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.
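  • A minimal PyTorch sketch of the joint architecture, under assumed feature sizes (not the production system): the wide part is a linear model over cross-product features, the deep part is an MLP over learned embeddings, and their logits are summed and trained with a single loss.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_wide_features, n_sparse_ids, emb_dim=32, hidden=(256, 128)):
        super().__init__()
        # Wide part: a linear model over (hashed) cross-product features.
        self.wide = nn.Linear(n_wide_features, 1)
        # Deep part: embeddings for sparse ids, followed by an MLP.
        self.embedding = nn.Embedding(n_sparse_ids, emb_dim)
        layers, in_dim = [], emb_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers += [nn.Linear(in_dim, 1)]
        self.deep = nn.Sequential(*layers)

    def forward(self, wide_x, sparse_ids):
        # sparse_ids: (batch, n_fields) -> average embeddings across fields.
        deep_in = self.embedding(sparse_ids).mean(dim=1)
        logit = self.wide(wide_x) + self.deep(deep_in)   # joint training, shared loss
        return torch.sigmoid(logit)
```

  • In the paper, the wide part is optimized with FTRL and the deep part with AdaGrad, but a single optimizer over the summed logit is enough to illustrate the idea.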

2017

DeepFM: A Factorization-Machine based Neural Network for CTR Prediction
  • Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering.
  • This paper by Guo et al. from Harbin Institute of Technology and Huawei in 2017 proposes DeepFM, an end-to-end learning model that emphasizes both low- and high-order feature interactions. DeepFM is a Factorization-Machine (FM) based neural network for CTR prediction, designed to overcome the shortcomings of the state-of-the-art models and to achieve better performance. DeepFM trains an FM component and a deep component jointly, modeling low-order feature interactions through the FM and high-order feature interactions through the DNN. Unlike Google’s Wide & Deep model, DeepFM can be trained end-to-end with a shared input to its “wide” and “deep” parts, with no need for feature engineering besides raw features.
  • DeepFM gains performance improvement from these advantages: 1) it does not need any pre-training; 2) it learns both high- and low-order feature interactions; 3) it introduces a sharing strategy of feature embedding to avoid feature engineering.
  • DeepFM thus combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.
  • Extensive experiments were conducted on two real-world datasets (the Criteo dataset and a commercial App Store dataset) to compare the effectiveness and efficiency of DeepFM and the state-of-the-art models. The results demonstrate that 1) DeepFM outperforms the state-of-the-art models in terms of AUC and Logloss on both datasets; 2) the efficiency of DeepFM is comparable to the most efficient deep model among the state-of-the-art.
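  • The sketch below (an illustrative PyTorch reimplementation, not the authors’ code) shows the key design point: the FM and deep components read from the same embedding table, so no separate feature engineering is needed for the “wide” side.

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    """Sketch of DeepFM: the FM and deep parts share one embedding table."""
    def __init__(self, field_sizes, emb_dim=16, hidden=(128, 64)):
        super().__init__()
        n_ids = sum(field_sizes)
        self.first_order = nn.Embedding(n_ids, 1)      # FM first-order weights
        self.embedding = nn.Embedding(n_ids, emb_dim)  # shared latent vectors
        layers, in_dim = [], len(field_sizes) * emb_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.deep = nn.Sequential(*layers, nn.Linear(in_dim, 1))

    def forward(self, feat_ids):                       # feat_ids: (batch, n_fields)
        emb = self.embedding(feat_ids)                 # (batch, n_fields, emb_dim)
        # FM: first-order term + pairwise interactions via the "square of sum" trick.
        fm_1st = self.first_order(feat_ids).sum(dim=1)
        square_of_sum = emb.sum(dim=1).pow(2)
        sum_of_square = emb.pow(2).sum(dim=1)
        fm_2nd = 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)
        # Deep: MLP over the concatenated field embeddings.
        deep_out = self.deep(emb.flatten(start_dim=1))
        return torch.sigmoid(fm_1st + fm_2nd + deep_out)
```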

2019

Deep Learning Recommendation Model for Personalization and Recommendation Systems
  • With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood.
  • This paper by Naumov et al. from Facebook in 2019 proposes a state-of-the-art deep learning recommendation model (DLRM). They open-source its implementation in both PyTorch and Caffe2 frameworks.
  • The DLRM model handles continuous (dense) and categorical (sparse) features that describe users and products. DLRM exercises a wide range of hardware and system components, such as memory capacity and bandwidth, as well as communication and compute resources as shown in the figure below.

  • Furthermore, they design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers.
  • Compared to other DL-based approaches to recommendation, DLRM differs in two ways. First, it computes the feature interactions explicitly while limiting the order of interaction to pairwise interactions. Second, DLRM treats each embedded feature vector (corresponding to categorical features) as a single unit, whereas other methods (such as Deep and Cross) treat each element in the feature vector as a new unit that should yield different cross terms. These design choices help reduce computational/memory cost while maintaining competitive accuracy.
  • They compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation, system co-design, and benchmarking.
  • Facebook AI post.
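  • A small PyTorch sketch of the explicit pairwise interaction described above (illustrative, not the open-source DLRM code): dot products are taken between the bottom-MLP output for dense features and each categorical embedding, treating each embedding vector as a single unit.

```python
import torch

def dlrm_interact(dense_feat, sparse_embs):
    """Pairwise (second-order) interactions, DLRM-style.

    dense_feat:  (batch, d)      output of the bottom MLP over dense features
    sparse_embs: (batch, n, d)   one d-dim embedding per categorical feature
    Returns the input to the top MLP: dense vector + all pairwise dot products.
    """
    # Treat the processed dense vector as one more "feature" of dimension d.
    feats = torch.cat([dense_feat.unsqueeze(1), sparse_embs], dim=1)  # (batch, n+1, d)
    # All pairwise dot products between feature vectors.
    dots = torch.bmm(feats, feats.transpose(1, 2))                    # (batch, n+1, n+1)
    # Keep only the strictly lower triangle (each unordered pair once).
    i, j = torch.tril_indices(dots.size(1), dots.size(2), offset=-1)
    pairwise = dots[:, i, j]                                          # (batch, (n+1)n/2)
    return torch.cat([dense_feat, pairwise], dim=1)
```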

Core ML

1991

What Every Computer Scientist Should Know About Floating-Point Arithmetic
  • This gem by Goldberg et al. from Oracle in the 1991 issue of ACM Computing Surveys helps demystify common misconceptions about computer arithmetic and enables you to write more careful numerical code.

2007

What Every Programmer Should Know About Memory
  • This must-read paper by Drepper from Red Hat in 2007 offers a detailed treatment on how system memory works.

2011

SMOTE: Synthetic Minority Over-sampling Technique
  • This paper by Chawla et al. from University of South Florida introduces an approach to the construction of classifiers from imbalanced datasets.
  • A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class.
  • This paper shows that a combination of their method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. It also shows that this combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes.
  • Their method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier.
  • The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
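  • A minimal NumPy/scikit-learn sketch of the synthetic-example step (illustrative; the imbalanced-learn package provides a full SMOTE implementation): each synthetic point is a random interpolation between a minority example and one of its k nearest minority-class neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(minority_X, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between each
    minority example and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    _, idx = nn.kneighbors(minority_X)          # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority_X))       # pick a random minority sample
        j = idx[i, rng.integers(1, k + 1)]      # and one of its k neighbors
        gap = rng.random()                      # interpolate a random fraction
        synthetic.append(minority_X[i] + gap * (minority_X[j] - minority_X[i]))
    return np.asarray(synthetic)
```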

2012

Improving neural networks by preventing co-adaptation of feature detectors
  • This paper by Hinton et al. in 2012 introduced Dropout as a way to avoid overfitting.
  • When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This overfitting is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.
  • Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.
  • Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
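  • A tiny NumPy sketch of the mechanism, using the now-common “inverted” formulation (the original paper instead scales weights at test time): units are randomly zeroed during training and the survivors are rescaled so expected activations match inference.

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=None):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so that expected activations match those at test time."""
    if not training or p_drop == 0.0:
        return x                        # at test time the full network is used
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)
```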

2014

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
  • This paper by Srivastava et al. from Hinton’s lab in JMLR 2014 introduced Dropout, which (just like Batchnorm) is now part of the standard recipe for regularizing deep neural nets.

2015

ADAM: A Method for Stochastic Optimization
  • This paper by Kingma and Ba in ICLR 2015 introduces Adam (derived from adaptive moment estimation), an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.
  • It is a fusion of RMSProp with momentum and involves calculating the exponentially weighted moving averages of the first and second moments of the gradient (which are governed by the hyperparameters \(\beta_1\) and \(\beta_2\) respectively).
  • The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. They also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, they discuss AdaMax, a variant of Adam based on the infinity norm.
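  • A minimal NumPy sketch of the update rule with bias correction (per-parameter state; hyperparameter defaults follow the paper):

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient (first moment) and squared
    gradient (second moment), with bias correction for the zero-initialized EMAs."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# State initialization for a parameter vector w:
# state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
```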
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  • This paper by Ioffe and Szegedy from Google in ICML 2015 introduced BatchNorm, which is now commonly implemented to accelerate training of deep neural nets.
  • Also, check out this in-depth article on BatchNorm here.

2016

XGBoost: A Scalable Tree Boosting System
  • This paper by Chen and Guestrin from UWash in 2016 proposes eXtreme Gradient Boost (XGBoost), a scalable end-to-end tree boosting system that is widely used by data scientists and provides state-of-the-art results on many problems.
  • They propose a novel sparsity aware algorithm for handling sparse data and a theoretically justified weighted quantile sketch for approximate tree learning.
  • Their experience shows that cache access patterns, data compression and sharding are essential elements for building a scalable end-to-end system for tree boosting. These lessons can be applied to other machine learning systems as well.
  • By combining these insights, XGBoost is able to solve real-world-scale problems using far fewer resources than existing systems.
Layer Normalization
  • Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks.
  • This paper by Ba et al. from Hinton’s lab in 2016 transposes batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, they also give each neuron its own adaptive bias and gain, which are applied after the normalization but before the non-linearity.
  • Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step.
  • Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, they show that layer normalization can substantially reduce the training time compared with previously published techniques.
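  • A minimal NumPy sketch of the computation described above, normalizing each training case over its feature dimension (no batch statistics) and then applying a per-neuron gain and bias:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer normalization for x of shape (batch, features);
    gain and bias have shape (features,)."""
    mean = x.mean(axis=-1, keepdims=True)     # statistics per training case
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias
```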

2017

On Calibration of Modern Neural Networks
  • Modern neural networks exhibit a strange phenomenon: probabilistic error and miscalibration worsen even as classification error is reduced.
  • This paper by Guo et al. from Cornell University in ICML 2017 discovers that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, they observe that model capacity (in terms of depth and width), weight decay (regularization), and Batch Normalization are important factors that affect calibration even as they improve accuracy.
  • They evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets.
  • They suggest that simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling – a single-parameter variant of Platt Scaling – is the simplest, fastest, and most straightforward of the methods at calibrating predictions.
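  • A short PyTorch sketch of temperature scaling (illustrative, not the authors’ code): a single temperature T is fit on held-out validation logits by minimizing NLL; dividing logits by T does not change the argmax, so accuracy is unaffected.

```python
import torch

def fit_temperature(logits, labels, max_iter=50):
    """Learn a single temperature T > 0 on validation logits so that
    softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)       # parameterize T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)     # NLL of temperature-scaled logits
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```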

2018

Model Cards for Model Reporting
  • Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, the authors recommend that released models be accompanied by documentation detailing their performance characteristics.
  • This paper by Mitchell et al. from Google and UofT proposes a framework called model cards to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information.
  • While they focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, they provide cards for two supervised models: one trained to detect smiling faces in images, and one trained to detect toxic comments in text. They propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works.

2021

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better
  • Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval, and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, and resources required to train have all increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. Training and deploying models involves making either implicit or explicit decisions about efficiency.
  • This paper by Menghani from Google Research in 2021 motivates the problem of efficiency in deep learning, followed by a thorough survey of the seminal work in core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware). They lay out a mental model for readers to wrap their heads around the multiple focus areas of model efficiency and optimization, thereby offering the reader an opportunity to understand the state of the art, apply these techniques in the modeling process, and/or use them as a starting point for exploration.
  • They also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. They believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support.
  • Finally, they present a section of explicit and actionable insights supplemented by code, for a practitioner to use as a guide in this space. This section will hopefully give concrete and actionable takeaways, as well as tradeoffs to think about when optimizing a model for training and deployment.

2022

Pathways: Asynchronous Distributed Dataflow for ML
  • This paper by Barham et al. from Google in MLSys 2022 presents the design of Pathways, a new large scale orchestration layer for accelerators. Pathways is explicitly designed to enable exploration of new systems and ML research ideas, while matching state-of-the-art multi-controller performance on current ML models which are single-tenant SPMD.
  • Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. Pathways upends the execution model of JAX programs, pulling user code back into a single-controller model, and interposing a centralized resource management and scheduling framework between client and accelerators. The single-controller programming model allows users simple access to much richer computation patterns. The resource management and scheduling layer permits the reintroduction of cluster management policies including multi-tenant sharing, virtualization and elasticity, all tailored to the requirements of ML workloads and accelerators.
  • Their micro-benchmarks show interleaving of concurrent client workloads and efficient pipelined execution, convincingly demonstrating that the system mechanisms they have built are fast and flexible, and form a solid basis for research into novel policies to make use of them. They demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
  • They have shown that careful system design and engineering lets them “get the best of both worlds”, matching performance on today’s ML models while delivering the features needed to write the models of tomorrow.
PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
  • Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets.
  • This paper by Leng et al. in ICLR 2022 proposes a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions, motivated by how functions can be approximated via Taylor expansion. Under polynomial expansion, focal loss is a horizontal shift of the polynomial coefficients compared to the cross-entropy loss. Motivated by this new insight, they explore an alternative dimension, i.e., vertically modifying the polynomial coefficients.
  • PolyLoss provides flexible ways of changing the loss function shape by adjusting the polynomial coefficients depending on the target tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases.
  • Extensive experimental results show that the optimal choice within the PolyLoss is indeed dependent on the task and dataset.
  • By simply adjusting the leading polynomial coefficient with just one extra hyperparameter and adding one line of code, the Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.
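  • That one-line change is the Poly-1 loss, sketched below in PyTorch: the extra hyperparameter \(\epsilon\) weights the leading \((1 - p_t)\) term, and its best value is task-dependent.

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, labels, epsilon=1.0):
    """Poly-1 loss: cross-entropy plus an extra weight on the leading
    polynomial term (1 - p_t), where p_t is the probability of the true class."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()
```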

RL

2022

Transdreamer: Reinforcement Learning With Transformer World Models
  • The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so.
  • This paper by Chen et al. from Rutgers and KAIST in 2022 proposes TransDreamer, a transformer-based MBRL agent. They first introduce the Transformer State-Space Model (TSSM), the first transformer-based stochastic world model that leverages a transformer for dynamics predictions. Then, they share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent.
  • TransDreamer shows comparable performance with Dreamer on DMC and Atari tasks that do not require long-term memory. However, when the proposed model is applied to Hidden Order Discovery, involving both 2D visual RL and 3D first-person visual RL, which require long-range memory access for memory-based reasoning (i.e., long-term complex memory interactions), the proposed model outperforms Dreamer on these complex tasks.
  • They also show that the image generation and reward prediction of TSSM are better than Dreamer’s, both qualitatively and quantitatively.

Selected Papers

Vision

2015

Learning Deep Features for Discriminative Localization
  • This paper by Zhou et al. from MIT CSAIL in 2015 is an explainable-AI method that seeks to answer what vision models “see” in images. They propose Class Activation Maps (CAM) which is a nifty visualization technique originally introduced for CNNs where the predicted class score is mapped back to the previous convolutional layer to generate the CAM. The CAM highlights the class-specific discriminative regions used by CNN to identify the category or class.
  • They revisit the global average pooling layer proposed earlier, and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. This enables classification-trained CNNs to learn to perform object localization, without using any bounding box annotations. While this technique was previously proposed as a means for regularizing training, they find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
  • Furthermore, they demonstrate that the CAM localization technique generalizes to other visual recognition tasks, i.e., it produces generic localizable deep features that can aid other researchers in understanding the basis of discrimination used by CNNs for their tasks.
  • Later, there were several variants of similar explainable-AI methods (such as GradCAM, Saliency Maps and Integrated Gradients) that were introduced.
  • Despite the apparent simplicity of global average pooling, they are able to achieve 37.1% top-5 error for weakly supervised object localization on the ILSVRC 2014 benchmark, demonstrating that global average pooling CNNs can perform accurate object localization. Note that this is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach.
  • They demonstrate that their network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
  • Unrelated to the paper but a similar approach for vision transformers was recently proposed. CNN uses pixel arrays, whereas ViT splits the images into patches, i.e., visual tokens. The visual transformer divides an image into fixed-size patches, correctly embeds each of them, and includes positional embedding as an input to the transformer encoder. So CAM will indicate what regions of the image the [CLS] token will use to discriminate between classes. Usage example.
  • The figure below from Prithvi Da summarizes the approach using ViT; the same approach is applicable to other vision-based transformers such as DeiT, BEiT, etc.
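  • A minimal NumPy sketch of CAM itself (illustrative): the feature maps of the final convolutional layer are weighted by the classifier weights of the chosen class (the weights learned on top of global average pooling) and summed over channels.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """CAM for one image.

    feature_maps: (C, H, W)       activations of the final conv layer
    fc_weights:   (n_classes, C)  weights of the linear layer after global average pooling
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)                      # keep class-positive evidence
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1] for display
```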

2016

Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
  • This paper by Luo et al. from UofT in NeurIPS 2016 studied the characteristics of the receptive field of units in CNNs and introduced the concept of effective receptive field.

2017

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  • This paper by Carreira and Zisserman from Google in CVPR 2017 introduced a new two-stream Inflated 3D ConvNet (I3D) architecture that incorporated both optical flow and RGB paths by inflating filters and pooling kernels of very deep image classification ConvNets from 2D to 3D, making it possible to learn seamless spatio-temporal feature extractors from video.
Densely Connected Convolutional Networks
  • This paper by Huang et al. from Cornell, Tsinghua, and FAIR in CVPR 2017 proposed DenseNet, which skip-connects all layers; the main difference from ResNets is that it performs concatenation-based skip connections instead of addition-based skip connections (as in ResNet).
  • The core idea behind DenseNet is feature reuse, which leads to very compact models. As a result, it requires fewer parameters than other CNNs, as there are no repeated feature maps.
  • They work around two concerns:
    • The feature maps have to be of the same size.
    • The concatenation with all the previous feature maps may result in memory explosion.
  • To address the first issue they propose two solutions:
    • Use conv layers with appropriate padding that maintain spatial dimensions (as in InceptionNet) or
    • Use dense skip connectivity only inside blocks called dense blocks.
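  • A compact PyTorch sketch of a dense block following the above (illustrative; the growth rate and layer count are assumptions): padding keeps spatial dimensions fixed so feature maps can be concatenated, and concatenation stays confined to the block.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer receives the concatenation of all previous feature maps
    and contributes `growth_rate` new channels."""
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),  # padding keeps H, W fixed
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # concatenation, not addition
            features.append(out)
        return torch.cat(features, dim=1)
```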

2018

Neural Discrete Representation Learning
  • This paper by Oord et al. from DeepMind in NeurIPS 2018 proposed the Vector Quantised-Variational AutoEncoder (VQ-VAE), a simple yet powerful generative model that combines VAEs with vector quantisation (VQ) to obtain discrete latent representations.
  • VQ-VAE differs from VAEs in two key ways: (i) the encoder network outputs discrete, rather than continuous, codes, and (ii) the prior is learnt rather than static.
  • In order to learn a discrete latent representation, they incorporate ideas from VQ. Using the VQ method allows the model to circumvent issues of “posterior collapse” – where the latents are ignored when they are paired with a powerful autoregressive decoder – typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
  • They show that VQ-VAEs are capable of modelling very long-term dependencies through their compressed discrete latent space, which they demonstrate by generating 128 x 128 colour images, sampling action-conditional video sequences, and finally using audio, where even an unconditional model can generate surprisingly meaningful chunks of speech and perform speaker conversion. All these experiments demonstrate that the discrete latent space learnt by VQ-VAEs captures important features of the data in a completely unsupervised manner.
  • Moreover, VQ-VAEs achieve likelihoods that are almost as good as their continuous latent variable counterparts on CIFAR10 data. They believe that this is the first discrete latent variable model that can successfully model long-range sequences and, in a fully unsupervised manner, learn high-level speech descriptors that are closely related to phonemes.
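  • A minimal PyTorch sketch of the quantization step (illustrative, not the authors’ code): encoder outputs are snapped to their nearest codebook entries, the codebook and commitment losses pull codes and encoder outputs toward each other, and a straight-through estimator copies gradients from the decoder input back to the encoder.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbor lookup into a learned codebook with a straight-through
    gradient estimator.

    z_e:      (batch, d)   continuous encoder outputs (flattened spatially)
    codebook: (K, d)       embedding vectors
    """
    distances = torch.cdist(z_e, codebook)          # squared-ish distances to all codes
    indices = distances.argmin(dim=1)
    z_q = codebook[indices]                         # quantized latents

    codebook_loss = F.mse_loss(z_q, z_e.detach())   # move codes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())     # keep encoder close to chosen codes
    vq_loss = codebook_loss + beta * commit_loss

    # Straight-through: copy gradients from the decoder input z_q back to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, vq_loss
```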

2019

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
  • This paper by Tan and Le from Google in ICML 2019 introduced EfficientNet which is all about engineering and scale. It proves that if you carefully design your architecture you can achieve top results with reasonable parameters. It’s incredible that EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152 with better accuracy!
  • Ideas from the paper:
    • With more layers (depth), one can capture richer and more complex features, but such models are hard to train (due to vanishing gradients).
    • Wider networks are much easier to train. They tend to be able to capture more fine-grained features but saturate quickly.
    • By training with higher resolution images, CNNs are able to capture more fine-grained details. Again, the accuracy gain diminishes for quite high resolutions.
    • Instead of finding the best architecture, the authors proposed to start with a relatively small baseline model and gradually scale up network depth (more layers), width (more channels per layer), resolution (input image) simultaneously using a technique called compound scaling that they propose.
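  • A tiny sketch of compound scaling; the constants \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\) are the values the paper reports from its grid search on the base network, and the code itself is illustrative.

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Grow depth, width, and resolution together with one coefficient phi.
    The constants satisfy alpha * beta^2 * gamma^2 ~ 2, so each increment of
    phi roughly doubles the FLOPs."""
    depth_mult = alpha ** phi       # more layers
    width_mult = beta ** phi        # more channels per layer
    resolution_mult = gamma ** phi  # larger input images
    return depth_mult, width_mult, resolution_mult

# e.g., one step up from the baseline network:
print(compound_scaling(1))   # (1.2, 1.1, 1.15)
```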

2020

Taming Transformers for High-Resolution Image Synthesis
  • Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. The authors demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
  • This paper by Esser et al. from the Heidelberg Collaboratory for Image Processing in 2020 proposed VQGAN which addresses the fundamental challenges that previously confined transformers to low-resolution images. VQGAN shows how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images.
  • VQGAN represents images as a composition of perceptually rich image constituents and thereby overcomes the infeasible quadratic complexity when modeling images directly in pixel space. Their approach uses a convolutional generator to learn a codebook of context-rich visual parts, whose composition is subsequently modeled with an autoregressive transformer architecture. A discrete codebook provides the interface between these architectures and a patch-based discriminator enables strong compression while retaining high perceptual quality. This method introduces the efficiency of convolutional approaches to transformer based high resolution image synthesis.
  • Modeling constituents with a CNN architecture and their compositions with a transformer architecture taps into the full potential of their complementary strengths, allowing VQGAN to present the first results on high-resolution image synthesis with a transformer-based architecture.
  • VQGAN is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image.
  • VQGAN demonstrates the efficiency of convolutional inductive biases and the expressivity of transformers by performing semantically-guided synthesis of megapixel images and outperforming state-of-the-art convolutional approaches and autoregressive models on class-conditional ImageNet.
  • Code and pretrained models can be found here.

Self-training with Noisy Student improves ImageNet classification
  • This paper by Xie et al. from Google and CMU in CVPR 2020 introduced teacher-student training. The paper proposed an iterative semi-supervised method using 300M unlabeled images called “noisy student training”, which can be described in 4 steps (a toy sketch follows the list):
    • Train a teacher model on labeled images.
    • Use the teacher to generate labels on 300M unlabeled images (pseudo-labels).
    • Train a student model on the combination of labeled images and pseudo labeled images.
    • Iterate from step 1, by treating the student as a teacher. Re-infer the unlabeled data and train a new student from scratch.
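  • The toy, runnable sketch below uses logistic regression in place of the paper’s large CNNs and input jitter in place of dropout/RandAugment noise; it only illustrates the loop structure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def noisy_student(X_lab, y_lab, X_unlab, n_iterations=3, noise_std=0.1, seed=0):
    """Toy stand-in for noisy student training with a simple classifier."""
    rng = np.random.default_rng(seed)
    teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)     # step 1
    for _ in range(n_iterations):
        pseudo_y = teacher.predict(X_unlab)                           # step 2: pseudo-labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo_y])
        X_noisy = X_all + noise_std * rng.standard_normal(X_all.shape)  # noise on the student
        student = LogisticRegression(max_iter=1000).fit(X_noisy, y_all)  # step 3
        teacher = student                                             # step 4: iterate
    return teacher
```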
Big Transfer (BiT): General Visual Representation Learning
  • This paper by Kolesnikov et al. from Google in ECCV 2020 introduced BiT which is a scalable ResNet-based model for efficient image pre-training.
  • They develop 3 BiT models (small, medium, and large) based on ResNet-152. For the large variation of BiT they used ResNet152x4, which means that each layer has 4 times more channels. They pretrained that model using far larger datasets than ImageNet. Specifically, the largest model was trained on the insanely large JFT dataset, which consists of 300M labeled images.
  • The major contribution in the architecture is the choice of normalization layers – the authors replace batch normalization with group normalization and weight standardization.
Multi-modal Dense Video Captioning
  • This paper by Iashin and Rahtu from Tampere University in CVPR Workshops 2020 introduced multi-modal dense video captioning.
Efficient Saliency Maps for Explainable AI
  • This paper by Mundhenk et al. from Lawrence Livermore National Lab and UC Berkeley in 2020 describes an explainable AI saliency map method for use with deep CNNs that is much more efficient than popular fine-resolution gradient methods. It is also quantitatively similar or better in accuracy.
  • Their technique works by measuring information at the end of each network scale which is then combined into a single saliency map. They describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. They visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods.
  • Using their method instead of Guided Backprop, coarse-resolution class activation methods such as Grad-CAM and GradCAM++ seem to yield demonstrably superior results without sacrificing speed. This will make fine-resolution saliency methods feasible on resource limited platforms such as robots, cell phones, low-cost industrial devices, astronomy and satellite imagery.

2021

Finetuning Pretrained Transformers into RNNs
  • This paper by Kasai et al. from UWash, Microsoft, DeepMind, and Allen AI in 2021 presented an idea of converting pre-trained transformers into RNNs, lowering memory cost while retaining high accuracy.
  • SyncedReview’s article.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
  • This paper by Akbari et al. from Google, Columbia, and Cornell in 2021 explored learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Furthermore, they also study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities.
Self-supervised learning for fast and scalable time series hyper-parameter tuning
  • This paper by Zhang et al. from Facebook in 2021 proposed a new self-supervised learning framework for model selection and hyperparameter tuning, which provides accurate forecasts with less computational time and resources.
Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More
  • This paper by Daghaghi et al. from Rice University in MLSys 2021 presented a CPU algorithm using locality sensitive hashing that trains deep neural networks up to 15 times faster than top GPU trainers.
Emerging Properties in Self-Supervised Vision Transformers
  • This paper by Caron et al. from Facebook in 2021 proposed DINO, a self-supervised vision transformer-based model that can discover and segment objects in an image or a video with absolutely no supervision and without being given a segmentation-targeted objective.
  • DINO works by interpreting self-supervision as a special case of self-distillation, where no labels are used at all. DINO is trained as a student network by simply matching the output of a teacher network over different views of the same image. By discovering object parts and shared characteristics across images, DINO learns a feature space that organizes itself in an interpretable way, with similar categories landing near one another. This suggests that DINO managed to connect categories based on visual properties, a bit like humans do.
  • TechCrunch’s article and Facebook AI article.
Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples
  • This paper by Assran et al. from Facebook in 2021 proposed PAWS, which combines some of the ideas of semi-supervised learning with the more traditional supervised method, essentially giving the training a boost by letting it learn from both labeled and unlabeled data.
  • PAWS is a method for semi-supervised learning that builds on the principles of self-supervised distance-metric learning. PAWS pre-trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled image are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly sampled labeled images. The distance between the view representations and labeled representations is used to provide a weighting over class labels, which they interpret as a soft pseudo-label. By non-parametrically incorporating labeled samples in this way, PAWS extended the distance-metric loss used in self-supervised methods such as BYOL and SwAV to the semi-supervised setting.
Enhancing Photorealism Enhancement
  • This paper by Richter et al. from Intel Labs in 2021 proposed an approach to enhancing the realism of synthetic images using a convolutional network that leverages intermediate representations produced by conventional rendering pipelines. The network is trained via a novel adversarial objective, which provides strong supervision at multiple perceptual levels.
  • The authors analyzed scene layout distributions in commonly used datasets and find that they differ in important ways. They hypothesize that this is one of the causes of strong artifacts that can be observed in the results of many prior methods. To address this, they propose a new strategy for sampling image patches during training.
  • Intel Lab’s article with sample A/B results and videos from the paper. Also, The Verge’s article on the idea.
FNet: Mixing Tokens with Fourier Transforms
  • This paper by Lee-Thorp et al. from Google in 2021 proposed replacing the self-attention sublayers with simple linear transformations that “mix” input tokens to significantly speed up the transformer encoder with limited accuracy cost.
  • More surprisingly, the team discovers that replacing the self-attention sublayer with a standard, unparameterized Fourier Transform achieves 92 percent of the accuracy of BERT on the GLUE benchmark, with training times that are seven times faster on GPUs and twice as fast on TPUs.
  • SyncedReview’s article on the idea.
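  • The parameter-free mixing sublayer is small enough to show in full; a minimal PyTorch sketch (illustrative) applies a 2D discrete Fourier transform over the sequence and hidden dimensions and keeps the real part.

```python
import torch

def fourier_mixing(x):
    """FNet-style token mixing: replace self-attention with a 2D DFT over the
    sequence and hidden dimensions, keeping only the real part. No learned parameters.

    x: (batch, seq_len, hidden)
    """
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
```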
Are Convolutional Neural Networks or Transformers more like human vision?
  • This paper by Tuli et al. from Princeton University, DeepMind, and UC Berkeley explored the extent to which different vision models correlate with human vision from an error consistency point-of-view. They conclude that the recently proposed Vision Transformer (ViT) networks not only outperform CNNs on accuracy for image classification tasks, but also have higher shape bias and are largely more consistent with human errors.
RegNet: Self-Regulated Network for Image Classification
  • The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradients flow through building blocks, the simple shortcut connection mechanism limits the ability to re-explore new, potentially complementary features due to the additive function.
  • This paper by Xu et al. in 2021 from Harbin Institute of Technology, University of Electronic Science and Technology of China, Singapore Management University, and Sichuan University addresses this issue by proposing a regulator module as a memory mechanism to extract complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., Convolutional LSTMs or Convolutional GRUs), which are shown to be good at extracting spatio-temporal information. They named the new regulated networks as RegNet.
  • The regulator module can be easily implemented and appended to any ResNet architecture. They also apply the regulator module to improve the Squeeze-and-Excitation ResNet, showing the generalization ability of their method. Experimental results on three image classification datasets demonstrate the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.
Multiscale Vision Transformers
  • This paper by Fan et al. from Facebook AI and UC Berkeley presents Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
  • Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity/feature complexity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features.
  • They evaluate this fundamental architectural prior for modeling the dense nature of visual signals on a variety of video recognition tasks, where it outperforms single-scale vision transformers for video and image recognition that rely on large-scale external pre-training and are 5-10x more costly in computation and parameters. They further remove the temporal dimension and apply the model to image classification, where it outperforms prior work on vision transformers. In empirical evaluation, MViT shows a fundamental advantage over single-scale vision transformers for video and image recognition.
  • Github repo.

Text

2015

Effective Approaches to Attention-based Neural Machine Translation
  • This paper by Luong et al. from Manning’s lab in EMNLP 2015 described a few more attention models that offer improvements and simplifications compared to Bahdanau attention.
  • They describe a few “global attention” models, the distinction between them being the way the attention scores are calculated.
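  • A minimal PyTorch sketch of global attention with two of the paper’s scoring functions (“dot” and “general”); the shapes and the bilinear matrix W are illustrative assumptions.

```python
import torch

def luong_global_attention(decoder_state, encoder_states, score="dot", W=None):
    """Global attention over all encoder states.

    decoder_state:  (batch, d)
    encoder_states: (batch, src_len, d)
    W:              (d, d) bilinear matrix, only used by the "general" score
    """
    if score == "dot":
        scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)
    elif score == "general":
        scores = torch.bmm(encoder_states, (decoder_state @ W).unsqueeze(2)).squeeze(2)
    else:
        raise ValueError("unsupported scoring function in this sketch")
    weights = torch.softmax(scores, dim=1)                      # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states)   # (batch, 1, d)
    return context.squeeze(1), weights
```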

2018

Generating Wikipedia by Summarizing Long Sequences
  • This paper by Liu et al. from Google Brain in ICLR 2018 shows that generating English Wikipedia articles can be approached as a multi-document summarization problem with a large, parallel dataset, and demonstrated a two-stage extractive-abstractive framework for carrying it out. They perform coarse extraction by using extractive summarization to identify salient information in the first stage and a neural decoder-only sequence transduction model for the abstractive stage, capable of handling very long input-output examples.
  • For the abstractive model, they introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction, allowing them to condition on many reference documents and to generate fluent and coherent multi-sentence paragraphs and even whole Wikipedia articles.
  • When given reference documents, they show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.

2019

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  • While BERT and RoBERTa have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS), they require that both sentences be fed into the network, which causes a massive computational overhead. Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.
  • This paper by Reimers and Gurevych from Technische Universität Darmstadt in 2019 presented Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
  • They showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. In fact, the performance for seven STS tasks was below the performance of average GloVe embeddings.
  • SBERT fine-tunes BERT in a siamese/triplet network architecture. They evaluated the quality on various common benchmarks, where it achieves a significant improvement over state-of-the-art sentence embedding methods. Replacing BERT with RoBERTa did not yield a significant improvement in their experiments.
  • They evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods while being computationally efficient: on a GPU, SBERT is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks which are computationally infeasible to model with BERT, such as clustering of 10,000 sentences with hierarchical clustering (BERT needs 65 hours, while SBERT needs about 5 seconds).
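  • For illustration, the snippet below shows how SBERT-style embeddings are typically used for similarity search via the sentence-transformers library; the specific model checkpoint is an assumption, not the one from the paper.
```python
# Each sentence is mapped independently to a fixed-size vector, so pairwise
# similarity search reduces to cheap cosine comparisons.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint for illustration

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings, embeddings)  # (3, 3) similarity matrix
print(cosine_scores)
```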

2020

Efficient Transformers: A Survey
  • Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture, many of which make improvements around computational and memory efficiency.
  • This paper by Tay et al. from Google in 2020 characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Towards a Human-like Open-Domain Chatbot
  • This paper by Adiwardana et al. from Google in 2020 presented Meena, which is an end-to-end, neural conversational model that learns to respond sensibly to a given conversational context. The training objective is to minimize perplexity, the uncertainty of predicting the next token (in this context, the next word in a conversation).
  • Google AI’s article.

2021

Pretrained Transformers As Universal Computation Engines
  • This paper by Lu et al. from UC Berkeley, FAIR, and Google Brain in 2021 investigated the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning – in particular, without finetuning the self-attention and feedforward layers of the residual blocks – and applied this model to numerical computation, vision, and protein fold prediction.
  • In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, the authors showed that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks. In particular, the authors found that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
  • BAIR’s article; VentureBeat’s article; Yannic Kilcher’s video.
SimCSE: Simple Contrastive Learning of Sentence Embeddings
  • This paper by Gao et al. from Princeton University and Tsinghua University in 2021 presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings on semantic textual similarity tasks.
  • They first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. They find that dropout acts as minimal data augmentation and that removing it leads to a representation collapse (a minimal sketch of this unsupervised objective follows this entry).
  • Next, they propose a supervised approach utilizing NLI datasets, which incorporates annotated pairs from natural language inference datasets into their contrastive learning framework, by using “entailment” pairs as positives and “contradiction” pairs as hard negatives.
  • They evaluate SimCSE on standard semantic textual similarity (STS) tasks, and their unsupervised and supervised models using BERTbase achieve an average of 76.3% and 81.6% Spearman’s correlation respectively, a 4.2% and 2.2% improvement compared to previous best results. They also justify the inner workings of their approach both theoretically and empirically by analyzing the alignment and uniformity of SimCSE, demonstrating that their contrastive learning objective regularizes pre-trained embeddings’ anisotropic space to be more uniform, and that it better aligns positive pairs when supervised signals are available.
  • The key takeaway is that their contrastive objective, especially the unsupervised one, may have a broader application in NLP. It provides a new perspective on data augmentation with text input, and can be extended to other continuous representations and integrated in language model pre-training.
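  • A minimal sketch of the unsupervised SimCSE objective, assuming a Hugging Face BERT encoder with [CLS] pooling (the pooling choice and hyperparameters are assumptions): the same batch is encoded twice so that dropout provides two views of each sentence, and an InfoNCE loss over in-batch negatives pulls the two views together.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active; it is the only "augmentation"

sentences = ["The cat sat on the mat.", "Dogs love to play fetch.", "It is raining today."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

def embed(batch):
    # Use the [CLS] representation as the sentence embedding (one common choice).
    return encoder(**batch).last_hidden_state[:, 0]

z1, z2 = embed(batch), embed(batch)      # two stochastic forward passes
temperature = 0.05
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
labels = torch.arange(len(sentences))    # positives sit on the diagonal
loss = F.cross_entropy(sim, labels)      # InfoNCE over in-batch negatives
loss.backward()
```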
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
  • Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant.
  • This paper by Giorgi et al. from UofT in 2021 presents DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Similar to SimCSE, DeCLUTR learns high-quality sentence embeddings in a self-supervised fashion, the quality of which is equal to or better than that obtained in a supervised setting.
  • Inspired by recent advances in deep metric learning (DML), they design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, their approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Their experiments suggest that the quality of the learned embeddings scale with both the number of trainable parameters and the amount of unlabelled training data.
  • They demonstrated the effectiveness of their objective by evaluating the learned embeddings on the SentEval benchmark, which contains a total of 28 tasks designed to evaluate the transferability and linguistic properties of sentence representations.
  • Their experiments suggest that the learned embeddings’ quality can be further improved by increasing the model and train set size. Together, their results demonstrate the effectiveness and feasibility of replacing hand-labelled data with carefully designed self-supervised objectives for learning universal sentence embeddings.
  • Github repo with code and pretrained models.

2022

A Causal Lens for Controllable Text Generation
  • This paper by Hu and Li from UCSD and Amazon introduces a novel approach to conditional text generation that leverages causal inference principles to mitigate the effects of spurious correlations.
  • Conditional text modeling is hard. Natural language documents tend to contain large amounts of complex unstructured information, most of which is implicit.
  • Controllable text generation concerns two fundamental tasks of wide application, namely generating text with given attributes (i.e., attribute-conditional generation), and minimally editing existing text to possess desired attributes (i.e., text attribute transfer). Historically, attribute-conditional text generation and attribute transfer were perceived as two independent tasks and approached individually, with different conditional models which, however, are prone to producing biased text (e.g., various gender stereotypes). The authors propose a unifying causal framework that formulates controllable text generation from a principled causal perspective, modeling the two tasks within a single framework for generation and transfer, based on structural causal models (SCMs).
  • A direct advantage of the causal formulation is the use of rich causality tools to mitigate generation biases and improve control. They treat the two tasks as interventional and counterfactual causal inference based on a structural causal model, respectively. They propose to model attribute-conditional text generation as intervention, using Judea Pearl’s \(do\) operator. Hence, the attribute-conditional distribution becomes \(P(x\|do(a))\) rather than the purely association-based \(P(x\|a)\), where \(x\) is a text and \(a\) is an attribute (an intervention). Two more variables are used in the paper: \(z\), a multidimensional latent (unobserved) confounder, and \(c\), an observed proxy of \(z\).
  • Text attribute transfer is modeled as a counterfactual prediction, trying to answer the question: “what would the text have been if the attribute had been different?”
  • Training consists of four objectives: VAE objective to learn the causal model and three counterfactual objectives.
  • They apply the framework to the challenging practical setting where confounding factors (that induce spurious correlations) are observable only on a small subset (1%-5%) of training data with confounding labels for \(c\).
  • Results show that the proposed model achieves significantly better results than conventional conditional models in terms of control accuracy and reduced bias. This is true for both types of tasks: attribute-conditional generation and attribute transfer.
SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples
  • This paper by Wang et al. from the National University of Defense Technology, SenseTime, and The University of Hong Kong in 2022 proposes a new contrastive sentence embedding framework called SNCSE.
  • Applying contrastive learning techniques to sentence embeddings has proven to be a great way to improve their semantic and classification properties. For a given sentence, current models utilize diverse data augmentation methods to generate positive samples, while treating other, independent sentences as negative samples. They then adopt the InfoNCE loss to pull the embeddings of positive pairs together and push those of negative pairs apart.
  • Although these models have made great progress on sentence embedding, the authors argue that contrastive losses are not sensitive enough to distinguish and decouple textual and semantic similarity. As a result, methods deploying traditional contrastive losses overestimate the semantic similarity of any pair with similar textual content, regardless of the actual semantic difference between them. This is because positive pairs in unsupervised contrastive learning come with similar and even the same textual meaning through data augmentation.
  • Consider negation: adding a simple “not” to a sentence does not change its textual properties much, but can drastically change its semantics. The authors argue that the traditional contrastive loss leads to feature suppression, making models fail to decouple the textual and semantic aspects of a sentence. To address this issue, the authors propose contrastive learning for unsupervised sentence embedding with soft negative samples (SNCSE) - samples with different semantic content (hence “negative”) and very high textual similarity (hence “soft”).
  • Moreover, the authors propose an additional loss component - bidirectional margin loss (BML) - to model semantic differences between positive and soft negative samples, while retaining InfoNCE as the loss for regular positive-negative pairs. BML helps introduce soft negative examples into the traditional contrastive learning framework (a hedged sketch follows this entry).
  • To obtain these soft negative samples, the authors construct negations of the positive examples using a rule-based system.
  • SNCSE achieves state-of-the-art performance on the semantic textual similarity (STS) task with an average Spearman’s correlation coefficient of 78.97% on BERTbase and 79.23% on RoBERTabase, an improvement compared to other contrastive methods (e.g., SimCSE). Finally, they adopt a rank-based error analysis method to detect the weaknesses of SNCSE.
  • Github repo.
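  • A hedged sketch of one plausible form of the bidirectional margin loss: the margin values, sign conventions, and exact formulation used in SNCSE may differ; the intent is only to show how the similarity gap between a positive and a soft negative can be constrained to a band rather than simply pushed apart.
```python
import torch
import torch.nn.functional as F

def bidirectional_margin_loss(anchor, positive, soft_negative, lower=0.0, upper=0.4):
    # anchor, positive, soft_negative: (batch, dim) sentence embeddings.
    # Margins "lower" and "upper" are assumed values for illustration only.
    delta = (F.cosine_similarity(anchor, soft_negative)
             - F.cosine_similarity(anchor, positive))   # expected to be negative
    # Penalize delta when it leaves the band [-upper, -lower]: the soft negative
    # should be less similar than the positive, but not pushed arbitrarily far.
    return (F.relu(delta + lower) + F.relu(-delta - upper)).mean()
```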
LaMDA: Language Models for Dialog Applications
  • This paper by Thoppilan et al. from Google in 2022 proposes safe, grounded, and high-quality dialog models for open-ended applications.
  • Language models are becoming more capable than ever before and are helpful in a variety of tasks — translating one language into another, summarizing a long document into a brief highlight, or answering information-seeking questions. Among these, open-domain dialog, where a model needs to be able to converse about any topic, is probably one of the most difficult, with a wide range of potential applications and open challenges. In addition to producing responses that humans judge as sensible, interesting, and specific to the context, dialog models should adhere to Responsible AI practices, and avoid making factual statements that are not supported by external information sources.
  • Defining objectives and metrics is critical to guide training dialog models. LaMDA has three key objectives — Quality, Safety, and Groundedness — each of which they measure using carefully designed metrics as follows.
    • Quality: They decompose Quality into three dimensions, Sensibleness, Specificity, and Interestingness (SSI), which are evaluated by human raters. Sensibleness refers to whether the model produces responses that make sense in the dialog context (e.g., no common sense mistakes, no absurd responses, and no contradictions with earlier responses). Specificity is measured by judging whether the system’s response is specific to the preceding dialog context, and not a generic response that could apply to most contexts (e.g., “ok” or “I don’t know”). Finally, Interestingness measures whether the model produces responses that are also insightful, unexpected or witty, and are therefore more likely to create better dialog.
    • Safety: Safety is essential for responsible AI. Their Safety metric is composed of an illustrative set of safety objectives that captures the behavior that the model should exhibit in a dialog. These objectives attempt to constrain the model’s output to avoid any unintended results that create risks of harm for the user, and to avoid reinforcing unfair bias. For example, these objectives train the model to avoid producing outputs that contain violent or gory content, promote slurs or hateful stereotypes towards groups of people, or contain profanity. Their research towards developing a practical Safety metric represents very early work, and there is still a great deal of progress to be made in this area.
    • Groundedness: The current generation of language models often generate statements that seem plausible, but actually contradict facts established in known external sources. This motivates their study of groundedness in LaMDA. Groundedness is defined as the percentage of responses with claims about the external world that can be supported by authoritative external sources, as a share of all responses containing claims about the external world. A related metric, Informativeness, is defined as the percentage of responses with information about the external world that can be supported by known sources, as a share of all responses. Therefore, casual responses that do not carry any real world information (e.g., “That’s a great idea”), affect Informativeness but not Groundedness. While grounding LaMDA generated responses in known sources does not in itself guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source.
  • With the objectives and metrics defined, they describe LaMDA’s two-stage training: pre-training and fine-tuning. In the fine-tuning stage, they train LaMDA to perform a mix of generative tasks to generate natural-language responses to given contexts, and classification tasks on whether a response is safe and high-quality, resulting in a single multi-task model that can do both. The LaMDA generator is trained to predict the next token on a dialog dataset restricted to back-and-forth dialog between two authors, while the LaMDA classifiers are trained to predict the Safety and Quality (SSI) ratings for the response in context using annotated data. During a dialog, the LaMDA generator first generates several candidate responses given the current multi-turn dialog context, and the LaMDA classifiers predict the SSI and Safety scores for every response candidate. Candidate responses with low Safety scores are first filtered out. Remaining candidates are re-ranked by their SSI scores, and the top result is selected as the response.
  • They observe that LaMDA significantly outperforms the pre-trained model in every dimension and across all model sizes. A sketch of the candidate filtering and re-ranking step described above follows this entry.
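  • The following sketch illustrates the generate-filter-rerank response selection described above; the generator, classifiers, threshold, and fallback response are hypothetical stand-ins rather than LaMDA’s actual components.
```python
from typing import Callable, List

def select_response(
    context: str,
    generate_candidates: Callable[[str], List[str]],   # hypothetical generator
    safety_score: Callable[[str, str], float],          # hypothetical safety classifier
    ssi_score: Callable[[str, str], float],             # hypothetical SSI (quality) classifier
    safety_threshold: float = 0.8,                      # assumed threshold
) -> str:
    candidates = generate_candidates(context)
    # Filter out candidates the safety classifier scores below the threshold.
    safe = [c for c in candidates if safety_score(context, c) >= safety_threshold]
    if not safe:
        return "I'm not sure how to respond to that."   # hypothetical fallback
    # Re-rank the remaining candidates by their SSI score and return the top one.
    return max(safe, key=lambda c: ssi_score(context, c))
```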
Causal Inference Principles for Reasoning about Commonsense Causality
  • Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies wholeheartedly on deep language models and is potentially susceptible to confounding co-occurrences.
  • This paper by Zhang et al. from UPenn in 2022 articulates CCR from a completely new perspective using classical causal principles. Their contributions include (i) a novel commonsense causality framework; (ii) mitigating confounding co-occurrences by matching temporal propensities; (iii) a modular pipeline for zero-shot CCR with demonstrated effectiveness.
  • They propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities on various datasets.

Speech

2017

On Evaluating and Comparing Conversational Agents
  • This paper by Venkatesh et al. from Amazon in 2017 proposes a comprehensive evaluation strategy using multiple metrics which correlate well with human judgement and are thus designed to reduce subjectivity, for non goal-oriented conversations. The proposed metrics provide granular analysis of the conversational agents, which is not captured in human ratings. They show that these metrics can be used as a reasonable proxy for human judgment.
  • They propose the following evaluation metrics:
    • Conversational User Experience: Measure of the overall interaction experience. Conversations with a socialbot can be significantly different from those with humans because of expectations, behavior or sentiment, trust and visual cues.
    • Engagement: To enable an open-ended, multi-turn conversation, engagement is critical. Engagement is a measure of interestingness in a conversation. Other models also term this as interestingness.
    • Coherence: A coherent response indicates a comprehensible and relevant response to a user’s request. Other models also term this as specificity.
    • Conversational Depth: Coherence is usually measured at turn level. However, in a multi-turn conversation, context may be carried over multiple turns. While evaluating conversational agents, it is important to detect the context and the depth of the conversations.
    • Topical Diversity: A good conversational agent is capable of: (i) identifying the topics and keywords in a given utterance, (ii) holding conversations around the same topics, (iii) sharing related concepts, and (iv) identifying the appropriate intent.
    • Domain Coverage: An agent which is able to interact on multiple domains can be considered more consistent with human expectations.
  • They provide a mechanism to unify the metrics for selecting the top performing agents, which has also been applied throughout Amazon’s Alexa Prize competition.
  • To date, this study offers the largest setting for evaluating agents, with millions of conversations and hundreds of thousands of ratings from users. They believe that this work is a step towards an automatic evaluation process for conversational AIs.

2018

Attention-Based Models for Text-Dependent Speaker Verification
  • This paper by Chowdhury et al. from Washington State and Google in 2018 proposes using attention-based models for a keyword-based text-dependent speaker verification (SV) system. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. Examples of such TD-SV phrases could be trigger keywords for voice assistants, such as “Hey Siri”, “Alexa”, or “OK Google”. In this study, they focus on “OK Google” and “Hey Google”.
  • A challenge in prior architectures is that silence and background noise are not well handled. Even though the SV system runs on short, sub-second windows segmented by a keyword detector, the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built only using the frames corresponding to phonemes. To remedy this, they propose to use an attention layer as a soft mechanism to emphasize the most relevant elements of the input sequence (a minimal sketch of such attention pooling appears after this entry).
  • Their training dataset is a collection of anonymized user voice queries, which is a mixture of “OK Google” and “Hey Google”. It has around 150M utterances from around 630K speakers.
  • Attention helps summarize relevant information that occurs through the entire length of an input sequence. This paper also experiments with different attention mechanisms apart from the basic attention: cross-layer attention, and divided-layer attention. For cross-layer attention, the scores and weights are not computed using the outputs of the last LSTM layer but the outputs of the second-to-last layer. However, the d-vector is still the weighted average of the last layer output.
  • For divided-layer attention, they double the dimension of the last layer LSTM output, and equally divide its dimension into two parts. They use one part to build the d-vector, while using the other to learn the scores.
  • From their experimental results, the best practices are to: (i) use a shared-parameter non-linear scoring function; (ii) use a divided-layer attention connection to the last layer output of the LSTM; and (iii) apply sliding-window max-pooling on the attention weights. After combining all these best practices, they improved the EER of the baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement.
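  • A minimal sketch of basic attention pooling over LSTM frame outputs to form a d-vector, in the spirit of the approach above; the scoring network and layer sizes are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentivePooling(nn.Module):
    def __init__(self, hidden_size: int = 128):
        super().__init__()
        # Small scoring network producing one scalar score per frame (an assumed form).
        self.scorer = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                    nn.Linear(hidden_size, 1))

    def forward(self, lstm_out):
        # lstm_out: (batch, frames, hidden) outputs of the last LSTM layer
        scores = self.scorer(lstm_out)                 # (batch, frames, 1)
        weights = torch.softmax(scores, dim=1)         # emphasize phoneme frames over silence
        d_vector = (weights * lstm_out).sum(dim=1)     # weighted average over frames
        return F.normalize(d_vector, dim=-1)           # length-normalized speaker embedding
```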
Efficient Voice Trigger Detection for Low Resource Hardware
  • This paper by Sigtia et al. from Apple in Interspeech 2018 describes the architecture of an always-on DNN-HMM system for on-device keyword spotting (KWS) in low-resource conditions, i.e., for battery-powered mobile devices.
  • An always-available voice assistant needs a carefully designed voice keyword detector to satisfy the power and computational constraints of battery powered devices. They employ a multi-stage system that uses a low-power primary stage to decide when to run a more accurate (but more power-hungry) secondary detector. They describe a straightforward primary detector and explore variations that result in very useful reductions in computation (or increased accuracy for the same computation). By reducing the set of target labels from three to one per phone, and reducing the rate at which the acoustic model is operated, the compute rate can be reduced by a factor of six while maintaining the same accuracy.
  • When the device is battery powered like the iPhone or the Apple Watch, it is imperative that the voice trigger detector consume as little power as possible while still maintaining sufficient accuracy. In recent iPhone designs, this is achieved by running a primary detector on a low-power processor that runs even when the main processor is asleep. This primary detector can decide to wake the main processor, where further checks are done (on the same waveform) before the main recognizer is applied and the identity of the speaker is confirmed. This paper focuses only on the primary detector which runs continuously on a low-power, low resource, always-on processor where computation and memory are the limiting factors.
  • It has been demonstrated that LVCSR systems trained to predict whole-phone labels (a single label per phone) can achieve accuracies similar to conventional systems with 3 labels per phone. However, naively implementing this approach for the voice trigger task yields a significant loss in accuracy. The authors hypothesize that the reason is the need for a minimum duration constraint for each phone. To test this hypothesis, they replicate each state in the trigger phrase HMM by a constant factor while still using the whole-phone DNN, which is equivalent to imposing a minimum duration on each of the labels. This yields accuracy similar to the baseline.
  • An alternative way to impose longer minimum durations for each state is to run the detector at a rate lower than 100 FPS. This results in longer intervals between predictions, which effectively increases the minimum duration of the HMM states. For on-device KWS, operating the detectors at a lower frame rate is an attractive route for limiting the computation performed by the system.
  • Their results demonstrate that for a voice trigger detection problem, it is not necessary to divide phone labels into 3 states for the beginning, middle, and end of each phone. They achieve results similar to the baseline with a single label per phone and minimum duration constraints. This principle has been previously demonstrated for LVCSR with LSTM AMs, but their results demonstrate that the same holds true for DNN AMs with large input windows. As a practical consequence, they are able to run the detectors at frame rates as low as 16.6 FPS without any loss in accuracy compared to the baseline. This represents a factor of 6 reduction in computation, which is significant when the system is deployed on low-resource hardware. Alternatively, they can run a detector 6 times as large as the baseline without any extra computation.

2020

Automatic Speaker Recognition with Limited Data
  • This paper by Li et al. from UCLA, Tongji, and Amazon in WSDM 2020 proposes an adversarial few-shot learning-based speaker identification method that needs only a limited number of training instances.
  • They employ metric learning-based few-shot learning to learn speaker acoustic representations using a support module and a query module, where the limited instances are comprehensively utilized to improve identification performance. To that end, they first randomly sample a set of speakers from the training set to construct the support module. For each speaker in the support module, they further randomly sample pieces of his/her audio instances and derive the corresponding MFCCs. These MFCCs are fed into an embedding layer so that each audio instance is represented by a fixed-length vector. In the support module, they then derive a representative embedding for each speaker, which summarizes the acoustic biometrics of that speaker. This is done using an attention layer that learns importance weights over each audio embedding of a particular speaker.
  • In the query module, they randomly select a piece of audio from a speaker, which is one of the speakers in the support module. They feed it into the embedding layer to derive the audio embedding.
  • They then compare the distances between the query embedding and all the representative embeddings in the support module. These distances are used to compute a distribution over all speakers in the support module. The model is optimized by such iterative comparisons and reasoning between the support and query modules (see the sketch after this entry).
  • Furthermore, adversarial learning is applied to further enhance the generalization and robustness of speaker identification with adversarial examples. The goal of employing adversarial training is to allow the identification system to not only be optimized on the instances in the training data, but also to be robust to unseen adversarial perturbations. To enhance robustness, they force the model to perform consistently well even when adversarial perturbations are present. To achieve this goal, they further optimize the model to minimize the objective function with the perturbed parameters.
  • Experiments conducted on the publicly available, large-scale LibriSpeech dataset demonstrate that the proposed method, AFEASI, significantly outperforms eleven baseline methods.
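  • A hedged sketch of the metric-learning step described above: attention-weighted speaker prototypes are built from support embeddings, and a query is classified by a distance-based distribution over speakers. The dimensions and the use of squared Euclidean distance are assumptions.
```python
import torch
import torch.nn.functional as F

def speaker_prototypes(support, attn_scores):
    # support: (speakers, shots, dim) audio embeddings per speaker
    # attn_scores: (speakers, shots) unnormalized importance weights from an attention layer
    weights = torch.softmax(attn_scores, dim=1).unsqueeze(-1)   # (speakers, shots, 1)
    return (weights * support).sum(dim=1)                       # (speakers, dim) representative embeddings

def classify_query(query, prototypes):
    # query: (dim,) embedding of a query utterance
    dists = ((prototypes - query) ** 2).sum(dim=-1)             # squared Euclidean distances (assumed metric)
    return F.softmax(-dists, dim=0)                             # distribution over support speakers
```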
Speaker Identification for Household Scenarios with Self-attention and Adversarial Training
  • This paper by Li et al. from Amazon, UCLA, and the University of Notre Dame in Interspeech 2020 proposes leveraging the self-attention mechanism to enhance long-span modeling of speaker characteristics, since self-attention allows the model to fully utilize dependencies over all frames in an utterance, resulting in informative global acoustic embedding representations. In contrast, CNNs by design are biased toward modeling features over nearby frames and frequencies, and RNNs are hard to train for retention of relevant information over long time intervals. These types of neural networks thus potentially face problems capturing dependencies and characteristics expressed over long time spans within an utterance.
  • Further, they utilize adversarial training as a tool to enhance the robustness and generalization of trained models, rather than as a defense against attacks.
  • To learn the self-attentive utterance representations, the utterance spectrograms are fed as input to the self-attention layer to learn transformed frame representations of speaker-relevant acoustic features, in two steps. First, they aim at mining correlations across frames in an utterance by having each transformed frame embedding be the weighted sum of the frame embedding of itself and other related frames, where each weight gauges the similarity between one frame and another. Second, they aggregate the frame embeddings, including their correlational information, by averaging them over the time dimension into one embedding vector and further L2-normalizing it into a fixed-length embedding vector that expresses the speaker-relevant information in the utterance. This yields a summarized global acoustic representation of an utterance (see the sketch after this entry).
  • Experiments conducted on the VCTK dataset show that the proposed model significantly outperforms four state-of-the-art baselines in identifying both known and new speakers in terms of EER.
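  • A rough sketch of the self-attentive utterance embedding described above (self-attention over frames, temporal averaging, then L2 normalization); the single attention layer and head count are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentiveUtteranceEncoder(nn.Module):
    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) spectrogram frames of an utterance
        attended, _ = self.self_attn(frames, frames, frames)  # each frame attends to all frames
        utterance = attended.mean(dim=1)                      # average over the time dimension
        return F.normalize(utterance, dim=-1)                 # fixed-length, L2-normalized embedding
```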
Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection
  • This paper by Higuchi et al. from Apple in 2020 proposes a stacked 1D convolutional neural network (S1DCNN) for end-to-end small footprint voice trigger detection in a streaming scenario. Voice trigger detection is an important speech application, with which users can activate their devices by simply saying a keyword or phrase. Due to privacy and latency reasons, a voice trigger detection system should run on an always-on processor on device. Therefore, having small memory and compute cost is crucial for a voice trigger detection system.
  • Recently, singular value decomposition filters (SVDFs) have been used for end-to-end voice trigger detection. SVDFs approximate a fully-connected layer with a low-rank approximation, which reduces the number of model parameters. In this work, the authors propose S1DCNN as an alternative approach for end-to-end small-footprint voice trigger detection.
  • An S1DCNN layer consists of a 1D convolution layer followed by a depth-wise 1D convolution layer. This is similar to the idea of depth-wise separable convolutions, where \(K\) filters (where \(K\) is the number of channels in the input) are applied to each channel of the input (leading to a depth-wise convolution) yielding the same number of channels as the input, followed by a point-wise convolution which uses a \(1 \times 1 \times K\) kernel, leading to an output shape that has a single channel. Applying as many point-wise convolution filters as the desired number of output channels yields the final output with far fewer multiplications than a standard convolution and fewer parameters than the baseline. As such, compared to a standard 2D CNN filter, the S1DCNN can be regarded as a factorization of a 2D CNN filter: an \(F \times K\) filter of the 2D CNN layer is factorized into an \(F \times 1\) filter of the first 1D CNN layer and a \(1 \times K\) filter of the second 1D CNN layer. This factorization reduces the number of parameters from \(O(F \times K)\) to \(O(F + K)\) (see the parameter-count comparison after this entry).
  • They show that the SVDF can be expressed as a special case of the S1DCNN layer. Experimental results show that the S1DCNN achieves a 19.0% relative false reject ratio (FRR) reduction with a similar model size and a similar time delay compared to the SVDF. By increasing the length of the future context (which leads to longer time delays), the S1DCNN further improves the FRR by up to 12.2% relative.
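  • The parameter-count comparison below illustrates the factorization; the channel count and kernel sizes are illustrative assumptions.
```python
import torch.nn as nn

C, F, K = 32, 5, 9  # illustrative channel count and kernel sizes

full_2d = nn.Conv2d(C, C, kernel_size=(F, K))            # standard F x K 2D filters
stage_1 = nn.Conv2d(C, C, kernel_size=(F, 1))            # 1D conv stage: F x 1 filters
stage_2 = nn.Conv2d(C, C, kernel_size=(1, K), groups=C)  # depth-wise 1D conv stage: 1 x K filters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_2d))                   # 46,112 parameters (O(F * K) scaling)
print(count(stage_1) + count(stage_2))  # 5,472 parameters (O(F + K) scaling)
```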
Optimize What Matters: Training DNN-HMM Keyword Spotting Model Using End Metric
  • In DNN-HMM based KWS models, the DNN computes the observation probabilities and outputs a probability distribution over as many classes as the HMM states for each speech frame using a softmax layer. The DNN is typically trained to minimize the average (over all frames) cross-entropy loss between the predicted and the ground-truth distributions. The HMM decoder computes the word detection score using the observation, the state transition, and the prior probabilities. This training ignores the HMM transition and prior probabilities which are learned independently using training data statistics.
  • Such an independently trained DNN model relies on the accuracy of the ground-truth phoneme labels as well as the HMM model. This model also assumes that the set of keyword states are optimal and each state is equally important for the keyword detection task. The DNN spends all of its capacity focusing equally on all of the states, without considering its impact on the final metric of the detection score, resulting in a loss-metric mismatch.
  • This paper by Shrivastava et al. from Apple in 2021 seeks to address this loss-metric mismatch by training the DNN model by directly optimizing the keyword detection score instead of optimizing for the state probabilities.
  • This end-metric based training uses only the start and the end of the keyword instead of requiring all of the speech frames to be annotated, leading to substantial savings in annotation cost. Their method changes only the training algorithm without changing the inference pipeline; therefore, there is no overhead in runtime memory or compute, since only the model parameters need to be updated.
  • They use a hinge loss on the detection score, which ignores samples during optimization if their scores are already beyond a margin.
  • Further, they propose IOU-based sampling and design an optimization procedure that maximizes the detection score for a speech segment that “tightly” contains the keyword (positive samples) and minimizes the detection score for speech that does not contain the keyword (negative samples). They also sample additional hard negatives that contain partial keywords, because the model should not trigger on partial phrases. To formalize the concept of “tightly” containing the keyword, they use the concept of intersection-over-union (IOU) borrowed from computer vision. They sample positive and negative windows from speech utterances such that the positive windows have high IOU and negative windows have low IOU with the ground-truth keyword window (a sketch of the IOU-based sampling and hinge loss follows this entry).
  • The proposed approach works significantly better (> 70% relative reduction in FRR) than the conventional DNN-HMM training and is more interpretable, accurate in localization, and data-efficient compared to the CNN-based end-to-end models.
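  • A hedged sketch of the IOU-based window sampling and a hinge-style loss on the detection score; the thresholds, margin, and window representation are illustrative assumptions, not the paper’s exact values.
```python
import torch
import torch.nn.functional as F

def iou(window_a, window_b):
    # Windows are (start, end) pairs in frames.
    inter = max(0.0, min(window_a[1], window_b[1]) - max(window_a[0], window_b[0]))
    union = (window_a[1] - window_a[0]) + (window_b[1] - window_b[0]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, keyword_window, pos_iou=0.7, neg_iou=0.3):
    # Windows tightly containing the keyword are positives; windows with little
    # overlap (including partial-keyword hard negatives) are negatives.
    overlap = iou(window, keyword_window)
    if overlap >= pos_iou:
        return 1.0
    if overlap <= neg_iou:
        return -1.0
    return 0.0  # ambiguous windows are skipped

def hinge_loss(detection_scores, labels, margin=1.0):
    # Samples whose scores are already beyond the margin contribute zero loss.
    labels = labels.to(detection_scores.dtype)
    return F.relu(margin - labels * detection_scores)[labels != 0].mean()
```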

2021

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation
  • This paper by Garg et al. from Apple in 2021 presented a unified and hardware-efficient architecture for two-stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two-stage VTD systems of voice assistants can get falsely activated by audio segments acoustically similar to the trigger phrase of interest. FTM systems cancel such activations by using post-trigger audio context. Traditional FTM systems rely on automatic speech recognition lattices, which are computationally expensive to obtain on device.
  • They proposed a streaming transformer (TF) encoder architecture, which progressively processes incoming audio chunks and maintains audio context to perform both VTD and FTM tasks using only acoustic features.
Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching
  • This paper by Punjabi et al. from Amazon in 2021 proposes joint ASR-LID architectures based on RNN-Ts as an efficient, on-device-suitable alternative to conventional dynamic language switching solutions. Two primary joint modeling paradigms are explored: coupled training, where ASR and LID vocabularies share the RNN-T output space, and multi-task learning, where ASR and LID losses are modeled using dedicated parameters but minimized jointly.
  • The corpus used for RNN-T training consists of in-house, far-field, de-identified voice-assistant recordings amounting to about 3.8k and 12.5k hours of spoken Hindi and Indian English data, respectively. The acoustic LID classifier (used for baseline LID and for providing language representations to RNN-T) is trained using 2k hours of balanced English-Hindi data.
  • Experiments with Indian English and spoken Hindi show that: (a) code-switched utterances are inherently difficult to recognize and classify, (b) multi-task learning provides superior ASR performance whereas coupled training offers better LID accuracy, and (c) multi-task models with a dedicated LID feed-forward network offer the best performance overall.
  • The proposed joint ASR-LID architectures are language agnostic and, in principle, can be scaled to more than two languages.
Robust Self-Supervised Audio-Visual Speech Recognition
  • Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available.
  • This paper by Shi et al. from FB Research in 2022 introduces a self-supervised AVSR framework based on Audio-Visual Hidden-unit BERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, to tackle the problem of audio-based automatic speech recognition (ASR) degrading significantly in noisy environments and being particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
  • On the largest available AVSR benchmark dataset LRS3, AV-HuBERT approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
  • Facebook AI link.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
  • This paper by Hsu et al. from FB Research in 2021 introduces Hidden-unit BERT (HuBERT), a self-supervised speech representation learning approach for speech recognition, generation, and compression.
  • It is based on the masked prediction problem of predicting K-means cluster assignments of masked segments of continuous input.
  • On both the Librispeech 960 hours and the 60,000-hour Libri-light pre-training setups, HuBERT matches or outperforms SOTA systems over all fine-tuning subsets of 10 mins, 1h, 10h, 100h, and 960h. Furthermore, the learned representation quality improves dramatically when the K-means cluster assignments are iteratively refined using latent representations learned in a previous iteration. HuBERT scales well to a 1B-parameter transformer model, showing a relative WER reduction of up to 13% on the test-other subset.
  • Facebook AI link.
Deep Spoken Keyword Spotting: An Overview
  • This paper by Lopez-Espejo et al. from Aalborg University, UT Dallas and Oticon in 2021 conducts a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS.
BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization for a Variable Number of Speakers
  • End-to-end neural diarization (EEND) with self-attention is one of the approaches that aim to model the joint speech activity of multiple speakers. It integrates voice activity and overlap detection with speaker tracking in end-to-end fashion. Moreover, it directly minimizes diarization errors and has demonstrated excellent diarization accuracy on two-speaker telephone conversations. However, EEND as originally formulated is limited to a fixed number of speakers because the output dimension of the neural network needs to be prespecified. Several methods have been proposed recently to overcome the limitations of EEND. One approach uses a speaker-wise chain rule to decode a speaker-specific speech activity iteratively conditioned on previously estimated speech activities. Another approach proposes an encoder/decoder-based attractor calculation. The embeddings of multiple speakers are accumulated over the time course of the audio input, and then disentangled one-by-one, for speaker identity assignment by speech frame. However, all these state-of-the-art EEND methods only work in an offline manner, which means that the complete recording must be available before diarization output is generated. This makes their application impractical for settings where potentially long multi-speaker recordings need to be processed incrementally (in streaming fashion).
  • This paper by Han et al. from Amazon in 2021 proposes a novel method to perform EEND in a blockwise online fashion so that speaker identities are tracked with low latency soon after new audio arrives, without much degradation in accuracy compared to the offline system. They utilize the incremental Transformer encoder, which attends only to its left contexts and ignores its right contexts, thus enabling blockwise online processing. Furthermore, the incremental Transformer encoder uses block-level recurrence in the hidden states to carry over information block by block, reducing computation time while attending to previous blocks. To their knowledge, theirs is the first method that uses the incremental Transformer encoder with block-level recurrence to enable online speaker diarization.
  • They present a novel online end-to-end neural diarization system, BW-EDA-EEND, that processes data incrementally for a variable number of speakers. The system is based on the Encoder-Decoder-Attractor (EDA) architecture of Horiguchi et al., but utilizes the incremental Transformer encoder, attending only to its left contexts and using block-level recurrence in the hidden states to carry information from block to block, making the algorithm complexity linear in time. They propose two variants: For unlimited-latency BW-EDA-EEND, which processes inputs in linear time, they show only moderate degradation for up to two speakers using a context size of 10 seconds compared to offline EDA-EEND. With more than two speakers, the accuracy gap between online and offline grows, but the algorithm still outperforms a baseline offline clustering diarization system for one to four speakers with unlimited context size, and shows comparable accuracy with a context size of 10 seconds. For limited-latency BW-EDA-EEND, which produces diarization outputs block-by-block as audio arrives, they show accuracy comparable to the offline clustering-based system.
Attentive Contextual Carryover For Multi-turn End-to-end Spoken Language Understanding
  • This paper by Wei et al. from Amazon in ASRU 2021 proposes a novel E2E SLU approach where a multi-head gated attention mechanism is introduced to effectively incorporate the dialogue history in a multi-turn E2E SLU system.
  • They propose a multi-head gated attention mechanism as a context combiner which combines the context encodings consisting of dialogue acts and previous utterances to create the final context vectors that are fed into the model. They explore different ways to combine the context encodings into the model: (i) averaged contextual carryover, (ii) attentive contextual carryover, and (iii) gated attentive contextual carryover. Gated attentive contextual carryover performed better than traditional multi-head attention and a simple average.
  • The attention-based context can be integrated at different layers of a neural E2E SLU model such as the speech encoder stage and the ASR-NLU hidden interface, and the shared context ingestion which integrates context into both acoustic embeddings and the ASR-NLU interface. The shared context ingestion approach gave the biggest improvement compared to the other schemes.
  • They built contextual E2E SLU models based on the Recurrent Neural Network Transducer (RNN-T) as well as the Transformer Transducer (T-T). E2E SLU models share an audio encoder network that encodes log-filterbank energy (LFBE) features, a prediction network that encodes a sequence of predicted wordpieces, a joint network that combines the encoder and the prediction network, and an NLU tagger that predicts intents and slots. The intent tagger contains two feedforward layers before projecting into the number of intents, and the slot tagger directly takes the output embeddings from the NLU tagger and projects them into the slot size. The audio encoder in the E2E T-T SLU and E2E RNN-T SLU are Transformer layers (with 4 attention heads) and LSTM layers, respectively.
  • The models are trained and evaluated on an internal industrial voice assistant (IVA) dataset and a synthetic and publicly available multi-turn E2E SLU (Syn-Multi) dataset. They utilize SpecAugment to augment audio feature inputs.
  • The proposed approach significantly improves E2E SLU accuracy on the internal industrial voice assistant and publicly available datasets compared to the non-contextual E2E SLU models.
SmallER: Scaling Neural Entity Resolution for Edge Devices
  • This paper by McGowan et al. from Amazon in Interspeech 2021 introduces SmallER, a scalable neural entity resolution system capable of running directly on edge devices.
  • SmallER addresses constraints imposed by the on-device setting, such as bounded memory consumption for both model and catalog storage, limited compute resources, and the related latency challenges introduced by those restrictions. Their model offers a small-footprint neural architecture capable of learning syntactic and semantic information simultaneously using distinct modules, and is trained to handle multiple domains within one compact architecture (i.e., one model for all domains).
  • They use compressed tries to reduce the space required to store catalogs on device. They also propose a novel implementation of spatial partitioning trees which at inference time strikes a balance between reducing runtime latency (by reducing the search space) and preserving recall relative to a full/exhaustive catalog search.
  • They utilize Quantization Aware Training (QAT) to train SmallER. The final model consumes only 3MB of memory at inference time with classification accuracy surpassing that of previously established, domain-specific baseline models on live customer utterances. Furthermore, catalog entries are compressed overall by a factor of 2.5x.
  • For the largest catalogs they consider (300 or more entries), their proxy metric for runtime latency is reduced by more than 90%.
Leveraging Multilingual Neural Language Models for On-Device Natural Language Understanding
  • This paper by Tu et al. from Amazon in the 2021 Web Conference Workshop on Multilingual Search investigates learning multi-lingual/cross-lingual representations as an approach to increase the accuracy of on-device multilingual models without increasing their footprint relative to monolingual models, appropriate for deployment on edge devices.
  • They show that cross-lingual representations can help improve NLU performance in both monolingual and multilingual settings. In particular, they show that the performance improvements for non-English monolingual NLU models are higher when they are seeded with cross-lingual representations, as compared to seeding with monolingual representations. Further, multilingual experiments suggest that the scarcer the available data-resources, the more beneficial it is to use cross-lingual representations.
Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models
  • All-neural end-to-end (E2E) Spoken Language Understanding (SLU) models can improve performance over traditional compositional SLU models, but have the challenge of requiring high-quality training data with both audio and annotations. In particular they struggle with performance on “golden utterances”, which are essential for defining and supporting features, but may lack sufficient training data.
  • This paper by Nicolich-Henkin et al. from Amazon in NeurIPS 2021 compares two data-centric AI methods to improve performance on golden utterances: improving the annotation quality of existing training utterances and augmenting the training data with varying amounts of synthetic data.
  • Their experimental results show improvements with both methods, and in particular that augmenting with synthetic data is effective in addressing errors caused by both inconsistent training data annotations as well as lack of training data. In other words, both data-centric approaches to improving E2E SLU achieved the desired effect, although data augmentation was much more powerful than annotation standardization. This method leads to improvement in intent recognition error rate (IRER) on their golden utterance test set by 93% relative to the baseline without seeing a negative impact on other test metrics.

2022

Adaptive Global-Local Context Fusion for Multi-Turn Spoken Language Understanding
  • This paper by Tran et al. from Amazon in AAAI 2022 tackles the problem of multi-turn Spoken Language Understanding (SLU), where dialogue contexts are used to guide intent classification and slot filling. They propose a novel contextual SLU model for multi-turn intent classification and slot filling tasks that aims to selectively incorporate dialogue contexts, such as previous utterances and dialogue acts for multi-turn SLU.
  • They introduce an adaptive global-local context fusion mechanism to selectively integrate dialogue contexts into their model. The local context fusion aligns each dialogue context using multi-head attention, while the global context fusion measures overall context contribution to intent classification and slot filling tasks.
  • The models are trained and evaluated on the publicly-available Sim-R and Sim-M datasets and an internal in-house dataset.
  • Experiments show that on two benchmark datasets, their model achieves absolute F1 score improvements of 2.73% and 2.57% for the slot filling task on Sim-R and Sim-M datasets, respectively.
  • Ablation studies indicate that dialogue history contexts play a crucial role in improving SLU task in the multi-turn dialogue setting.

Multimodal

2017

Axiomatic Attribution for Deep Networks
  • This paper by Sundararajan et al. from Google in ICML 2017 studies the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works.
  • They identify two fundamental axioms — Sensitivity and Implementation Invariance — that attribution methods ought to satisfy. They show that these are not satisfied by most known attribution methods, which they consider to be a fundamental weakness of those methods.
  • They use the axioms to guide the design of a new attribution method called Integrated Gradients.
  • Their method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator (see the sketch after this entry).
  • Since this method is modality-agnostic, they apply it to a couple of image models, a couple of text models, and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
  • Since integrated gradients add up to the final prediction score, the magnitudes can be used to account for the contributions of each feature. For instance, for an example molecule in the paper, atom-pairs that have a bond between them cumulatively contribute 46% of the prediction score, while all other pairs cumulatively contribute only −3%.
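  • A minimal sketch of Integrated Gradients via a Riemann-sum approximation of the path integral from a baseline to the input; the zero baseline, step count, and the assumption that the model returns class logits are illustrative choices.
```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    # x: a single input tensor, e.g. (features,) or (C, H, W); target: class index
    baseline = torch.zeros_like(x) if baseline is None else baseline
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        logits = model(point.unsqueeze(0))       # add a batch dimension
        logits[0, target].backward()             # gradient of the target class score
        grads.append(point.grad.detach())
    avg_grad = torch.stack(grads).mean(dim=0)    # Riemann approximation of the path integral
    # Attributions sum (approximately) to model(x)[target] - model(baseline)[target].
    return (x - baseline) * avg_grad
```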

2021

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
  • In typical unsupervised training of controllable generative models, there exists a training-inference mismatch: during training, the same sample is used as both the content input and the style input, whereas during inference, the content and style inputs come from different samples, i.e., the reference style sample contains different content than the target content. This mismatch leads to incorrect content generation during inference.
  • This paper by Chang et al. from Apple in 2021 presented a simple but effective technique, style equalization, to deal with this training-inference mismatch when controllable auto-regressive models are learned in an unsupervised manner. Style equalization takes unpaired samples as input during both training and inference and transforms the style of sample B to that of A by estimating their style difference.
  • The model is trained using tuples \((x_i, c_i)\), where \(x_i\) is the style sample and \(c_i\) is the content sample.
  • If a generative model learns to utilize the content information in the style example, during inference the generative model will generate wrong content. This phenomenon is called content leakage.
  • Instead of directly using sample B as style (in which case there is no ground truth), they jointly learn a style transformation function (using CNNs + Multihead attention), which estimates the style difference between A and B and transforms the style of sample B to the style of A. The generative model then takes content A and the transformation output (that contains the style of A) to reconstruct sample A. The proposed method enables us to use sample A as the ground truth while learning in the non-parallel setting. During inference given arbitrary content A and reference sample B, they turn off the style transformation (since by construction, the style difference is zero), and thus the output sample contains content A and style of B.
  • The proposed method is general and can be applied to different sequence signals. They apply the proposed method on two signal domains, speech and online handwriting, and evaluate the performance carefully via quantitative evaluation (by computing content error rates) and conducting qualitative user studies. Their results show that the proposed method outperforms state-of-the-art methods, including those having access to additional style supervision like speaker labels. Both quantitative and qualitative results show that their model achieves near-real content accuracy and style reproduction.
  • Note that for style equalization to be successful, the style transformation function \(M(\cdot)\) should not transfer any content-related information (e.g., copy the entire sequence) from \(x\) but only its style information, so that the decoder will utilize the transferred style and rely on the provided content input to generate the output. Therefore, the design of \(M\) is critical.
  • They evaluate the proposed method on two multi-speaker speech datasets. VCTK dataset (Yamagishi et al., 2019) contains 110 speakers and 44 hours of speech, and LibriTTS dataset (Zen et al., 2019) contains 2,311 speakers and 555 hours of speech in the training set.

2022

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
  • This paper by Shi et al. from FB Research in 2022 introduces Audio-Visual Hidden Unit BERT (AV-HuBERT), which exploits the fact that video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip movements and the produced sound.
  • AV-HuBERT is a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. It learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition.
  • On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using their audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%).
  • Code and models are available here.
  • Facebook AI article.

Core ML

2018

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
  • This article by Raschka from UW-Madison in 2018 reviews different techniques that can be used for model evaluation, model selection, and algorithm selection.
  • Each technique is discussed and its pros and cons are weighed with supporting examples. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning.
  • Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and \(k\)-fold cross-validation are reviewed, the bias-variance trade-off for choosing \(k\) is discussed, and practical tips for the optimal choice of \(k\) are given based on empirical evidence.
  • Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed.
  • Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.
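  • As an illustration of nested cross-validation for algorithm and hyperparameter selection, the scikit-learn sketch below uses an inner loop for tuning and an outer loop for an unbiased performance estimate; the dataset, estimator, and parameter grid are illustrative choices.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # performance estimation

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# The outer loop scores a model whose hyperparameters are tuned only on inner folds,
# so the reported estimate is not biased by the selection procedure.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())
```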