Background

  • There are three common types of generative models, GAN, VAE, and Flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own. GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAE relies on a surrogate loss. Flow models have to use specialized architectures to construct reversible transform.
  • Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
  • The idea of diffusion for generative modeling was actually already introduced in Sohl-Dickstein et al., 2015. However, it took until Song et al., 2019 at Stanford University, and then Ho et al., 2020 at Google Brain who independently improved the approach.
  • The following diagram from Lilian Weng’s blog provides an overview of the different types of generative models:

Overview

  • The meteoric rise of diffusion models is one of the biggest developments in Machine Learning in the past several years.
  • Diffusion models are generative models which have been gaining significant popularity in the past several years, and for good reason. A handful of seminal papers released in the 2020s alone have shown the world what Diffusion models are capable of, such as beating GANs on image synthesis. Most recently, Diffusion models were used in DALL-E 2, OpenAI’s image generation model (image below generated using DALL-E 2).

  • Given the recent wave of success by Diffusion models, many Machine Learning practitioners are surely interested in their inner workings.
  • In this article, we will examine the theoretical foundations for Diffusion models, and then demonstrate how to generate images with a diffusion model in PyTorch. Let’s dive in!

Introduction

  • Diffusion probabilistic models (also simply called diffusion models) are generative models, meaning that they are used to generate data similar to the data on which they are trained. As the name suggests, generative models are used to generate new data, for e.g., they can generate new photos of animals that look like real animals whereas discriminative models could tell apart a cat from a dog.

  • Fundamentally, diffusion models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. In other words, diffusion models are parameterized Markov chains models trained to gradually denoise data. After training, we can use the diffusion model to generate data by simply passing randomly sampled noise through the learned denoising process. In other words, diffusion models undergo the process of transforming a random collection of numbers (the “latents tensor”) into a processed collection of numbers containing the right image information.
  • Diffusion Models also go by Diffusion Probabilistic Models, score-based generative models or in some contexts, have been compared to denoising autoencoders owing to similarity in behavior.
  • The following diagram shows that diffusion models can be used to generate images from noise (figure modified from source):

  • More specifically, a diffusion model is a latent variable model which maps to the latent space using a fixed Markov chain. This chain gradually adds noise to the data in order to obtain the approximate posterior \(q\left(\mathbf{x} 1: T \mid \mathbf{x}_{0}\right)\), where \(\mathbf{x} 1, \ldots, \mathbf{x} T\) are the latent variables with the same dimensionality as \(\mathbf{x}_{0}\). In the figure below, we see such a Markov chain manifested for image data.
  • The following diagram (figure modified from source):

  • Ultimately, the image is asymptotically transformed to pure Gaussian noise. The goal of training a diffusion model is to learn the reverse process - i.e. training \(p_{\theta}\left(x_{t-1} \mid x_{t}\right)\). By traversing backwards along this chain, we can generate new data, as shown below (figure modified from source).

  • Under-the-hood, diffusion Models define a Markov chain of diffusion steps that add random noise to the data and then learn to reverse the diffusion process in order to create the desired data output from the noise. This can be seen in the image below:
    • Recall that a Markov chain is a stochastic model that describes a sequence of possible events where the probability of each event only depends on the state of the previous event. Markov chains are used to calculate the probability of an event occurring by considering it as a state transitioning to another state or a state transitioning to the same state as before. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed.
  • Key takeaway
    • In a nutshell, diffusion models are constructed by first describing a procedure for gradually turning data into noise, and then training a neural network that learns to invert this procedure step-by-step. Each of these steps consists of taking a noisy input and making it slightly less noisy, by filling in some of the information obscured by the noise. If you start from pure noise and do this enough times, it turns out you can generate data this way! (source)

Advantages

  • Diffusion probabilistic models are latent variable models capable to synthesize high quality images. As mentioned above, research into diffusion models has exploded in recent years. Inspired by non-equilibrium thermodynamics, diffusion models currently produce State-of-the-Art image quality, examples of which can be seen below (figure adapted from source):

  • Beyond cutting-edge image quality, diffusion models come with a host of other benefits, including not requiring adversarial training. The difficulties of adversarial training are well-documented; and, in cases where non-adversarial alternatives exist with comparable performance and training efficiency, it is usually best to utilize them. On the topic of training efficiency, diffusion models also have the added benefits of scalability and parallelizability.
  • While diffusion models almost seem to be producing results out of thin air, there are a lot of careful and interesting mathematical choices and details that provide the foundation for these results, and best practices are still evolving in the literature. Let’s take a look at the mathematical theory underpinning diffusion models in more detail now.
  • Their performance is, allegedly, superior to recent state-of-the-art generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) in most cases.

Definitions

Diffusion Models

  • Diffusion models are neural models that model \(p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)\) and are trained end-to-end to denoise a noisy input to a continuous output such as an image/audio (similar to how GANs generate continuous outputs). Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet.
  • The following figure from the DDPM paper shows the process of a diffusion model:

Schedulers

  • Algorithm class for both inference and training. The class provides functionality to compute previous image according to alpha, beta schedule as well as predict noise for training. Examples: DDPM, DDIM, PNDM, DEIS.
  • The figure below from the DDPM paper shows the sampling and training algorithms:

Sampling and training algorithms

  • Diffusion Pipeline: End-to-end pipeline that includes multiple diffusion models, possible text encoders, super-resolution model (for high-res image generation, in case of Imagen), etc.
  • The figure below from the Imagen paper shows the overall flow of the model.

Diffusion models: A Deep Dive

General overview

  • Diffusion Models are a latent variable model that maps the latent space using the Markov chain. They essentially are made up of neural networks that learn to gradually de-noise data.
    • Note: Latent variable models aim to model the probability distribution with latent variables. Latent variables are a transformation of the data points into a continuous lower-dimensional space. (source)
    • Additionally, the latent space is simply a representation of compressed data in which similar data points are closer together in space.
    • Latent space is useful for learning data features and for finding simpler representations of data for analysis.
  • Diffusion models consist of two processes: a predefined forward diffusion and a learned reverse de-noising diffusion process.
  • In the image below, we can see the Markov chain working towards image generation. It represents the first process of forward diffusion.
  • The forward diffusion process \(q\) of our choosing, gradually adds Gaussian noise to an image, until you end up with pure noise.

  • Next, the image is asymptotically transformed to just Gaussian noise. The goal of training a diffusion model is to learn the reverse process.
  • The second process below is the learned reverse de-noising diffusion process \(p_\theta\). Here, a neural network is trained to gradually de-noise an image starting from pure noise, until you end up with an actual image.
  • Thus, we traverse backwards along this chain to generate the new data as seen below:

  • Both the forward and reverse process (both indexed with \(t\)) continue for a duration of finite time steps \(T\) (the DDPM authors use \(T\) =1000).
  • You will start off with \(t = 0\) where you will sample a real image \(x_0\) from your data distribution.
    • A quick example is say you have an image of a cat from ImageNet dataset.
  • You will then continue with the forward process and sample some noise from a Gaussian distribution at each time step \(t\).
    • This will be added to the image of the previous time step.
  • Given a sufficiently large \(T\) and a continuous process of adding noise at each time step, you will end up with \(t = T\).
  • This is also called an isotropic Gaussian distribution.

  • Below is the high-level overview of how everything runs under the hood:
    • we take a random sample \(x_0\) from the real unknown and complex data distribution \(q(x_0)\)
    • we sample a noise level \(t\) uniformly between \(1\) and \(T\) (i.e., a random time step)
    • we sample some noise from a Gaussian distribution and corrupt the input by this noise at level \(t\) (using the nice property defined above)
    • the neural network is trained to predict this noise based on the corrupted image \(x_t\) (i.e. noise applied on \(x_0\) based on known schedule \(\beta_t\)) In reality, all of this is done on batches of data, as one uses stochastic gradient descent to optimize neural networks.

The Under-the-hood Math

  • As mentioned above, a diffusion model consists of a forward process (or diffusion process), in which a datum (generally an image) is progressively noised, and a reverse process (or reverse diffusion process), in which noise is transformed back into a sample from the target distribution.

  • The sampling chain transitions in the forward process can be set to conditional Gaussians when the noise level is sufficiently low. Combining this fact with the Markov assumption leads to a simple parameterization of the forward process:

    \[q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_{0}\right):=\prod_{t=1}^{T} q\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right):=\prod_{t=1}^{T} \mathcal{N}\left(\mathbf{x}_{t} ; \sqrt{1-\beta_{t}} \mathbf{x}_{t-1}, \beta_{t} \mathbf{I}\right)\]
    • where \(\beta_1, \ldots, \beta_T\) is a variance schedule (either learned or fixed) which, if well-behaved, ensures that \(x_T\) is nearly an isotropic Gaussian for sufficiently large \(T\).
  • Given the Markov assumption, the joint distribution of the latent variables is the product of the Gaussian conditional chain transitions (figure modified from source).

  • As mentioned previously, the “magic” of diffusion models comes in the reverse process. During training, the model learns to reverse this diffusion process in order to generate new data. Starting with the pure Gaussian noise \(p(\mathbf{x} T):=\mathcal{N}\left(\mathbf{x}_{T}, \mathbf{0}, \mathbf{I}\right)\), the model learns the joint distribution \(p \theta(\mathbf{x} 0: T)\) as,

    \[p_{\theta}\left(\mathbf{x}_{0: T}\right):=p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} \mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \boldsymbol{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\]
    • where the time-dependent parameters of the Gaussian transitions are learned. Note in particular that the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep (or following timestep, depending on how you look at it):
\[p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \mathbf{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\]

Training

  • A diffusion model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training equivalently consists of minimizing the variational upper bound on the negative log likelihood.
\[\mathbb{E}\left[-\log p_{\theta}\left(\mathbf{x}_{0}\right)\right] \leq \mathbb{E}_{q}\left[-\log \frac{p_{\theta}\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_{0}\right)}\right]=: L_{v l b}\]
  • Notation Detail: Note that \(L_{v l b}\) is technically an upper bound (the negative of the ELBO) which we are trying to minimize, but we refer to it as \(L_{v l b}\) for consistency with the literature.

  • We seek to rewrite the \(L_{v l b}\) in terms of Kullback-Leibler (KL) Divergences. The KL Divergence is an asymmetric statistical distance measure of how much one probability distribution \(P\) differs from a reference distribution \(Q\). We are interested in formulating \(L_{v l b}\) in terms of \(KL\) divergences because the transition distributions in our Markov chain are Gaussians, and the KL divergence between Gaussians has a closed form.

Recap: KL Divergence

  • The mathematical form of the \(\mathrm{KL}\) divergence for continuous distributions is,

    \[D_{\mathrm{KL}}(P \| Q)=\int_{-\infty}^{\infty} p(x) \log \left(\frac{p(x)}{q(x)}\right) d x\]
    • Note that the double bars in the above equation indicate that the function is not symmetric with respect to its arguments.
  • Below you can see the \(K L\) divergence of a varying distribution \(P\) (blue) from a reference distribution \(Q\) (red). The green curve indicates the function within the integral in the definition for the \(\mathrm{KL}\) divergence above, and the total area under the curve represents the value of the KL divergence of \(P\) from \(Q\) at any given moment, a value which is also displayed numerically.

Casting \(L_{v l b}\) in Terms of KL Divergences

  • As mentioned previously, it is possible[1] to rewrite \(L v l b\) almost completely in terms of KL divergences:

    \[L_{v l b}=L_{0}+L_{1}+\ldots+L_{T-1}+L_{T}\]
    • where, \(\begin{gathered} L_{0}=-\log p_{\theta}\left(x_{0} \mid x_{1}\right) \\ L_{t-1}=D_{K L}\left(q\left(x_{t-1} \mid x_{t}, x_{0}\right) \| p_{\theta}\left(x_{t-1} \mid x_{t}\right)\right) \\ L_{T}=D_{K L}\left(q\left(x_{T} \mid x_{0}\right) \| p\left(x_{T}\right)\right) \end{gathered}\)
  • Conditioning the forward process posterior on \(x_{0}\) in \(L_{t-1}\) results in a tractable form that leads to all KL divergences being comparisons between Gaussians. This means that the divergences can be exactly calculated with closed-form expressions rather than with Monte Carlo estimates.

Model Choices

  • With the mathematical foundation for our objective function established, we now need to make several choices regarding how our diffusion model will be implemented. For the forward process, the only choice required is defining the variance schedule, the values of which are generally increasing during the forward process.
  • For the reverse process, we much choose the Gaussian distribution parameterization / model architecture(s). Note the high degree of flexibility that Diffusion models afford - the only requirement on our architecture is that its input and output have the same dimensionality. We will explore the details of these choices in more detail below.

Forward Process and \(L_{T}\)

  • As noted above, regarding the forward process, we must define the variance schedule. In particular, we set them to be time-dependent constants, ignoring the fact that they can be learned. For example, per Denoising Diffusion Probabilistic models, a linear schedule from \(\beta_{1}=10^{-4}\) to \(\beta_{T}=0.2\) might be used, or perhaps a geometric series. Regardless of the particular values chosen, the fact that the variance schedule is fixed results in \(L_{T}\) becoming a constant with respect to our set of learnable parameters, allowing us to ignore it as far as training is concerned.
\[L_{T}=D_{K L}\left(q\left(x_{T}+x_{0}\right) \| p\left(x_{T}\right)\right)\]

Reverse Process and \(L_{1: T-1}\)

  • Now we discuss the choices required in defining the reverse process. Recall from above we defined the reverse Markov transitions as a Gaussian:
\[p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \mathbf{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\]
  • We must now define the functional forms of \(\mu_{\theta}\) or \(\Sigma_{\theta}\). While there are more complicated ways to parameterize \(\boldsymbol{\Sigma}_\theta\), we simply set,
\[\begin{gathered} \boldsymbol{\Sigma}_{\theta}\left(x_{t}, t\right)=\sigma_{t}^{2} \mathbb{I} \\ \sigma_{t}^{2}=\beta_{t} \end{gathered}\]
  • That is, we assume that the multivariate Gaussian is a product of independent gaussians with identical variance, a variance value which can change with time. We set these variances to be equivalent to our forward process variance schedule.
  • Given this new formulation of \(\Sigma_{\theta_{1}}\), we have

    \[p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \mathbf{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \sigma_{t}^{2} \mathbf{I}\right)\right.\]
    • which allows us to transform,
    \[L_{t-1}=D_{K L}\left(q\left(x_{t-1} \mid x_{t}, x_{0}\right) \| p_{\theta}\left(x_{t-1} \mid x_{t}\right)\right)\]
    • to,
    \[L_{t-1} \propto\left\|\tilde{\mu}_{t}\left(x_{t}, x_{0}\right)-\mu_{\theta}\left(x_{t}, t\right)\right\|^{2}\]
    • where the first term in the difference is a linear combination of \(x t\) and \(x_{0}\) that depends on the variance schedule \(\beta_{t}\). The exact form of this function is not relevant for our purposes, but it can be found in Denoising Diffusion Probabilistic models. The significance of the above proportion is that the most straightforward parameterization of \(\mu_{\theta}\) simply predicts the diffusion posterior mean. Importantly, the authors of Denoising Diffusion Probabilistic models actually found that training \(\mu \theta\) to predict the noise component at any given timestep yields better results. In particular, let
    \[\boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}} \boldsymbol{\epsilon}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\]
    • where, \(\alpha_{t}:=1-\beta_{t} \quad\) and \(\quad \bar{\alpha}_{t}:=\prod_{s=1}^{t} \alpha_{s}\).
  • This leads to the following alternative loss function, which the authors of Denoising Diffusion Probabilistic models found to lead to more stable training and better results:
\[L_{\text {simple }}(\theta):=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \boldsymbol{\epsilon}, t\right)\right\|^{2}\right]\]
  • The authors of Denoising Diffusion Probabilistic models also note connections of this formulation of Diffusion models to score-matching generative models based on Langevin dynamics. Indeed, it appears that Diffusion models and ScoreBased models may be two sides of the same coin, akin to the independent and concurrent development of wave-based quantum mechanics and matrix-based quantum mechanics revealing two equivalent formulations of the same phenomena.

Network Architecture

  • Lets dive deep into the network architecture of a diffusion model.
  • Note that the only requirement for the model is that its input and output dimensionality are identical. Specifically, the neural network needs to take in a noised image at a particular time step and return the predicted noise. The predicted noise here is a tensor that has the same size and resolution as the input image. Thus, this neural networks takes in tensors and returns tensors of the same shape. Given this restriction, it is perhaps unsurprising that image diffusion models are commonly implemented with U-Net-like architectures.
  • The neural net architecture that the authors of DDPM used was U-Net.
    • This network consists of a “bottleneck” layer in the middle of its architecture in between the encoder and decoder.
    • This bottleneck ensures the network learns only the most important information.
    • The encoder first encodes an image into a smaller hidden representation called the “bottleneck” and then the decoder decodes that hidden representation back into an actual image.
    • Below is a representation of the U-Net model:

  • The U-Net model first downsamples the input, or makes the input smaller in terms of spatial resolution, and then upsamples it.

Reverse Process Decoder and \(L_{0}\)

  • The path along the reverse process consists of many transformations under continuous conditional Gaussian distributions. At the end of the reverse process, recall that we are trying to produce an image, which is composed of integer pixel values. Therefore, we must devise a way to obtain discrete (log) likelihoods for each possible pixel value across all pixels.
  • The way that this is done is by setting the last transition in the reverse diffusion chain to an independent discrete decoder. To determine the likelihood of a given image \(x 0\) given \(x 1_{1}\), we first impose independence between the data dimensions:

    \[p_{\theta}\left(x_{0} \mid x_{1}\right)=\prod_{i=1}^{D} p_{\theta}\left(x_{0}^{i} \mid x_{1}^{i}\right)\]
    • where \(D\) is the dimensionality of the data and the superscript \(i\) indicates the extraction of one coordinate. The goal now is to determine how likely each integer value is for a given pixel given the distribution across possible values for the corresponding pixel in the slightly noised image at time \(t=1\) :
    \[\mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right)\]
    • where the pixel distributions for \(t=1\) are derived from the below multivariate Gaussian whose diagonal covariance matrix allows us to split the distribution into a product of univariate Gaussians, one for each dimension of the data:
    \[\mathcal{N}\left(x ; \mu_{\theta}\left(x_{1}, 1\right), \sigma_{1}^{2} \mathbb{I}\right)=\prod_{i=1}^{D} \mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right)\]
  • We assume that the images consist of integers in \(0,1, \ldots, 255\) (as standard RGB images do) which have been scaled linearly to \([-1,1]\). We then break down the real line into small “buckets”, where, for a given scaled pixel value \(x\), the bucket for that range is \([x-1 / 255, x+1 / 255]\). The probability of a pixel value \(x\), given the univariate Gaussian distribution of the corresponding pixel in \(x 1\), is the area under that univariate Gaussian distribution within the bucket centered at \(x\).

  • Below you can see the area for each of these buckets with their probabilities for a mean-0 Gaussian which, in this context, corresponds to a distribution with an average pixel value of \(\frac{255}{2}\) (half brightness). The red curve represents the distribution of a specific pixel in the \(t=1\) image, and the areas give the probability of the corresponding pixel value in the \(t=0\) image.

  • Technical Note: The first and final buckets extend out to -inf and +inf to preserve total probability.

  • Given a \(t=0\) pixel value for each pixel, the value of \(p_{\theta}\left(x_{0} \mid x_{1}\right)\) is simply their product. Succinctly, this process is succinctly encapsulated by the following equation:

    \[p_{\theta}\left(x_{0} \mid x_{1}\right)=\prod_{i=1}^{D} p_{\theta}\left(x_{0}^{i} \mid x_{1}^{i}\right)=\prod_{i=1}^{D} \int_{\delta_{-}\left(x_{0}^{i}\right)}^{\delta_{+}\left(x_{0}^{i}\right)} \mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right) d x\]
    • where,
    \[\delta_{-}(x)= \begin{cases}-\infty & x=-1 \\ x-\frac{1}{255} & x>-1\end{cases}\]
    • and
    \[\delta_{+}(x)= \begin{cases}\infty & x=1 \\ x+\frac{1}{255} & x<1\end{cases}\]
  • Given this equation for \(p_{\theta}\left(x_{0} \mid x_{1}\right)\), we can calculate the final term of \(L_{v l b}\) which is not formulated as a \(\mathrm{KL}\) Divergence:

    \[L_{0}=-\log p_{\theta}\left(x_{0} \mid x_{1}\right)\]

Final Objective

  • As mentioned in the last section, the authors of Denoising Diffusion Probabilistic models found that predicting the noise component of an image at a given timestep produced the best results. Ultimately, they use the following objective:
\[L_{\text {simple }}(\theta):=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\epsilon-\epsilon \theta\left(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \epsilon, t\right)\right\|^{2}\right]\]
  • The training and sampling algorithms for our diffusion model therefore can be succinctly captured in the below table (from source):

Diffusion Model Theory Summary

  • In this section we took a detailed dive into the theory of Diffusion models. It can be easy to get caught up in mathematical details, so we note the most important points within this section below in order to keep ourselves oriented from a birds-eye perspective:

    1. Our diffusion model is parameterized as a Markov chain, meaning that our latent variables \(x_1, \ldots, x_T\) depend only on the previous (or following) timestep.
    2. The transition distributions in the Markov chain are Gaussian, where the forward process requires a variance schedule, and the reverse process parameters are learned.
    3. The diffusion process ensures that \(x T\) is asymptotically distributed as an isotropic Gaussian for sufficiently large \(\mathrm{T}\).
    4. In our case, the variance schedule was fixed, but it can be learned as well. For fixed schedules, following a geometric progression may afford better results than a linear progression. In either case, the variances are generally increasing with time in the series (i.e. \(\beta_{i}<\beta_{j}\) for \(i<j\) ).
    5. Diffusion models are highly flexible and allow for any architecture whose input and output dimensionality are the same to be used. Many implementations use U-Net-like architectures.
    6. The training objective is to maximize the likelihood of the training data. This is manifested as tuning the model parameters to minimize the variational upper bound of the negative log likelihood of the data.
    7. Almost all terms in the objective function can be cast as KL Divergences as a result of our Markov assumption. These values become tenable to calculate given that we are using Gaussians, therefore omitting the need to perform Monte Carlo approximation.
    8. Ultimately, using a simplified training objective to train a function which predicts the noise component of a given latent variable yields the best and most stable results.
    9. A discrete decoder is used to obtain log likelihoods across pixel values as the last step in the reverse diffusion process.
  • With this high-level overview of Diffusion models in our minds, let’s move on to see how to use a Diffusion models in PyTorch.

Diffusion models in PyTorch

Implementing the original paper

Pre-requisites: Setup and Importing Libraries

  • Let’s start with the setup and importing all the required libraries:
from IPython.display import Image
Image(filename='assets/78_annotated-diffusion/ddpm_paper.png')

!pip install -q -U einops datasets matplotlib tqdm

import math
from inspect import isfunction
from functools import partial

%matplotlib inline
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from einops import rearrange

import torch
from torch import nn, einsum
import torch.nn.functional as F

Helper functions

  • Now let’s implement the neural network we have looked at earlier. First we start with a few helper functions.
  • Most notably, we define Residual class which will add the input to the output of a particular function. That is, it adds a residual connection to a particular function.
def exists(x):
    return x is not None

def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, *args, **kwargs):
        return self.fn(x, *args, **kwargs) + x

def Upsample(dim):
    return nn.ConvTranspose2d(dim, dim, 4, 2, 1)

def Downsample(dim):
    return nn.Conv2d(dim, dim, 4, 2, 1)
  • Note: the parameters of the neural network are shared across time (noise level).
  • Thus, for the neural network to keep track of which time step (noise level) it is on, the authors used sinusoidal position embeddings to encode \(t\).
  • The SinusoidalPositionEmbeddings class, that we have defined below, takes a tensor of shape (batch_size,1) as input or the noise levels in a batch.
  • It will then turn this input tensor into a tensor of shape (batch_size, dim) where dim$ is the dimensionality of the position embeddings.
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

Model Core: ResNet or ConvNeXT

  • Now we will look at the meat or the core part of our U-Net model. The original DDPM authors employed a Wide ResNet block via Zagoruyko et al., 2016, however Phil Wang has also introduced support for ConvNeXT block via Liu et al., 2022.
  • You are free to choose either or in your final U-Net architecture but both are provided below:
class Block(nn.Module):
    def __init__(self, dim, dim_out, groups = 8):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim_out, 3, padding = 1)
        self.norm = nn.GroupNorm(groups, dim_out)
        self.act = nn.SiLU()

    def forward(self, x, scale_shift = None):
        x = self.proj(x)
        x = self.norm(x)

        if exists(scale_shift):
            scale, shift = scale_shift
            x = x * (scale + 1) + shift

        x = self.act(x)
        return x

class ResnetBlock(nn.Module):
    """https://arxiv.org/abs/1512.03385"""
    
    def __init__(self, dim, dim_out, *, time_emb_dim=None, groups=8):
        super().__init__()
        self.mlp = (
            nn.Sequential(nn.SiLU(), nn.Linear(time_emb_dim, dim_out))
            if exists(time_emb_dim)
            else None
        )

        self.block1 = Block(dim, dim_out, groups=groups)
        self.block2 = Block(dim_out, dim_out, groups=groups)
        self.res_conv = nn.Conv2d(dim, dim_out, 1) if dim != dim_out else nn.Identity()

    def forward(self, x, time_emb=None):
        h = self.block1(x)

        if exists(self.mlp) and exists(time_emb):
            time_emb = self.mlp(time_emb)
            h = rearrange(time_emb, "b c -> b c 1 1") + h

        h = self.block2(h)
        return h + self.res_conv(x)
    
class ConvNextBlock(nn.Module):
    """https://arxiv.org/abs/2201.03545"""

    def __init__(self, dim, dim_out, *, time_emb_dim=None, mult=2, norm=True):
        super().__init__()
        self.mlp = (
            nn.Sequential(nn.GELU(), nn.Linear(time_emb_dim, dim))
            if exists(time_emb_dim)
            else None
        )

        self.ds_conv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)

        self.net = nn.Sequential(
            nn.GroupNorm(1, dim) if norm else nn.Identity(),
            nn.Conv2d(dim, dim_out * mult, 3, padding=1),
            nn.GELU(),
            nn.GroupNorm(1, dim_out * mult),
            nn.Conv2d(dim_out * mult, dim_out, 3, padding=1),
        )

        self.res_conv = nn.Conv2d(dim, dim_out, 1) if dim != dim_out else nn.Identity()

    def forward(self, x, time_emb=None):
        h = self.ds_conv(x)

        if exists(self.mlp) and exists(time_emb):
            condition = self.mlp(time_emb)
            h = h + rearrange(condition, "b c -> b c 1 1")

        h = self.net(h)
        return h + self.res_conv(x)

Attention

  • Next, we will look into defining the attention module which was added between the convolutional blocks in DDPM.
  • Phil Wang added two variants of attention, a normal multi-headed self-attention from the original Transformer paper (Vaswani et al.,2017), and linear attention variant (Shen et al., 2018).
    • Linear attention variant’s time and memory requirements scale linear in the sequence length, as opposed to quadratic for regular attention.
class Attention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=32):
        super().__init__()
        self.scale = dim_head**-0.5
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
        self.to_out = nn.Conv2d(hidden_dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=1)
        q, k, v = map(
            lambda t: rearrange(t, "b (h c) x y -> b h c (x y)", h=self.heads), qkv
        )
        q = q * self.scale

        sim = einsum("b h d i, b h d j -> b h i j", q, k)
        sim = sim - sim.amax(dim=-1, keepdim=True).detach()
        attn = sim.softmax(dim=-1)

        out = einsum("b h i j, b h d j -> b h i d", attn, v)
        out = rearrange(out, "b h (x y) d -> b (h d) x y", x=h, y=w)
        return self.to_out(out)

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=32):
        super().__init__()
        self.scale = dim_head**-0.5
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)

        self.to_out = nn.Sequential(nn.Conv2d(hidden_dim, dim, 1), 
                                    nn.GroupNorm(1, dim))

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=1)
        q, k, v = map(
            lambda t: rearrange(t, "b (h c) x y -> b h c (x y)", h=self.heads), qkv
        )

        q = q.softmax(dim=-2)
        k = k.softmax(dim=-1)

        q = q * self.scale
        context = torch.einsum("b h d n, b h e n -> b h d e", k, v)

        out = torch.einsum("b h d e, b h d n -> b h e n", context, q)
        out = rearrange(out, "b h c (x y) -> b (h c) x y", h=self.heads, x=h, y=w)
        return self.to_out(out)
  • DDPM then adds group normalization to interleave the convolutional/attention layers of the U-Net architecture.
  • Below, the PreNorm class will apply group normalization before the attention layer.
    • Note, there has been a debate about whether groupnorm is better to be applied before or after attention in Transformers.
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.fn = fn
        self.norm = nn.GroupNorm(1, dim)

    def forward(self, x):
        x = self.norm(x)
        return self.fn(x)

Overall network

  • Now that we have all the building blocks of the neural network (ResNet/ConvNeXT blocks, attention, positional embeddings, group norm), lets define our entire neural network.
  • The task of this neural network is to take in a batch of noisy images and their noise levels and then to output the noise added to the input.
  • The network takes a batch of noisy images of shape (batch_size, num_channels, height, width) and a batch of noise levels of shape (batch_size, 1) as input, and returns a tensor of shape (batch_size, num_channels, height, width).
  • The network is built up as follows: (source)
    • first, a convolutional layer is applied on the batch of noisy images, and position embeddings are computed for the noise levels
    • next, a sequence of downsampling stages are applied.
      • Each downsampling stage consists of two ResNet/ConvNeXT blocks + groupnorm + attention + residual connection + a downsample operation
    • at the middle of the network, again ResNet or ConvNeXT blocks are applied, interleaved with attention
    • next, a sequence of upsampling stages are applied.
      • Each upsampling stage consists of two ResNet/ConvNeXT blocks + groupnorm + attention + residual connection + an upsample operation
    • finally, a ResNet/ConvNeXT block followed by a convolutional layer is applied.
class Unet(nn.Module):
    def __init__(
        self,
        dim,
        init_dim=None,
        out_dim=None,
        dim_mults=(1, 2, 4, 8),
        channels=3,
        with_time_emb=True,
        resnet_block_groups=8,
        use_convnext=True,
        convnext_mult=2,
    ):
        super().__init__()

        # determine dimensions
        self.channels = channels

        init_dim = default(init_dim, dim // 3 * 2)
        self.init_conv = nn.Conv2d(channels, init_dim, 7, padding=3)

        dims = [init_dim, *map(lambda m: dim * m, dim_mults)]
        in_out = list(zip(dims[:-1], dims[1:]))
        
        if use_convnext:
            block_klass = partial(ConvNextBlock, mult=convnext_mult)
        else:
            block_klass = partial(ResnetBlock, groups=resnet_block_groups)

        # time embeddings
        if with_time_emb:
            time_dim = dim * 4
            self.time_mlp = nn.Sequential(
                SinusoidalPositionEmbeddings(dim),
                nn.Linear(dim, time_dim),
                nn.GELU(),
                nn.Linear(time_dim, time_dim),
            )
        else:
            time_dim = None
            self.time_mlp = None

        # layers
        self.downs = nn.ModuleList([])
        self.ups = nn.ModuleList([])
        num_resolutions = len(in_out)

        for ind, (dim_in, dim_out) in enumerate(in_out):
            is_last = ind >= (num_resolutions - 1)

            self.downs.append(
                nn.ModuleList(
                    [
                        block_klass(dim_in, dim_out, time_emb_dim=time_dim),
                        block_klass(dim_out, dim_out, time_emb_dim=time_dim),
                        Residual(PreNorm(dim_out, LinearAttention(dim_out))),
                        Downsample(dim_out) if not is_last else nn.Identity(),
                    ]
                )
            )

        mid_dim = dims[-1]
        self.mid_block1 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim)
        self.mid_attn = Residual(PreNorm(mid_dim, Attention(mid_dim)))
        self.mid_block2 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim)

        for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
            is_last = ind >= (num_resolutions - 1)

            self.ups.append(
                nn.ModuleList(
                    [
                        block_klass(dim_out * 2, dim_in, time_emb_dim=time_dim),
                        block_klass(dim_in, dim_in, time_emb_dim=time_dim),
                        Residual(PreNorm(dim_in, LinearAttention(dim_in))),
                        Upsample(dim_in) if not is_last else nn.Identity(),
                    ]
                )
            )

        out_dim = default(out_dim, channels)
        self.final_conv = nn.Sequential(
            block_klass(dim, dim), nn.Conv2d(dim, out_dim, 1)
        )

    def forward(self, x, time):
        x = self.init_conv(x)

        t = self.time_mlp(time) if exists(self.time_mlp) else None

        h = []

        # downsample
        for block1, block2, attn, downsample in self.downs:
            x = block1(x, t)
            x = block2(x, t)
            x = attn(x)
            h.append(x)
            x = downsample(x)

        # bottleneck
        x = self.mid_block1(x, t)
        x = self.mid_attn(x)
        x = self.mid_block2(x, t)

        # upsample
        for block1, block2, attn, upsample in self.ups:
            x = torch.cat((x, h.pop()), dim=1)
            x = block1(x, t)
            x = block2(x, t)
            x = attn(x)
            x = upsample(x)

        return self.final_conv(x)
  • Note: by default, the noise predictor uses ConvNeXT blocks (as use_convnext is set to True) and position embeddings are added (as with_time_emb is set to True).

Forward diffusion

  • Now lets take a look at the forward diffusion process. Remember forward diffusion process will gradually add noise to an image withing a number of time steps \(T\).
def cosine_beta_schedule(timesteps, s=0.008):
    """
    cosine schedule as proposed in https://arxiv.org/abs/2102.09672
    """
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

def linear_beta_schedule(timesteps):
    beta_start = 0.0001
    beta_end = 0.02
    return torch.linspace(beta_start, beta_end, timesteps)

def quadratic_beta_schedule(timesteps):
    beta_start = 0.0001
    beta_end = 0.02
    return torch.linspace(beta_start**0.5, beta_end**0.5, timesteps) ** 2

def sigmoid_beta_schedule(timesteps):
    beta_start = 0.0001
    beta_end = 0.02
    betas = torch.linspace(-6, 6, timesteps)
    return torch.sigmoid(betas) * (beta_end - beta_start) + beta_start
  • To start with, let’s use the linear schedule for \(T=200\) time steps and define the various variables from the \(\beta_t\) which we will need, such as the cumulative product of the variances \(\bar{\alpha}_t\)
  • Each of the variables below are just 1-dimensional tensors, storing values from \(t\) to \(T\).
  • Importantly, we also define an extract function, which will allow us to extract the appropriate \(t\) index for a batch of indices. (source)
timesteps = 200

# define beta schedule
betas = linear_beta_schedule(timesteps=timesteps)

# define alphas 
alphas = 1. - betas
alphas_cumprod = torch.cumprod(alphas, axis=0)
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)

# calculations for diffusion q(x_t | x_{t-1}) and others
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod)

# calculations for posterior q(x_{t-1} | x_t, x_0)
posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)

def extract(a, t, x_shape):
    batch_size = t.shape[0]
    out = a.gather(-1, t.cpu())
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)
  • Now let’s take an image and illustrate how noise is added at each time step of the diffusion process to the PyTorch tensors:
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image

  • We first normalize images by dividing by 255 (such that they are in the [0,1] range), and then make sure they are in the [-1, 1] range.
from torchvision.transforms import Compose, ToTensor, Lambda, ToPILImage, CenterCrop, Resize

image_size = 128
transform = Compose([
    Resize(image_size),
    CenterCrop(image_size),
    ToTensor(), # turn into Numpy array of shape HWC, divide by 255
    Lambda(lambda t: (t * 2) - 1),
    
])

x_start = transform(image).unsqueeze(0)
x_start.shape
Output:
----------------------------------------------------------------------------------------------------
torch.Size([1, 3, 128, 128])
  • We also define the reverse transform, which takes in a PyTorch tensor containing values in [-1, 1] and turn them back into a PIL image:
import numpy as np

reverse_transform = Compose([
     Lambda(lambda t: (t + 1) / 2),
     Lambda(lambda t: t.permute(1, 2, 0)), # CHW to HWC
     Lambda(lambda t: t * 255.),
     Lambda(lambda t: t.numpy().astype(np.uint8)),
     ToPILImage(),
])
  • Let’s run an example and see what it produces:
reverse_transform(x_start.squeeze())

  • We can now define the forward diffusion process as in the paper:
# forward diffusion (using the nice property)
def q_sample(x_start, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x_start)

    sqrt_alphas_cumprod_t = extract(sqrt_alphas_cumprod, t, x_start.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        sqrt_one_minus_alphas_cumprod, t, x_start.shape
    )

    return sqrt_alphas_cumprod_t * x_start + sqrt_one_minus_alphas_cumprod_t * noise
  • Let’s test it on a particular time step and see the image it produces:
def get_noisy_image(x_start, t):
  # add noise
  x_noisy = q_sample(x_start, t=t)

  # turn back into PIL image
  noisy_image = reverse_transform(x_noisy.squeeze())

  return noisy_image
# take time step
t = torch.tensor([40])

get_noisy_image(x_start, t)

  • We can see the image is getting more noisy. Now let’s zoom out a bit and visualize this for various time steps:
import matplotlib.pyplot as plt

# use seed for reproducability
torch.manual_seed(0)

# source: https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py
def plot(imgs, with_orig=False, row_title=None, **imshow_kwargs):
    if not isinstance(imgs[0], list):
        # Make a 2d grid even if there's just 1 row
        imgs = [imgs]

    num_rows = len(imgs)
    num_cols = len(imgs[0]) + with_orig
    fig, axs = plt.subplots(figsize=(200,200), nrows=num_rows, ncols=num_cols, squeeze=False)
    for row_idx, row in enumerate(imgs):
        row = [image] + row if with_orig else row
        for col_idx, img in enumerate(row):
            ax = axs[row_idx, col_idx]
            ax.imshow(np.asarray(img), **imshow_kwargs)
            ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])

    if with_orig:
        axs[0, 0].set(title='Original image')
        axs[0, 0].title.set_size(8)
    if row_title is not None:
        for row_idx in range(num_rows):
            axs[row_idx, 0].set(ylabel=row_title[row_idx])

    plt.tight_layout()
plot([get_noisy_image(x_start, torch.tensor([t])) for t in [0, 50, 100, 150, 199]])

  • As we can see above, the image going through forward diffusion is definitely becoming more apparent.
  • Thus, we can now move on to defining our loss function. The denoise_model will be our U-Net defined above. We’ll employ the Huber loss between the true and the predicted noise.
def p_losses(denoise_model, x_start, t, noise=None, loss_type="l1"):
    if noise is None:
        noise = torch.randn_like(x_start)

    x_noisy = q_sample(x_start=x_start, t=t, noise=noise)
    predicted_noise = denoise_model(x_noisy, t)

    if loss_type == 'l1':
        loss = F.l1_loss(noise, predicted_noise)
    elif loss_type == 'l2':
        loss = F.mse_loss(noise, predicted_noise)
    elif loss_type == "huber":
        loss = F.smooth_l1_loss(noise, predicted_noise)
    else:
        raise NotImplementedError()

    return loss

Dataset

  • Let’s now look into loading up our dataset. A quick note, our dataset needs to make sure all images are resized to the same size.
  • Hugging Face’s fashion_mnist dataset which we will use in this example already does that for us with all images having a same resolution of \(28 \times 28\).
from datasets import load_dataset

# load dataset from the hub
dataset = load_dataset("fashion_mnist")
image_size = 28
channels = 1
batch_size = 128
  • Now, we will define a function transforms which we’ll apply on-the-fly on the entire dataset.
  • The function just applies some basic image preprocessing: random horizontal flips, rescaling and finally make them have values in the [-1,1] range.
from torchvision import transforms
from torch.utils.data import DataLoader

# define image transformations (e.g. using torchvision)
transform = Compose([
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Lambda(lambda t: (t * 2) - 1)
])

# define function
def transforms(examples):
   examples["pixel_values"] = [transform(image.convert("L")) for image in examples["image"]]
   del examples["image"]

   return examples

transformed_dataset = dataset.with_transform(transforms).remove_columns("label")

# create dataloader
dataloader = DataLoader(transformed_dataset["train"], batch_size=batch_size, shuffle=True)
batch = next(iter(dataloader))
print(batch.keys())
Output:
----------------------------------------------------------------------------------------------------
dict_keys(['pixel_values'])

Sampling during training

  • The paper also talks about sampling from the model during training in order to track progress.
  • Ideally, generating new images from a diffusion model happens by reversing the diffusion process:
    • We start from \(T\), where we sample pure noise from a Gaussian distribution
    • Then use our neural network to gradually de-noise it using the conditional probability it has learned, continuing until we end up at time step \(t\) = 0.
    • We can derive a slightly less de-noised image \(x_{(t-1)}\) by plugging in the re-parametrization of the mean, using our noise predictor.
    • Remember that the variance is known ahead of time.
  • After all of this, ideally, we end up with an image that looks like it came from the real data distribution.
  • Lets look at the code for that below:
@torch.no_grad()
def p_sample(model, x, t, t_index):
    betas_t = extract(betas, t, x.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        sqrt_one_minus_alphas_cumprod, t, x.shape
    )
    sqrt_recip_alphas_t = extract(sqrt_recip_alphas, t, x.shape)
    
    # Equation 11 in the paper
    # Use our model (noise predictor) to predict the mean
    model_mean = sqrt_recip_alphas_t * (
        x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t
    )

    if t_index == 0:
        return model_mean
    else:
        posterior_variance_t = extract(posterior_variance, t, x.shape)
        noise = torch.randn_like(x)
        # Algorithm 2 line 4:
        return model_mean + torch.sqrt(posterior_variance_t) * noise 

# Algorithm 2 (including returning all images)
@torch.no_grad()
def p_sample_loop(model, shape):
    device = next(model.parameters()).device

    b = shape[0]
    # start from pure noise (for each example in the batch)
    img = torch.randn(shape, device=device)
    imgs = []

    for i in tqdm(reversed(range(0, timesteps)), desc='sampling loop time step', total=timesteps):
        img = p_sample(model, img, torch.full((b,), i, device=device, dtype=torch.long), i)
        imgs.append(img.cpu().numpy())
    return imgs

@torch.no_grad()
def sample(model, image_size, batch_size=16, channels=3):
    return p_sample_loop(model, shape=(batch_size, channels, image_size, image_size))
  • Now, lets get to some training! We will train the model via PyTorch and occasionally save a few image samples using the sample function from above.
from pathlib import Path

def num_to_groups(num, divisor):
    groups = num // divisor
    remainder = num % divisor
    arr = [divisor] * groups
    if remainder > 0:
        arr.append(remainder)
    return arr

results_folder = Path("./results")
results_folder.mkdir(exist_ok = True)
save_and_sample_every = 1000
  • Below, we define the model, and move it to the GPU along with defining Adam, a standard optimizer.
from torch.optim import Adam

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Unet(
    dim=image_size,
    channels=channels,
    dim_mults=(1, 2, 4,)
)
model.to(device)

optimizer = Adam(model.parameters(), lr=1e-3)

Training

  • Now lets start the training process:
from torchvision.utils import save_image

epochs = 5

for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
      optimizer.zero_grad()

      batch_size = batch["pixel_values"].shape[0]
      batch = batch["pixel_values"].to(device)

      # Algorithm 1 line 3: sample t uniformally for every example in the batch
      t = torch.randint(0, timesteps, (batch_size,), device=device).long()

      loss = p_losses(model, batch, t, loss_type="huber")

      if step % 100 == 0:
        print("Loss:", loss.item())

      loss.backward()
      optimizer.step()

      # save generated images
      if step != 0 and step % save_and_sample_every == 0:
        milestone = step // save_and_sample_every
        batches = num_to_groups(4, batch_size)
        all_images_list = list(map(lambda n: sample(model, batch_size=n, channels=channels), batches))
        all_images = torch.cat(all_images_list, dim=0)
        all_images = (all_images + 1) * 0.5
        save_image(all_images, str(results_folder / f'sample-{milestone}.png'), nrow = 6)

Output:
----------------------------------------------------------------------------------------------------
Loss: 0.46477368474006653
Loss: 0.12143351882696152
Loss: 0.08106148988008499
Loss: 0.0801810547709465
Loss: 0.06122320517897606
Loss: 0.06310459971427917
Loss: 0.05681884288787842
Loss: 0.05729678273200989
Loss: 0.05497899278998375
Loss: 0.04439849033951759
Loss: 0.05415581166744232
Loss: 0.06020551547408104
Loss: 0.046830907464027405
Loss: 0.051029372960329056
Loss: 0.0478244312107563
Loss: 0.046767622232437134
Loss: 0.04305662214756012
Loss: 0.05216279625892639
Loss: 0.04748568311333656
Loss: 0.05107741802930832
Loss: 0.04588869959115982
Loss: 0.043014321476221085
Loss: 0.046371955424547195
Loss: 0.04952816292643547
Loss: 0.04472338408231735
  • And finally, let’s look at our inference or sampling from the sample function we defined above.
# sample 64 images
samples = sample(model, image_size=image_size, batch_size=64, channels=channels)

# show a random one
random_index = 5
plt.imshow(samples[-1][random_index].reshape(image_size, image_size, channels), cmap="gray")

  • Seems like the model is capable of generating a nice T-shirt! Keep in mind that the dataset we trained on is pretty low-resolution (28x28).

Creating a GIF

  • Lastly, in order to see the progression of the de-noising process, we can create a GIF:
import matplotlib.animation as animation

random_index = 53

fig = plt.figure()
ims = []
for i in range(timesteps):
    im = plt.imshow(samples[i][random_index].reshape(image_size, image_size, channels), cmap="gray", animated=True)
    ims.append([im])

animate = animation.ArtistAnimation(fig, ims, interval=50, blit=True, repeat_delay=1000)
animate.save('diffusion.gif')
plt.show()

  • Hopefully this was beneficial in clarifying the diffusion model concepts!
  • Furthermore, it his highly recommend looking at Hugging Face’s Training with Diffusers notebook to see how to leverage their Diffusion library to train a simple model.
  • And, for inference, they also provide this notebook where you can see the images being generated.

denoising-diffusion-pytorch package

  • While Diffusion models have not yet been democratized to the same degree as other older architectures/approaches in Machine Learning, there are still implementations available for use. The easiest way to use a diffusion model in PyTorch is to use the denoising-diffusion-pytorch package, which implements an image diffusion model like the one discussed in this article. To install the package, simply type the following command in the terminal:
pip install denoising_diffusion_pytorch

Minimal Example

  • To train a model and generate images, we first import the necessary packages:
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion
  • Next, we define our network architecture, in this case a U-Net. The dim parameter specifies the number of feature maps before the first down-sampling, and the dim_mults parameter provides multiplicands for this value and successive down-samplings:
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)
  • Now that our network architecture is defined, we need to define the diffusion model itself. We pass in the U-Net model that we just defined along with several parameters - the size of images to generate, the number of timesteps in the diffusion process, and a choice between the L1 and L2 norms.
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)
  • Now that the diffusion model is defined, it’s time to train. We generate random data to train on, and then train the diffusion model in the usual fashion:
training_images = torch.randn(8, 3, 128, 128)
loss = diffusion(training_images)
loss.backward()
  • Once the model is trained, we can finally generate images by using the sample() method of the diffusion object. Here we generate 4 images, which are only noise given that our training data was random:
sampled_images = diffusion.sample(batch_size = 4)

Training on Custom Data

  • The denoising-diffusion-pytorch package also allow you to train a diffusion model on a specific dataset. Simply replace the 'path/to/your/images' string with the dataset directory path in the Trainer() object below, and change image_size to the appropriate value. After that, simply run the code to train the model, and then sample as before. Note that PyTorch must be compiled with CUDA enabled in order to use the Trainer class:
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,
    train_num_steps = 700000,         # total training steps
    gradient_accumulate_every = 2,    # gradient accumulation steps
    ema_decay = 0.995,                # exponential moving average decay
    amp = True                        # turn on mixed precision
)

trainer.train()
  • Below you can see progressive denoising from multivariate Gaussian noise to MNIST digits akin to reverse diffusion:

HuggingFace Diffusers

  • HuggingFace diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models.

  • More precisely, HuggingFace Diffusers offers:
    • State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code.
    • Various noise schedulers that can be used interchangeably for the prefered speed vs. quality trade-off in inference.
    • Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system.
    • Training examples to show how to train the most popular diffusion models.

Implementations

Stable Diffusion

  • Stable Diffusion is a state of the art text-to-image model that generates images from text. It’s makes it’s high performance models available to the public at large to use here.
  • Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database which is the largest, freely accessible multi-modal dataset. (source)
  • Let’s now look at how it works with the illustrations below by Jay Alammar.

  • Stable Diffusion is quite versatile because it can be used in a variety of ways.
  • In the image we see above, we can see that it can take text as input and output a generated image. This is the primary use case, however, it is not the only one.

  • As we can see from the image above, another use case of Stable Diffusion is with image and text as input, and it will output a generated image again. This is called img2img.
  • It’s able to be so versatile because Stable Diffusion is not one monolith model, but a system made up of several components and models.
  • To be specific, Stable Diffusion is made up of a:
    • 1) Text Understanding component
    • 2) Image Generation component

  • The text understanding component is actually the text encoder used within CLIP.
  • As we can see represented in the image below, Stable Diffusion takes the input text within the Text Understander component and returns a vector representing each token in the text.
  • This information is then passed over to the Image Generator component which internally is composed of 2 components as well.

  • Now, referring to the image below, let’s look at the two components within the Image Generator component.
    • Image Information Creator:
      • This is the ‘secret sauce’ of Stable Diffusion as it runs for a number of steps refining the information that should go in the image that will become the model’s output.
    • Image Decoder:
      • This component takes the processed information and paints the picture.

  • Let’s zoom out for a second and look at the higher level components we have so far all working together for the image generation task:

  • All the 3 components above are actually individual neural networks working together, specifically, they are:
    • CLIPText: Used to encode the text
    • U-Net + scheduler: Used to gradually process image information(latent diffusion)
    • Autoencoder Decoder: paints the final image

  • Above we can see the steps that Stable Diffusion takes to generate its images.

  • Lastly, let’s zoom into the image decoder and get a better understanding of its inner workings. Remember the image decoder is one of the two components the image generator comprises of

  • The random vector is considered to be random noise.
  • Stable Diffusion is able to obtain it’s speed from the fact that the processing happens in the latent space (which needs less calculations as compared to the pixel space).

Dream Studio

  • Dream Studio is Stable Diffusion’s AI Art Web App Tool.
  • DreamStudio is a new suite of generative media tools engineered to grant everyone the power of limitless imagination and the effortless ease of visual expression through a combination of natural language processing and revolutionary input controls for accelerated creativity.

Midjourney

  • Midjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species.
  • Midjourney has not made it’s architecture details publicly available, but one has to think they still leverage diffusion models in some fashion.
  • While Dall-E 2 creates more realistic images, MidJourney shines in adapting real art styles into creating an image of any combination of things your heart desires.

DALL-E 2

  • DALL-E 2 utilized diffusion models to create its images and was created by OpenAI.
  • DALL-E 2 can make realistic edits to existing images from a natural language caption.
    • It can add and remove elements while taking shadows, reflections, and textures into consideration.
  • DALL-E 2 has learned the relationship between images and the text used to describe them.
  • It uses diffusion, which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.
  • OpenAI has limited the ability for DALL-E 2 to generate violent, hate, or adult images.
  • By removing the most explicit content from the training data, OpenAI has minimized DALL-E 2’s exposure to these concepts.
  • They have also used advanced techniques to prevent photorealistic generations of real individuals’ faces, including those of public figures.
  • Among the most important building blocks in the DALL-E 2 architecture is CLIP to function as the main bridge between text and images.
  • While CLIP does not use a diffusion model, it is essential to understand DALL-E 2 so let’s do a quick recap of CLIP’s architecture.
  • CLIP is a neural network trained on a variety of (image, text) pairs.
  • It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
  • CLIP is a multi-modal vision and language model.
  • It can be used for image-text similarity and for zero-shot image classification.
  • CLIP uses a ViT like transformer to get visual features and a causal language model to get the text features.
  • Both the text and visual features are then projected to a latent space with identical dimension. The dot product between the projected image and text features is then used as a similar score.
  • CLIP enables us to take textual phrases and understand how they map onto images
  • Showcasing a few images generated via Diffusion Models along with their text prompts given:
    • A Corgi puppy painted like the Mona Lisa:

    • Beyonce sitting at a desk and coding:

    • Snow in Hawaii:

    • Sun coming in from a big window with curtains and casting a shadow on the rest of the room, artistic style:

    • The Taj Mahal painted in Starry Night by Vincent Van Gogh:

Conclusion

  • Diffusion models are a conceptually simple and elegant approach to the problem of generating data. Their state-of-the-art results combined with non-adversarial training has propelled them to great heights, and further improvements can be expected in the coming years given their nascent status.
  • In particular, diffusion models have been found to be essential to the performance of cutting-edge models that carry out (un)conditional image/audio/video generation. Popular examples (at the time of writing) include GLIDE and DALL-E 2 by OpenAI, Latent Diffusion by the University of Heidelberg, Imagen by Google Brain, and Stable Diffusion by Stability.ai.

FAQ

What is the difference between DDPM and DDIM models?

  • Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs) are both types of diffusion models used in deep learning, particularly for generating high-quality, complex data such as images. While they share the same underlying principle of diffusion processes, there are key differences in their approach and characteristics.
  • Denoising Diffusion Probabilistic Models (DDPMs):
    • Basic Principle: DDPMs work by gradually adding noise to data over a series of steps, transforming the data into a Gaussian noise distribution. The model then learns to reverse this process, generating new data by denoising starting from noise.
    • Markov Chain Process: The process involves a forward diffusion phase (adding noise) and a reverse diffusion phase (removing noise). Both phases are modeled as Markov chains.
    • Stochastic Sampling: The reverse process in DDPMs is stochastic. This means that during the generation phase, random noise is introduced at each step, leading to variation in the outputs even if the process starts from the same noise.
    • Sampling Time: DDPMs typically require a large number of steps to generate a sample, which can be computationally intensive and time-consuming.
    • High-Quality Generation: DDPMs have been shown to generate high-quality samples that are often indistinguishable from real data, especially in the context of image generation.
  • Denoising Diffusion Implicit Models (DDIMs):
    • Modified Sampling Process: DDIMs are a variant of DDPMs that modify the sampling process. They use a deterministic approach instead of a stochastic one.
    • Deterministic Sampling: In DDIMs, the reverse process is deterministic, meaning no random noise is added during the generation phase. This leads to consistent outputs for the same starting point.
    • Faster Sampling: Because of the deterministic nature and some modifications to the diffusion process, DDIMs can generate samples in fewer steps compared to traditional DDPMs.
    • Flexibility in Time Steps: DDIMs offer more flexibility in choosing the number of timesteps, allowing for a trade-off between generation quality and computational efficiency.
    • Quality vs. Diversity Trade-off: While DDIMs can generate high-quality images like DDPMs, the lack of stochasticity in the reverse process might lead to less diversity in the generated samples.
  • Summary:
    • DDPM uses a stochastic process, adding random noise at each step of the reverse process, leading to high diversity in outputs. It requires more steps for sample generation, which makes it slower but excellent at generating high-quality diverse results.
    • DDIM employs a deterministic reverse process without introducing randomness in the generation phase. It allows for faster sampling with fewer steps and can generate consistent outputs but might lack the diversity seen in DDPM outputs.
  • Both models represent advanced techniques in generative modeling, particularly for applications like image synthesis, where they can generate realistic and varied outputs. The choice between DDPM and DDIM depends on the specific requirements regarding diversity, computational resources, and generation speed.

In diffusion models, there is a forward diffusion process, and a denoising process. For these two processes, when do you use them in training and inference?

  • In diffusion models, which are a class of generative models, the forward diffusion process and the denoising process play distinct roles during training and inference. Understanding when and how these processes are used is key to grasping how diffusion models work.
    • Forward Diffusion Process
      • During Training:
        • Noise Addition: In the forward diffusion process, noise is gradually added to the data over several steps or iterations. This process transforms the original data into a pure noise distribution through a predefined sequence of steps.
        • Training Objective: The model is trained to predict the noise that was added at each step. Essentially, it learns to reverse the diffusion process.
      • During Inference:
        • Not Directly Used: The forward diffusion process is not explicitly used during inference. However, the knowledge gained during training (about how noise is added) is implicitly used to guide the denoising process.
    • Denoising Process
      • During Training:
        • Learning to Reverse Noise: The model learns to denoise the data, i.e., to reverse the forward diffusion process. It does this by predicting the noise that was added at each step during the forward diffusion and then subtracting this noise.
        • Parameter Optimization: The parameters of the model are optimized to make accurate predictions of the added noise, thereby learning to gradually denoise the data back to its original form.
      • During Inference:
        • Data Generation: The denoising process is the key to generating new data. Starting from pure noise, the model iteratively denoises this input, using the reverse of the forward process, to generate a sample.
        • Iterative Refinement: At each step, the model predicts the noise to remove, effectively refining the sample from random noise into a coherent output.
    • Summary
      • Training Phase: Both the forward diffusion (adding noise) and the denoising (removing noise) processes are actively used. The model learns how to reverse the gradual corruption of the data (caused by adding noise) by being trained to predict and remove the noise at each step.
      • Inference Phase: Only the denoising process is used, where the model starts with noise and iteratively applies the learned denoising steps to generate a sample. The forward process is not explicitly run during inference, but its principles underpin the reverse process.
  • In essence, the forward diffusion process is crucial for training the model to understand and reverse the noise addition, while the denoising process is used both in training (to learn this reversal) and in inference (to generate new data).

At a high level, how do diffusion models work? What are some other models that are useful for image generation, and how do they compare to diffusion models?

  • Diffusion models, along with Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), are powerful tools in the domain of image generation. Each of these models employs distinct mechanisms and has its strengths and limitations.
  • How Diffusion Models Work
    • Reversible Noising Process:
      • Forward Process: Diffusion models work by gradually adding noise to an image (or any data) over a series of steps, transitioning it from the original data distribution to a noise distribution. This process is known as the forward diffusion process.
      • Backward Process: The model then learns to reverse this process, which is where the actual generative modeling happens. By learning this reverse diffusion process, the model effectively learns to generate data starting from noise.
    • Data Generation: In the generative phase, the model starts with a sample of random noise and applies the learned reverse transformation to this noise, progressively denoising it to generate a sample of data (e.g., an image).
    • Higher Quality Samples: Diffusion models are known for generating high-quality samples that are often more realistic and less prone to artifacts compared to other generative models.
  • Comparison with VAEs and GANs
    • Variational Autoencoders (VAEs):
      • Mechanism: VAEs consist of an encoder that maps input data to a latent space and a decoder that reconstructs the data from this latent space. The training involves optimizing the reconstruction loss and a regularization term (KL divergence) that keeps the latent space distributions well-behaved.
      • Image Generation: VAEs are effective for image generation but can sometimes produce blurrier results compared to GANs and diffusion models.
    • Generative Adversarial Networks (GANs):
      • Mechanism: GANs involve a generator that creates images and a discriminator that evaluates them. The generator learns to produce increasingly realistic images, while the discriminator improves at distinguishing real images from generated ones.
      • Training Instability: GANs can generate high-quality, sharp images but are known for their training instability issues, such as mode collapse, where the generator produces a limited variety of outputs.
      • Comparison to Diffusion Models: Diffusion models, in contrast, generally don’t have these training instabilities and can produce high-quality images, arguably with more consistency than GANs.
  • Summary
    • Diffusion Models: Known for their ability to generate high-quality images and a more stable training process compared to GANs. They work by reversing a learned noising process and are particularly good at capturing fine details in images such as text in generated images, hair follicles as part of a person’s face, etc.
    • VAEs: Offer a stable training regime and good general-purpose image generation but sometimes lack the sharpness and realism provided by GANs and diffusion models.
    • GANs: Excel in creating sharp and realistic images but can be challenging to train and may suffer from stability issues. Each of these models has its place in the field of image generation, with the choice depending on factors like desired image quality, training stability, and architectural complexity.

What are the loss functions used in Diffusion Models?

  • Diffusion models, which are a type of generative model, use a specific approach to learning that involves gradually adding noise to data and then learning to reverse this process. The loss functions used in diffusion models are designed to facilitate this learning process. The primary loss function used is the denoising score matching loss, often combined with other components depending on the specific type of diffusion model.
  • Denoising Score Matching Loss
    • In diffusion models, the training process involves adding noise to the data in small increments over many steps, resulting in a series of noisier and noisier versions of the original data. The model then learns to reverse this process. The key loss function used in this context is the denoising score matching loss, which can be described as follows:
      • Objective: The model is trained to predict the noise that was added to the data at each step. Essentially, it learns how to reverse the diffusion process.
      • Implementation: Mathematically, this can be implemented as a regression problem where the model predicts the noise added to each data point. The loss function then measures the difference between the predicted noise and the actual noise that was added. A common choice for this is the mean squared error (MSE) between these two values.
  • Variational Lower Bound (ELBO) in Variational Diffusion Models
    • Some diffusion models, particularly variational ones, use a loss function derived from the evidence lower bound (ELBO) principle common in variational inference:
      • Objective: This loss function aims to maximize the likelihood of the data under the model while regularizing the latent space representations.
      • Components: The ELBO for diffusion models typically includes terms that represent the reconstruction error (similar to the denoising score matching loss) and terms that regularize the latent space (like KL divergence in VAEs).
  • Additional Regularization Terms
    • Depending on the specific architecture and objectives of the diffusion model, additional regularization terms might be included:
      • KL Divergence: In some models, especially those that involve variational approaches, a KL divergence term can be included to ensure that the learned distributions in the latent space adhere to certain desired properties.
      • Adversarial Loss: For models that integrate adversarial training principles, an adversarial loss term might be added to encourage the generation of more realistic data.
  • Summary: The choice of loss function in diffusion models is closely tied to their unique training process, which involves learning to reverse a controlled noise-adding process. The denoising score matching loss is central to this, often supplemented by other loss components based on variational principles or additional regularization objectives. The combination of these loss functions allows diffusion models to effectively learn the complex process of generating high-quality data from noisy inputs.

What is the Denoising Score Matching Loss in Diffusion models? Provide equation and intuition.

  • The Denoising Score Matching Loss is a critical component in the training of diffusion models, a class of generative models. This loss function is designed to train the model to effectively reverse a diffusion process, which gradually adds noise to the data over a series of steps.
  • Denoising Score Matching Loss: Equation and Intuition
    • Background:
      • In diffusion models, the data is incrementally noised over a sequence of steps. The reverse process, which the model learns, involves denoising or reversing this noise addition to recreate the original data from noise.
      • Equation:
      • The denoising score matching loss at a particular timestep \(t\) can be formulated as: \(L(\theta)=\mathbb{E}_{x_0, \epsilon \sim \mathcal{N}(0, I), t}\left[\left\|s_\theta\left(x_t, t\right)-\nabla_{x_t} \log p_{t \mid 0}\left(x_t \mid x_0\right)\right\|^2\right]\)
        • where, \(x_0\) is the original data, \(x_t\) is the noised data at timestep \(t\), and $\epsilon$ is the added Gaussian noise.
        • \(s_\theta\left(x_t, t\right)\) is the score (gradient of the log probability) predicted by the model with parameters \(\theta\).
        • \(\nabla_{x_t} \log p_{t \mid 0}\left(x_t \mid x_0\right)\) is the true score, which is the gradient of the log probability of the noised data \(x_t\) conditioned on the original data \(x_0\).
      • Intuition:
        • The loss function encourages the model to predict the gradient of the log probability of the noised data with respect to the data itself. Essentially, it’s training the model to estimate how to reverse the diffusion process at each step.
        • By minimizing this loss, the model learns to approximate the reverse of the noising process, thereby learning to generate data starting from noise.
        • This process effectively teaches the model the denoising direction at each step of the noised data, guiding it on how to gradually remove noise and reconstruct the original data.
      • Importance in Training: The denoising score matching loss is crucial for training diffusion models to generate high-quality samples. It ensures that the model learns a detailed and accurate reverse mapping of the diffusion process, capturing the complex data distribution.
      • Advantages: This approach allows diffusion models to generate samples that are often of higher quality and more diverse compared to other generative models, as it carefully guides the generative process through the learned noise reversal.
  • In summary, the denoising score matching loss in diffusion models is fundamental in training these models to effectively reverse the process of gradual noise addition, enabling the generation of high-quality data samples from a noise distribution. This loss function is key to the model’s ability to learn the intricate details of the data distribution and the precise dynamics of the denoising process.

What does the “stable” in stable diffusion refer to?

  • The “stability” in stable diffusion also refers to maintaining image content in the latent space throughout the diffusion process. In diffusion models, the image is transformed from the pixel space to the “latent space” – this is a high-dimensional abstract representation of the image. Here are the differences between the two:
    • Pixel Space:
      • This refers to the space in which the data (such as images) is represented in its raw form – as pixels.
      • Each dimension corresponds to a pixel value, so an image of size 100x100 would have a pixel space of 10,000 dimensions.
      • Pixel space representations are direct and intuitive but can be very high-dimensional and sparse for complex data like images.
    • Latent Space:
      • Latent space is a lower-dimensional space where data is represented in a more compressed and abstract form.
      • Generative models, like Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), encode high-dimensional data (from pixel space) into this lower-dimensional latent space.
      • The latent representation captures the essential features or characteristics of the data, allowing for more efficient processing and manipulation.
      • Operations and transformations are often performed in latent space because they can be more meaningful and computationally efficient. For example, interpolating between two points in latent space can result in a smooth transition between two images when decoded back to pixel space.
  • The “Stable” in Stable Diffusion refers to the fact that the forward and reverse diffusion process occur in a low-dimensional latent space vs. a high-dimensional pixel space leading to stability during diffusion. If the latent space becomes unstable and loses image content too quickly, the generated pixel space images will be poor.
  • Stable diffusion uses techniques to keep the latent space more stable throughout the diffusion process:
    • The denoising model tries to remove noise while preserving latent space content at each step.
    • Regularization prevents the denoising model from changing too drastically between steps.
    • Careful noise scheduling maintains stability in early diffusion steps.
  • This stable latent space leads to higher quality pixel generations. At the end, Stable Diffusion transforms the image from latent space back to the pixel space.

What is the role of Classifier-free Guidance in diffusion models?

  • Some diffusion models use a technique called classifier-free guidance to balance the fidelity to the text prompt with the model’s creative freedom. Classifier-free guidance thus plays a crucial role in enhancing the generation of images or other data types. Here’s a breakdown of its key functions:
    1. Improving Sample Quality: Classifier-free guidance helps in improving the overall quality of the generated samples. This is achieved by adjusting the model’s behavior to generate more realistic and coherent outputs, which is particularly useful in applications like image generation where fidelity to real-world appearances is important.
    2. Controlling the Generation Process: It provides a way to control the generation process without relying on a separate classifier model. This is beneficial because it simplifies the architecture and reduces the computational overhead that comes with using an additional classifier.
    3. Flexibility in Guidance to Balance Fidelity/Adherence and Creativity: Classifier-free guidance allows for varying degrees of guidance. By adjusting a guidance scale parameter, users can control how closely the generated output adheres to the input prompt. The guidance scale is often a tunable parameter, allowing users to choose between more literal or more imaginative interpretations of the prompt. This flexibility is useful in applications where different levels of adherence to the input conditions are desired.
    4. Reduction in Model Bias: Since classifier-free guidance does not depend on a separate classifier model, it can help in reducing biases that might be present in such classifiers. This is especially important in scenarios where the bias in data can lead to undesirable outcomes.
    5. Broader Applicability: The approach is broadly applicable to various types of data generation tasks, not just image generation. It can be used in text generation, audio synthesis, and other areas where generative models are applied.
  • In summary, classifier-free guidance in diffusion models enhances the generation quality, offers better control over the generation process, reduces model complexity, mitigates bias, and has wide applicability across different domains of generative modeling.

How do you condition a diffusion model to the textual input prompt?

  • In the context of text-to-image generation, conditioning a diffusion model to a textual input prompt involves a multi-step process where the model learns to align its outputs (e.g., images) with the semantic content of the text prompt. Here’s an overview of how this is typically achieved:
    1. Pre-trained Language Model Integration:
      • Language Model Embeddings: The process starts by encoding the text prompt using a pre-trained language model (like GPT, BERT, or a similar transformer-based model). This model converts the text into a high-dimensional vector that captures its semantic meaning.
      • Alignment with Diffusion Model: The embeddings from the language model are then integrated with the diffusion model. This integration ensures that the diffusion model ‘understands’ the content and context of the text prompt.
    2. Training with Paired Data:
      • Paired Dataset: The diffusion model is trained on a large dataset of text-image pairs. These pairs help the model learn the association between textual descriptions and their corresponding visual representations.
      • Loss Function: The training involves a loss function that penalizes the model when the generated images do not correspond well to the text prompts. This encourages the model to generate images that are semantically aligned with the input text.
    3. Iterative Refinement Process:
      • Diffusion Process: The diffusion model generates images through an iterative process. It starts with a random noise distribution and gradually refines this into a coherent image over numerous steps.
      • Guidance at Each Step: During these steps, the model is ‘guided’ by the text embeddings, ensuring that each refinement moves the image closer to a representation that aligns with the text prompt.
    4. Classifier-Free Guidance:
      • Balancing Fidelity and Creativity: Some models use a technique called classifier-free guidance to balance the fidelity to the text prompt with the model’s creative freedom. This is often a tunable parameter, allowing users to choose between more literal or more imaginative interpretations of the prompt.
    5. Fine-tuning for Specific Tasks:
      • Custom Datasets: In some cases, the model may be further fine-tuned on specific types of text-image pairs (like medical images and reports, or fashion items and descriptions) to improve its performance in certain domains.
    6. Key Challenges:
      • Semantic Alignment: Ensuring that the model accurately captures the nuances of the text prompt in the generated image.
      • Diversity and Creativity: Maintaining a balance between adhering to the text prompt and generating diverse, creative outputs.
      • Avoiding Biases: Minimizing biases in generated images, which can stem from biases present in the training data.
  • In conclusion, conditioning a diffusion model on a textual input prompt involves a complex interplay between language understanding (via language model embeddings), training on paired data, iterative refinement, and potentially classifier-free guidance, all geared towards aligning the model’s output with the semantic content of the text prompt.

How is randomness in the outputs induced in a diffusion model?

  • Randomness in the outputs of a diffusion model, such as DALL-E, is induced through a process that involves gradually adding and then removing noise from an image. The key steps in this process are:

    1. Starting with Noise: The model begins with a random noise image. This initial noise is essentially a canvas of random pixel values, providing the initial randomness to the process.

    2. Forward Diffusion Process: Over several steps, the model incrementally adds more noise to the image. This process, known as the forward diffusion, transforms the image from a coherent state into a state of pure noise gradually. The model learns how to apply this noise in a controlled manner.

    3. Reverse Diffusion Process: Once the image is fully noisy, the reverse process begins. The model gradually denoises the image, step by step. In each step, it predicts a slightly less noisy version of the current image, using its understanding of how images are structured and what kind of content it is supposed to generate. This is where the model’s training and knowledge come into play.

    4. Incorporating Randomness in the Reverse Process: During the reverse diffusion process, the model introduces randomness at each step. This is usually done by sampling from a probability distribution (like a Gaussian distribution) to decide how much and what kind of noise to remove or what features to add in each step.

    5. Guidance by the Conditioning Input: In a conditional diffusion model like DALL-E, the process is guided by the input prompt or conditioning signal. This signal influences what kind of image the model tries to recreate from the noise, but the inherent randomness in the denoising process ensures that each output is unique.

    6. Final Output: The final output is generated after several steps of this reverse diffusion process, each step making the image more coherent and closer to the desired outcome, yet maintaining an element of randomness.

  • The randomness in each step of the reverse diffusion process ensures that even with the same starting point and the same conditioning input, the model can generate a wide variety of different outputs. This is a crucial feature for creative applications like image generation, where uniqueness and variation are desirable.

Recent Papers

High-Resolution Image Synthesis with Latent Diffusion Models

  • The following paper summary has been contributed by Zhibo Zhang.
  • Diffusion models are known to be computationally expensive given that they require many steps of diffusion and denoising diffusion operations in possibly high-dimensional input feature spaces.
  • This paper by Rombach et al. from Ludwig Maximilian University of Munich & IWR, Heidelberg University and Runway ML in CVPR 2022 introduces diffusion models that operate on the latent space, aiming at generating high-resolution images with lower computation demands compared to those that operate directly on the pixel space.
  • In particular, the authors adopted an autoencoder that compresses the input images into a lower dimensional latent space. The autoencoder relies on either KL regularization or VQ regularization to constrain the variance of the latent space.
  • As shown in the illustration figure below by Rombach et al., in the latent space, the latent representation of the input image goes through a total of \(T\) diffusion operations to get the noisy representation. A U-Net is then applied on top of the noisy representation for \(T\) iterations to produce the denoised version of the representation. In addition, the authors introduced a cross attention mechanism to condition the denoising process on other types of inputs such as text and semantic maps.
  • In the final stage, the denoised representation will be mapped back to the pixel space using the decoder to get the synthesized image.
  • Empirically, the best performing latent diffusion model (with a carefully chosen downsampling factor) achieved competitive FID scores in image generation when comparing with a few other state-of-the-art generative models such as variations of generative adversarial nets on a few datasets including the CelebA-HQ dataset.
  • Code

Diffusion Model Alignment Using Direct Preference Optimization

  • This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
  • The paper introduces Diffusion-DPO, a method adapted from Direct Preference Optimization (DPO), for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
  • Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
  • The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
  • The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
  • The figure below from paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPOSDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

  • Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
  • The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
  • The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
  • In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.

Scalable Diffusion Models with Transformers

  • This paper by Peebles and Xie from UC Berkeley and New York University introduces a new class of diffusion models that leverage the Transformer architecture for generating images. This innovative approach replaces the traditional convolutional U-Net backbone in latent diffusion models (LDMs) with a transformer operating on latent patches.
  • Traditional diffusion models in image-level generative tasks predominantly use a convolutional U-Net architecture. However, the dominance of transformers in various domains like natural language processing and vision prompts this exploration of their use as a backbone for diffusion models.
  • The paper proposes Diffusion Transformers (DiTs), which adhere closely to the standard Vision Transformer (ViT) model but with some vital tweaks. DiTs are designed to be faithful to standard transformer architecture, particularly the Vision Transformer (ViT) model, and are trained as latent diffusion models of images.
  • Transformer Blocks and Design Space:
    • DiTs process input tokens transformed from spatial representations of images (“patchify” process) through a sequence of transformer blocks.
    • Four types of transformer blocks are explored: in-context conditioning, cross-attention block, adaptive layer norm (adaLN) block, and adaLN-Zero block. Each block processes additional conditional information like noise timesteps or class labels.
    • The adaLN-Zero block, which initializes each DiT block as an identity function and modulates the activations immediately prior to any residual connections within the block, demonstrates the most efficient performance, achieving lower Frechet Inception Distance (FID) values than the other block types.
  • The figure below from the paper shows the Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.

  • Model Scaling and Performance:
    • DiTs are scalable in terms of forward pass complexity, measured in GFLOPs. Different configurations (DiT-S, DiT-B, DiT-L, DiT-XL) cover a range of model sizes and computational complexities.
    • Increasing model size and decreasing patch size significantly improves performance. FID scores improve as the transformer becomes deeper and wider, indicating that scaling model size (GFLOPs) is key to improved performance.
    • The largest DiT-XL/2 models outperform all prior diffusion models, achieving a state-of-the-art FID of 2.27 on class-conditional ImageNet benchmarks at resolutions of 256 \(\times\) 256 and 512 \(\times\) 512.
  • Implementation and Results: The models are trained using the AdamW optimizer. The DiT-XL/2 model, trained for 7 million steps, demonstrates high compute efficiency compared to both latent and pixel space U-Net models.
  • Visual Quality: The paper highlights notable improvements in the visual quality of generated images with scaling in both model size and the number of tokens processed.
  • Overall, the paper showcases the potential of transformer-based architectures in diffusion models, emphasizing scalability and compute efficiency, which contributes significantly to the field of generative models for images.
  • Project page.

DeepFloyd IF

  • DeepFloyd, a part of Stability AI, has introduced DeepFloyd IF, a cutting-edge text-to-image cascaded pixel diffusion model known for its high photorealism and language understanding capabilities. This model is an open-source project and represents a significant advancement in text-to-image synthesis technology.
  • DeepFloyd IF is built with multiple neural modules (independent neural networks that tackle specific tasks), joining forces within a single architecture to produce a synergistic effect.
  • DeepFloyd IF generates high-resolution images in a cascading manner: the action kicks off with a base model that produces low-resolution samples, which are then boosted by a series of upscale models to create stunning high-resolution images, as shown in the figure (source) below.

  • DeepFloyd IF’s base and super-resolution models adopt diffusion models, making use of Markov chain steps to introduce random noise into the data, before reversing the process to generate new data samples from the noise.
  • DeepFloyd IF operates within the pixel space, as opposed to latent diffusion (e.g. Stable Diffusion) that depends on latent image representations.
  • The unique structure of DeepFloyd IF consists of a frozen text encoder and three cascaded pixel diffusion modules. The process begins with a base model generating a 64 \(\times\) 64 pixel image from a text prompt. This is followed by two super-resolution models, each escalating the resolution to 256 \(\times\) 256 pixels and then to 1024 \(\times\) 1024 pixels. All stages utilize a frozen text encoder based on the T5 transformer architecture, which extracts text embeddings. These embeddings are then input into a UNet architecture, which is enhanced with cross-attention and attention pooling features.
  • The figure below from the paper shows the model architecture of DeepFloyd IF.

  • The efficiency and effectiveness of DeepFloyd IF are evident in its performance, where it achieved a zero-shot FID score of 6.66 on the COCO dataset. This score is a testament to its state-of-the-art capabilities, outperforming other models in the domain. The success of DeepFloyd IF underscores the potential of larger UNet architectures in the initial stages of cascaded diffusion models and opens new avenues for future advancements in text-to-image synthesis.
  • Code; Project page.

PIXART-\(\alpha\): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

  • The paper “PIXART-\(\alpha\): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis” by Chen et al. introduces PIXART-\(\alpha\), a transformer-based latent diffusion model for text-to-image (T2I) synthesis. This model competes with leading image generators such as SDXL, Imagen, and DALL-E 2 in quality, while significantly reducing training costs and CO2 emissions. Notably, it is also OPEN-RAIL licensed.
  • Key Innovations:
    • Efficient Architecture: PIXART-\(\alpha\) employs a Diffusion Transformer (DiT) with cross-attention modules, focusing on efficiency. This includes a streamlined class-condition branch and reparameterization for efficient training.
    • Training Strategy Decomposition: The training is divided into three stages: learning pixel distributions, text-image alignment, and aesthetic enhancement.
    • High-Informative Data: Utilizes an auto-labeling pipeline with LLaVA to create a dense, precise text-image dataset, improving the speed of text-image alignment learning.
  • Technical Implementation:
    • Text Encoding: Uses the T5-XXL model for advanced text encoding, enabling better handling of complex prompts.
    • Pre-training and Stages: Incorporates pre-training on ImageNet, learning stages for pixel distribution, alignment, and aesthetics.
    • Hardware Requirements: Initially requires 23GB GPU VRAM, but with diffusers, it can run under 7GB.
  • The figure below from the paper shows the model architecture of PIXART-\(\alpha\). A cross-attention module is integrated into each block to inject textual conditions. To optimize efficiency, all blocks share the same adaLN-single parameters for time conditions.

  • Performance and Efficiency:
    • Quality and Control: Delivers high-quality image synthesis with superior semantic control.
    • Resource Efficiency: Achieves near state-of-the-art quality with only 2% of the training cost of other models, reducing CO2 emissions by 90%.
    • Optimization Techniques: Implements shared normalization parameters (adaLN-single) and uses AdamW optimizer to enhance efficiency.
  • Applications and Extensions: Showcases versatility through methods like DreamBooth and ControlNet, further expanding its practical applications.
  • PIXART-\(\alpha\) represents a major advancement in T2I generation, offering a high-quality, efficient, and environmentally friendly solution. Its unique architecture and training strategy make it an innovative addition to the field of photorealistic T2I synthesis.
  • Code; Hugging Face; Project page

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

  • This technical report by Xue et al. from the University of Hong Kong and SenseTime Research, the authors introduce RAPHAEL, a novel text-to-image diffusion model that generates highly artistic images closely aligned with textual prompts.
  • RAPHAEL uniquely combines tens of mixture-of-experts (MoEs) layers, including space-MoE and time-MoE layers, allowing billions of diffusion paths. Each path intuitively functions as a “painter” for depicting specific textual concepts onto designated image regions at certain diffusion timesteps. This mechanism substantially enhances the precision in aligning text and image content.
  • The authors report that RAPHAEL outperforms recent models like Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2 in terms of image quality and aesthetic appeal. This is evidenced by superior performance in diverse styles (e.g., Japanese comics, realism, cyberpunk) and a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset.
  • An edge-supervised learning module is introduced to further refine image quality, focusing on maintaining intricate boundary details in various styles. RAPHAEL is implemented using a U-Net architecture with 16 transformer blocks, each containing a self-attention layer, a cross-attention layer, space-MoE, and time-MoE layers. The model, with three billion parameters, was trained on 1,000 A100 GPUs for two months.
  • Framework of RAPHAEL. (a) Each block contains four primary components including a selfattention layer, a cross-attention layer, a space-MoE layer, and a time-MoE layer. The space-MoE is responsible for depicting different text concepts in specific image regions, while the time-MoE handles different diffusion timesteps. Each block uses edge-supervised cross-attention learning to further improve image quality. (b) shows details of space-MoE. For example, given a prompt “a furry bear under sky”, each text token and its corresponding image region (given by a binary mask) are directed through distinct space experts, i.e., each expert learns particular visual features at a region. By stacking several space-MoEs, we can easily learn to depict thousands of text concepts.

  • The authors conducted extensive experiments, including a user study using the ViLG-300 benchmark, demonstrating RAPHAEL’s robustness and superiority in generating images that closely conform to the textual prompts. The study also showcases RAPHAEL’s flexibility in generating images of diverse styles and high resolutions up to 4096 \(\times\) 6144 when combined with a tailor-made SR-GAN model.
  • RAPHAEL’s potential applications extend to various domains, with implications for both academic research and industry. The model’s limitations include the potential misuse for creating misleading or false information, a challenge common to powerful text-to-image generators.
  • Project page

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

  • This paper by Feng et al. from Baidu Inc. and Wuhan University of Science and Technology in CVPR 2023 focuses on enhancing text-to-image generation using diffusion models.
  • They introduce ERNIE-ViLG 2.0, a large-scale Chinese text-to-image generation model, employing a diffusion-based approach with a 24B parameter scale. The model aims to significantly upgrade image quality and text relevancy.
  • The model incorporates fine-grained textual and visual knowledge to improve semantic control and resolve object-attribute mismatching in image generation. This is achieved by using a text parser and an object detector to identify key elements in the text-image pair and aligning them in the learning process.
  • Introduction of the Mixture-of-Denoising-Experts (MoDE) mechanism, which uses multiple specialized expert networks for different stages of the denoising process, allowing for more efficient handling of various denoising requirements at different steps.
  • The figure below from the paper shows the architecture of ERNIE-ViLG 2.0, which incorporates fine-grained textual and visual knowledge of key elements in the scene and utilizes different denoising experts at different denoising stages.

  • ERNIE-ViLG 2.0 demonstrates state-of-the-art performance on MS-COCO with a zero-shot FID-30k score of 6.75. It also outperforms models like DALL-E 2 and Stable Diffusion in human evaluations using a bilingual prompt set, ViLG-300, for a fair comparison between English and Chinese text-to-image models.
  • The model’s implementation involves a transformer-based text encoder with 1.3B parameters, 10 denoising U-Net experts with 2.2B parameters each, and training on 320 Tesla A100 GPUs for 18 days. The dataset comprises 170M image-text pairs, including English datasets translated into Chinese.
  • Ablation studies and qualitative showcases confirm the effectiveness of the proposed knowledge enhancement strategies and the MoDE mechanism. The model shows improved handling of complex prompts, better sharpness, and texture in generated images.
  • Future work includes enriching external image-text alignment knowledge and expanding the usage of multiple experts to advance generation capabilities. The paper also discusses potential risks and limitations related to data bias and model misuse in text-to-image generation.
  • Project page.

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

  • This paper by Podell et al. from Stability AI Applied Research details significant advancements in the field of text-to-image synthesis using latent diffusion models (LDMs).
  • The paper introduces SDXL, a latent diffusion model that significantly improves upon previous versions of Stable Diffusion for text-to-image synthesis.
  • SDXL incorporates a UNet architecture three times larger than its predecessors, primarily due to an increased number of attention blocks and a larger cross-attention context. This is achieved by using a second text encoder, significantly enhancing the model’s capabilities.
  • Novel conditioning schemes are introduced, such as conditioning on original image resolution and cropping parameters. This conditioning is achieved through Fourier feature encoding and significantly improves the model’s performance and flexibility.
  • SDXL is trained on multiple aspect ratios, a notable departure from standard square image outputs. This training approach allows the model to better handle images with varied aspect ratios, reflecting real-world data more accurately.
  • An improved autoencoder is used, enhancing the fidelity of generated images, particularly in high-frequency details.
  • The paper also discusses a refinement model used as a post-hoc image-to-image technique to further improve the visual quality of samples generated by SDXL. SDXL demonstrates superior performance compared to earlier versions of Stable Diffusion and rivals state-of-the-art black-box image generators. The model’s performance was validated through user studies and quantitative metrics.
  • The figure below from the illustrates: (Left) Comparing user preferences between SDXL and Stable Diffusion 1.5 & 2.1. While SDXL already clearly outperforms Stable Diffusion 1.5 & 2.1, adding the additional refinement stage boosts performance. (Right) Visualization of the two-stage pipeline: They generate initial latents of size 128 \(\times\) 128 using SDXL. Afterwards, they utilize a specialized high-resolution refinement model and apply SDEdit on the latents generated in the first step, using the same prompt. SDXL and the refinement model use the same autoencoder.

  • The authors emphasize the open nature of SDXL, highlighting its potential to foster transparency in large model training and evaluation, which is crucial for responsible and ethical deployment of such technologies.
  • The paper represents a significant step forward in generative modeling for high-resolution image synthesis, showcasing the potential of latent diffusion models in creating detailed and realistic images from textual descriptions.

Dreamix: Video Diffusion Models are General Video Editors

  • Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied for image editing, very few works have done so for video editing.
  • This paper by Molad et al. from Google Research and The Hebrew University of Jerusalem presents the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high resolution information that it synthesized to align with the guiding text prompt.
  • The following figure from the paper shows the video editing use-case with Dreamix: Frames from a video conditioned on the text prompt “A bear dancing and jumping to upbeat music, moving his whole body“. Dreamix transforms the eating monkey (top row) into a dancing bear, affecting appearance and motion (bottom row). It maintains fidelity to color, posture, object size and camera pose, resulting in a temporally consistent video.

  • As obtaining high-fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity.
  • They propose to improve motion editability by a new, mixed objective that jointly finetunes with full temporal attention and with temporal attention masking.
  • They further introduce a new framework for image animation. They first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use their general video editor to animate it.
  • As a further application, Dreamix can be used for subject-driven video generation. Extensive qualitative and numerical experiments showcase the remarkable editing ability of Dreamix and establish its superior performance compared to baseline methods.
  • The following figure from the paper illustrates the process of inference. Dreamix supports multiple applications by application dependent pre-processing (left), converting the input content into a uniform video format. For image-to-video, the input image is duplicated and transformed using perspective transformations, synthesizing a coarse video with some camera motion. For subject-driven video generation, the input is omitted - finetuning alone takes care of the fidelity. This coarse video is then edited using their general “Dreamix Video Editor“ (right): we first corrupt the video by downsampling followed by adding noise. We then apply the finetuned text-guided VDM, which upscales the video to the final spatio-temporal resolution.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

  • This paper by Blattmann et al. from Stability AI introduces Stable Video Diffusion (SVD), a latent video diffusion model designed for high-resolution text-to-video and image-to-video generation. They address the challenge of lacking a unified strategy for curating video data and propose a methodical curation process for training successful video LDMs, which includes three stages:
    • Stage I: Text-to-image (or simply, image pretraining), i.e., a 2D text-to-image diffusion model.
    • Stage II: video pretraining, which trains on large amounts of videos.
    • Stage III: video finetuning, which refines the model on a small subset of high-quality videos at higher resolution.
  • In the initial stage, leveraging insights from large-scale image model training, the authors curated an extensive pretraining dataset named LVD, consisting of approximately 580 million annotated video clip pairs. This dataset underwent rigorous processing, including cut detection and annotation using several methods such as image captioning and optical flow analysis, to filter out low-quality content. Specifically, to avoid the samples in the dataset that can be expected to degrade the performance of the final video model, such as clips with less motion, excessive text presence, or generally low aesthetic value, they therefore additionally annotate the dataset with dense optical flow calculated at 2 FPS, with which static scenes are filtered out by removing any videos whose average optical flow magnitude is below a certain threshold.
  • The following figure from the paper shows that the initial dataset contains many static scenes and cuts which hurts training of generative video models. Left: Average number of clips per video before and after our processing, revealing that our pipeline detects lots of additional cuts. Right: The distribution of average optical flow score for one of these subsets before processing, which contains many static clips.

  • The paper outlines the importance of each training stage and demonstrates that systematic data curation significantly boosts model performance. Notably, they emphasize the necessity of pretraining on a well-curated dataset for generating high-quality videos, showing that models pretrained in this manner outperform others when finetuned on smaller, high-quality datasets.
  • Leveraging the curated dataset, the authors trained a base model that provides a comprehensive motion representation. This base model was further finetuned for several applications, including text-to-video and image-to-video generation, demonstrating state-of-the-art performance. The model also supports controlled camera motion through LoRA modules and has been shown to serve as a robust multi-view 3D prior, capable of generating multiple consistent views of an object in a feedforward manner.
  • The SVD model stands out for its ability to efficiently generate high-fidelity videos from both text and images, offering a substantial advancement over existing methods in terms of visual quality and consistency. The authors released the code and model weights, contributing a valuable resource to the research community for further exploration and development in video generation technology.
  • Blog; Code; Hugging Face

Fine-tuning Diffusion Models

  • Per Using LoRA for Efficient Stable Diffusion Fine-Tuning, Low-Rank Adaptation (LoRA) can be used for efficient fine-tuning of large language models, originally introduced by Microsoft researchers.
  • LoRA involves freezing pre-trained model weights and adding trainable layers to reduce the number of parameters and GPU memory requirements. It has been applied to Stable Diffusion fine-tuning, particularly in cross-attention layers.
  • The technique enables quicker and less computationally intensive training, resulting in much smaller trained weights. The article also covers the use of LoRA in diffusers for Dreambooth and full fine-tuning methods, highlighting the reduced training time and lower computational requirements.
  • Additionally, it introduces methods like Textual Inversion and Pivotal Tuning, which are complementary to LoRA. The page includes code snippets for using LoRA in Stable Diffusion fine-tuning and Dreambooth training.

Further Reading

The Illustrated Stable Diffusion

Understanding Diffusion Models: A Unified Perspective

  • This tutorial paper by Calvin Luo from Google Brain goes from the basics of ELBO, VAE, and hierarchical VAE to diffusion models.

The Annotated Diffusion Model

  • This blog post by HugginFace takes a deeper look into Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models or simply autoencoders) as researchers have been able to achieve remarkable results with them for generative models. It goes over the original DDPM paper (Ho et al., 2020), implementing it step-by-step in PyTorch, based on Phil Wang’s implementation - which itself is based on the original TensorFlow implementation.

Lilian Weng: What are Diffusion Models?

  • This tutorial paper by Lilian Weng from OpenAI covers the math behind diffusion models in detail.

Stable Diffusion - What, Why, How?

  • This YouTube video by Edan Meyer explains how Stable Diffusion works at a high level, briefly talks about how it is different from other Diffusion-based models, compares it to DALL-E 2, and digs into the code.

How does Stable Diffusion work? – Latent Diffusion Models Explained

  • This YouTube video by Letitia covers diffusion models, injecting text to generate images (conditional generation), and stable diffusion as a latent diffusion model.

Jupyter notebook on the theoretical and implementation aspects of Score-based Generative Models (SGMs)

  • Great technical explanation and implementation (along with a JAX implementation) of score-based generative models (SGMs), also called diffusion models.

References

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledDiffusionModels,
  title   = {Diffusion Models},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}