Aman's AI Journal • Diffusion Models
 Background
 Overview
 Introduction
 Benefits
 Definitions
 Diffusion models: A Deep Dive
 Training
 Model Choices
 Network Architecture
 Final Objective
 Diffusion Model Theory Summary
 Diffusion models in PyTorch
 Training on Custom Data
 HuggingFace Diffusers
 Final Words
 References
 Citation
Background
 There are three common types of generative models: GAN, VAE, and flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own. GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAE relies on a surrogate loss. Flow models have to use specialized architectures to construct a reversible transform.
 Diffusion models are inspired by nonequilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
 The following diagram from Lilian Weng’s blog provides an overview of the different types of generative models:
Overview
 The meteoric rise of diffusion models is one of the biggest developments in Machine Learning in the past several years.
 Diffusion models are generative models which have been gaining significant popularity in the past several years, and for good reason. A handful of seminal papers released in the 2020s alone have shown the world what Diffusion models are capable of, such as beating GANs on image synthesis. Most recently, practitioners will have seen Diffusion models used in DALL-E 2, OpenAI’s image generation model released last month.
 Given the recent wave of success by Diffusion models, many Machine Learning practitioners are surely interested in their inner workings.
 In this article, we will examine the theoretical foundations for Diffusion models, and then demonstrate how to generate images with a diffusion model in PyTorch. Let’s dive in!
Introduction
 Diffusion probabilistic models (also simply called diffusion models) are generative models, meaning that they are used to generate data similar to the data on which they are trained.
 Fundamentally, diffusion models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. In other words, diffusion models are parameterized Markov chain models trained to gradually denoise data. After training, we can use the diffusion model to generate data by simply passing randomly sampled noise through the learned denoising process.
 The following diagram shows that diffusion models can be used to generate images from noise (figure modified from source):
 More specifically, a diffusion model is a latent variable model which maps to the latent space using a fixed Markov chain. This chain gradually adds noise to the data in order to obtain the approximate posterior \(q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_{0}\right)\), where \(\mathbf{x}_{1}, \ldots, \mathbf{x}_{T}\) are the latent variables with the same dimensionality as \(\mathbf{x}_{0}\). In the figure below (modified from source), we see such a Markov chain manifested for image data.
 Ultimately, the image is asymptotically transformed to pure Gaussian noise. The goal of training a diffusion model is to learn the reverse process, i.e. training \(p_{\theta}\left(x_{t-1} \mid x_{t}\right)\). By traversing backwards along this chain, we can generate new data, as shown below (figure modified from source).
Benefits
 Diffusion probabilistic models are latent variable models capable of synthesizing high-quality images. As mentioned above, research into diffusion models has exploded in recent years. Inspired by nonequilibrium thermodynamics, diffusion models currently produce state-of-the-art image quality, examples of which can be seen below (figure adapted from source):
 Beyond cutting-edge image quality, diffusion models come with a host of other benefits, including not requiring adversarial training. The difficulties of adversarial training are well-documented; and, in cases where non-adversarial alternatives exist with comparable performance and training efficiency, it is usually best to utilize them. On the topic of training efficiency, diffusion models also have the added benefits of scalability and parallelizability.
 While diffusion models almost seem to be producing results out of thin air, there are a lot of careful and interesting mathematical choices and details that provide the foundation for these results, and best practices are still evolving in the literature. Let’s take a look at the mathematical theory underpinning diffusion models in more detail now.
 Their performance is reportedly superior to that of other recent state-of-the-art generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), in most cases.
Definitions
Models
 Diffusion models are neural models that model \(p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)\) and are trained end-to-end to denoise a noisy input to a continuous output such as an image/audio (similar to how GANs generate continuous outputs). Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet.
 The following figure from the DDPM paper shows the process of a diffusion model:
Schedulers
 Algorithm class for both inference and training. The class provides functionality to compute the previous image according to the alpha/beta schedule, as well as to predict noise for training. Examples: DDPM, DDIM, PNDM, DEIS.
 The figure below from the DDPM paper shows the sampling and training algorithms:
Sampling and training algorithms
 Diffusion Pipeline: End-to-end pipeline that includes multiple diffusion models, possible text encoders, super-resolution model (for high-res image generation, in case of Imagen), etc.
 Examples: GLIDE, LatentDiffusion, Imagen, DALL-E 2.
 The figure below from the Imagen paper shows the overall flow of the model.
Diffusion models: A Deep Dive

As mentioned above, a diffusion model consists of a forward process (or diffusion process), in which a datum (generally an image) is progressively noised, and a reverse process (or reverse diffusion process), in which noise is transformed back into a sample from the target distribution.

The sampling chain transitions in the forward process can be set to conditional Gaussians when the noise level is sufficiently low. Combining this fact with the Markov assumption leads to a simple parameterization of the forward process:
\[q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_{0}\right):=\prod_{t=1}^{T} q\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right):=\prod_{t=1}^{T} \mathcal{N}\left(\mathbf{x}_{t} ; \sqrt{1-\beta_{t}} \mathbf{x}_{t-1}, \beta_{t} \mathbf{I}\right)\] where \(\beta_{1}, \ldots, \beta_{T}\) is a variance schedule (either learned or fixed) which, if well-behaved, ensures that \(\mathbf{x}_{T}\) is nearly an isotropic Gaussian for sufficiently large \(T\).
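To make the effect of the variance schedule concrete, here is a minimal sketch that tracks the mean coefficient and the variance of \(x_t\) given \(x_0\) for a scalar datum, applying the Gaussian transition above one step at a time. The linear schedule endpoints, the choice \(T=1000\), and the helper names `linear_schedule` and `forward_moments` are assumptions made for illustration only.

```python
import math


def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T spaced linearly (endpoints are assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_moments(betas):
    """Propagate the mean coefficient and variance of x_t given x_0.

    One transition x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t) maps
    mean m -> sqrt(1 - beta_t) * m and variance v -> (1 - beta_t) * v + beta_t.
    """
    mean_coeff, variance = 1.0, 0.0  # x_0 is known exactly
    for beta in betas:
        mean_coeff *= math.sqrt(1.0 - beta)
        variance = (1.0 - beta) * variance + beta
    return mean_coeff, variance


betas = linear_schedule(1000)
mean_coeff, variance = forward_moments(betas)
print(f"mean coefficient of x_0 in x_T: {mean_coeff:.5f}")
print(f"variance of x_T given x_0:      {variance:.5f}")
```

Under these assumed values the contribution of \(x_0\) to \(x_T\) nearly vanishes while the variance approaches 1, which is what "nearly an isotropic Gaussian" means in practice.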

Given the Markov assumption, the joint distribution of the latent variables is the product of the Gaussian conditional chain transitions (figure modified from source).

As mentioned previously, the “magic” of diffusion models comes in the reverse process. During training, the model learns to reverse this diffusion process in order to generate new data. Starting with the pure Gaussian noise \(p\left(\mathbf{x}_{T}\right):=\mathcal{N}\left(\mathbf{x}_{T} ; \mathbf{0}, \mathbf{I}\right)\), the model learns the joint distribution \(p_{\theta}\left(\mathbf{x}_{0: T}\right)\) as,
\[p_{\theta}\left(\mathbf{x}_{0: T}\right):=p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} \mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \boldsymbol{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\] where the time-dependent parameters of the Gaussian transitions are learned. Note in particular that the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep (or following timestep, depending on how you look at it).
Training
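 A diffusion model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training equivalently consists of minimizing the variational upper bound on the negative log likelihood:
\[\mathbb{E}\left[-\log p_{\theta}\left(\mathbf{x}_{0}\right)\right] \leq \mathbb{E}_{q}\left[-\log \frac{p_{\theta}\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_{0}\right)}\right]=: L_{v l b}\]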

Notation Detail: Note that \(L_{v l b}\) is technically an upper bound (the negative of the ELBO) which we are trying to minimize, but we refer to it as \(L_{v l b}\) for consistency with the literature.

We seek to rewrite the \(L_{v l b}\) in terms of Kullback-Leibler (KL) Divergences. The KL Divergence is an asymmetric statistical distance measure of how much one probability distribution \(P\) differs from a reference distribution \(Q\). We are interested in formulating \(L_{v l b}\) in terms of \(\mathrm{KL}\) divergences because the transition distributions in our Markov chain are Gaussians, and the KL divergence between Gaussians has a closed form.
What is the KL Divergence?

The mathematical form of the \(\mathrm{KL}\) divergence for continuous distributions is,
\[D_{\mathrm{KL}}(P \| Q)=\int_{-\infty}^{\infty} p(x) \log \left(\frac{p(x)}{q(x)}\right) d x\] Note that the double bars in the above equation indicate that the function is not symmetric with respect to its arguments.

Below you can see the \(K L\) divergence of a varying distribution \(P\) (blue) from a reference distribution \(Q\) (red). The green curve indicates the function within the integral in the definition for the \(\mathrm{KL}\) divergence above, and the total area under the curve represents the value of the KL divergence of \(P\) from \(Q\) at any given moment, a value which is also displayed numerically.
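The closed form mentioned above can be checked numerically. The sketch below compares the standard closed-form KL divergence between two univariate Gaussians with a midpoint-rule approximation of the integral definition, and evaluates both argument orders to show the asymmetry. The particular means and standard deviations, the integration range, and all function names are illustrative assumptions.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def kl_gaussians(mean_p, std_p, mean_q, std_q):
    """Closed-form D_KL(P || Q) for univariate Gaussians P and Q."""
    return (
        math.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )


def kl_numeric(mean_p, std_p, mean_q, std_q, lo=-30.0, hi=30.0, n=60000):
    """Midpoint-rule estimate of the integral of p(x) * log(p(x) / q(x))."""
    width = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * width
        log_p = gaussian_log_pdf(x, mean_p, std_p)
        log_q = gaussian_log_pdf(x, mean_q, std_q)
        total += math.exp(log_p) * (log_p - log_q) * width
    return total


# P = N(0, 1) and Q = N(1, 2^2) are arbitrary example distributions.
kl_pq = kl_gaussians(0.0, 1.0, 1.0, 2.0)
kl_qp = kl_gaussians(1.0, 2.0, 0.0, 1.0)
print(f"D_KL(P || Q) = {kl_pq:.4f}, numeric = {kl_numeric(0.0, 1.0, 1.0, 2.0):.4f}")
print(f"D_KL(Q || P) = {kl_qp:.4f}, numeric = {kl_numeric(1.0, 2.0, 0.0, 1.0):.4f}")
```

The two directions give clearly different values (roughly 0.44 versus 1.31 for this pair), and in each case the closed form agrees with the numerical integral, so no sampling-based estimate is needed.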
Casting \(L_{v l b}\) in Terms of KL Divergences

As mentioned previously, it is possible [1] to rewrite \(L_{v l b}\) almost completely in terms of KL divergences:
\[L_{v l b}=L_{0}+L_{1}+\ldots+L_{T-1}+L_{T}\] where, \(\begin{gathered} L_{0}=-\log p_{\theta}\left(x_{0} \mid x_{1}\right) \\ L_{t-1}=D_{K L}\left(q\left(x_{t-1} \mid x_{t}, x_{0}\right) \| p_{\theta}\left(x_{t-1} \mid x_{t}\right)\right) \\ L_{T}=D_{K L}\left(q\left(x_{T} \mid x_{0}\right) \| p\left(x_{T}\right)\right) \end{gathered}\)

Conditioning the forward process posterior on \(x_{0}\) in \(L_{t-1}\) results in a tractable form that leads to all KL divergences being comparisons between Gaussians. This means that the divergences can be exactly calculated with closed-form expressions rather than with Monte Carlo estimates.
Model Choices
 With the mathematical foundation for our objective function established, we now need to make several choices regarding how our diffusion model will be implemented. For the forward process, the only choice required is defining the variance schedule, the values of which are generally increasing during the forward process.
 For the reverse process, we must choose the Gaussian distribution parameterization / model architecture(s). Note the high degree of flexibility that Diffusion models afford: the only requirement on our architecture is that its input and output have the same dimensionality. We will explore these choices in more detail below.
Forward Process and \(L_{T}\)
 As noted above, regarding the forward process, we must define the variance schedule. In particular, we set them to be time-dependent constants, ignoring the fact that they can be learned. For example, per Denoising Diffusion Probabilistic Models, a linear schedule from \(\beta_{1}=10^{-4}\) to \(\beta_{T}=0.02\) might be used, or perhaps a geometric series. Regardless of the particular values chosen, the fact that the variance schedule is fixed results in \(L_{T}\) becoming a constant with respect to our set of learnable parameters, allowing us to ignore it as far as training is concerned.
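As a rough numerical illustration of why \(L_{T}\) can be ignored, the sketch below evaluates it for a scalar datum. It relies on the closed-form marginal of the forward chain, \(q(x_T \mid x_0)=\mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0,\ 1-\bar{\alpha}_T)\) with \(\bar{\alpha}_T=\prod_{t}(1-\beta_t)\), and on the closed-form KL divergence between Gaussians. The linear schedule values and the names `alpha_bar_final` and `prior_matching_term` are assumptions for illustration.

```python
import math


def alpha_bar_final(betas):
    """Cumulative product of (1 - beta_t) over the whole schedule."""
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
    return product


def prior_matching_term(x0, alpha_bar_T):
    """D_KL(q(x_T | x_0) || N(0, 1)) for a scalar x_0.

    Uses q(x_T | x_0) = N(sqrt(alpha_bar_T) * x_0, 1 - alpha_bar_T).
    Only the fixed schedule and the data appear; no learnable parameter does.
    """
    var_q = 1.0 - alpha_bar_T
    mean_q_sq = alpha_bar_T * x0 * x0
    return 0.5 * (var_q + mean_q_sq - 1.0 - math.log(var_q))


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
alpha_bar_T = alpha_bar_final(betas)
for x0 in (-1.0, 0.0, 1.0):
    print(f"x0 = {x0:+.1f}  L_T = {prior_matching_term(x0, alpha_bar_T):.2e}")
```

With this assumed schedule the term is tiny for data scaled to \([-1, 1]\), and since nothing in it depends on \(\theta\), its gradient with respect to the model parameters is zero.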
Reverse Process and \(L_{1: T1}\)
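 Now we discuss the choices required in defining the reverse process. Recall from above we defined the reverse Markov transitions as a Gaussian:
\[p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \boldsymbol{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right)\]
 We must now define the functional forms of \(\boldsymbol{\mu}_{\theta}\) and \(\boldsymbol{\Sigma}_{\theta}\). While there are more complicated ways to parameterize \(\boldsymbol{\Sigma}_{\theta}\), we simply set,
\[\boldsymbol{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)=\sigma_{t}^{2} \mathbf{I}, \quad \sigma_{t}^{2}=\beta_{t}\]
 That is, we assume that the multivariate Gaussian is a product of independent Gaussians with identical variance, a variance value which can change with time. We set these variances to be equivalent to our forward process variance schedule.

Given this new formulation of \(\boldsymbol{\Sigma}_{\theta}\), we have
\[p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \boldsymbol{\Sigma}_{\theta}\left(\mathbf{x}_{t}, t\right)\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_{\theta}\left(\mathbf{x}_{t}, t\right), \sigma_{t}^{2} \mathbf{I}\right)\] which allows us to transform,
\[L_{t-1}=D_{K L}\left(q\left(x_{t-1} \mid x_{t}, x_{0}\right) \| p_{\theta}\left(x_{t-1} \mid x_{t}\right)\right)\]
 to,
\[L_{t-1} \propto\left\|\tilde{\mu}_{t}\left(x_{t}, x_{0}\right)-\mu_{\theta}\left(x_{t}, t\right)\right\|^{2}\]
 where the first term in the difference is a linear combination of \(x_{t}\) and \(x_{0}\) that depends on the variance schedule \(\beta_{t}\). The exact form of this function is not relevant for our purposes, but it can be found in Denoising Diffusion Probabilistic Models. The significance of the above proportion is that the most straightforward parameterization of \(\mu_{\theta}\) simply predicts the diffusion posterior mean. Importantly, the authors of Denoising Diffusion Probabilistic Models actually found that training \(\mu_{\theta}\) to predict the noise component at any given timestep yields better results. In particular, let
\[\mu_{\theta}\left(x_{t}, t\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}} \epsilon_{\theta}\left(x_{t}, t\right)\right)\]
 where, \(\alpha_{t}:=1-\beta_{t} \quad\) and \(\quad \bar{\alpha}_{t}:=\prod_{s=1}^{t} \alpha_{s}\).
 This leads to the following alternative loss function, which the authors of Denoising Diffusion Probabilistic Models found to lead to more stable training and better results:
\[L_{\text {simple }}(\theta):=\mathbb{E}_{t, x_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{\alpha}_{t}} x_{0}+\sqrt{1-\bar{\alpha}_{t}} \epsilon, t\right)\right\|^{2}\right]\]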
 The authors of Denoising Diffusion Probabilistic Models also note connections of this formulation of Diffusion models to score-matching generative models based on Langevin dynamics. Indeed, it appears that Diffusion models and Score-Based models may be two sides of the same coin, akin to the independent and concurrent development of wave-based quantum mechanics and matrix-based quantum mechanics revealing two equivalent formulations of the same phenomena.
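The link between predicting the posterior mean and predicting the noise can be checked numerically. The sketch below, for scalar data, uses the forward-process posterior mean as given in Denoising Diffusion Probabilistic Models, \(\tilde{\mu}_t(x_t, x_0)=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0+\frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t\), and compares it with the noise-based form \(\frac{1}{\sqrt{\alpha_t}}\big(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon\big)\) when \(x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon\). The schedule values and all function names are assumptions for illustration.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas and cumulative alpha products; index t refers to step t."""
    betas = [0.0] + [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars = [1.0]  # alpha_bar_0 = 1 by convention
    for t in range(1, num_steps + 1):
        alpha_bars.append(alpha_bars[-1] * (1.0 - betas[t]))
    return betas, alpha_bars


def posterior_mean(x_t, x0, t, betas, alpha_bars):
    """Mean of q(x_{t-1} | x_t, x_0): a linear combination of x_0 and x_t."""
    alpha_t = 1.0 - betas[t]
    coeff_x0 = math.sqrt(alpha_bars[t - 1]) * betas[t] / (1.0 - alpha_bars[t])
    coeff_xt = math.sqrt(alpha_t) * (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t])
    return coeff_x0 * x0 + coeff_xt * x_t


def mean_from_noise(x_t, eps, t, betas, alpha_bars):
    """The same mean written in terms of the noise eps that produced x_t."""
    alpha_t = 1.0 - betas[t]
    return (x_t - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alpha_t)


rng = random.Random(0)
betas, alpha_bars = make_schedule(1000)
max_gap = 0.0
for t in (1, 10, 500, 1000):
    x0, eps = rng.uniform(-1.0, 1.0), rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    gap = abs(posterior_mean(x_t, x0, t, betas, alpha_bars)
              - mean_from_noise(x_t, eps, t, betas, alpha_bars))
    max_gap = max(max_gap, gap)
print(f"largest gap between the two forms: {max_gap:.2e}")
```

The two expressions agree up to floating-point error, which is why a network that outputs the noise \(\epsilon\) carries the same information as one that outputs the posterior mean directly.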
Network Architecture
 While our simplified loss function seeks to train a model \(\epsilon_{\theta}\), we have still not yet defined the architecture of this model. Note that the only requirement for the model is that its input and output dimensionality are identical.
 Given this restriction, it is perhaps unsurprising that image Diffusion models are commonly implemented with UNet-like architectures.
Reverse Process Decoder and \(L_{0}\)
 The path along the reverse process consists of many transformations under continuous conditional Gaussian distributions. At the end of the reverse process, recall that we are trying to produce an image, which is composed of integer pixel values. Therefore, we must devise a way to obtain discrete (log) likelihoods for each possible pixel value across all pixels.

The way that this is done is by setting the last transition in the reverse diffusion chain to an independent discrete decoder. To determine the likelihood of a given image \(x_{0}\) given \(x_{1}\), we first impose independence between the data dimensions:
\[p_{\theta}\left(x_{0} \mid x_{1}\right)=\prod_{i=1}^{D} p_{\theta}\left(x_{0}^{i} \mid x_{1}^{i}\right)\] where \(D\) is the dimensionality of the data and the superscript \(i\) indicates the extraction of one coordinate. The goal now is to determine how likely each integer value is for a given pixel given the distribution across possible values for the corresponding pixel in the slightly noised image at time \(t=1\):
\[\mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right)\]
 where the pixel distributions for \(t=1\) are derived from the below multivariate Gaussian whose diagonal covariance matrix allows us to split the distribution into a product of univariate Gaussians, one for each dimension of the data:
\[\mathcal{N}\left(x ; \mu_{\theta}\left(x_{1}, 1\right), \sigma_{1}^{2} \mathbf{I}\right)=\prod_{i=1}^{D} \mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right)\]

We assume that the images consist of integers in \(\{0,1, \ldots, 255\}\) (as standard RGB images do) which have been scaled linearly to \([-1,1]\). We then break down the real line into small “buckets”, where, for a given scaled pixel value \(x\), the bucket for that range is \([x-1 / 255, x+1 / 255]\). The probability of a pixel value \(x\), given the univariate Gaussian distribution of the corresponding pixel in \(x_{1}\), is the area under that univariate Gaussian distribution within the bucket centered at \(x\).
 Below you can see the area for each of these buckets with their probabilities for a mean-0 Gaussian which, in this context, corresponds to a distribution with an average pixel value of \(\frac{255}{2}\) (half brightness). The red curve represents the distribution of a specific pixel in the \(t=1\) image, and the areas give the probability of the corresponding pixel value in the \(t=0\) image.

Technical Note: The first and final buckets extend out to -inf and +inf to preserve total probability.

Given a \(t=0\) pixel value for each pixel, the value of \(p_{\theta}\left(x_{0} \mid x_{1}\right)\) is simply the product of the per-pixel probabilities. This process is succinctly encapsulated by the following equation:
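\[p_{\theta}\left(x_{0} \mid x_{1}\right)=\prod_{i=1}^{D} p_{\theta}\left(x_{0}^{i} \mid x_{1}^{i}\right)=\prod_{i=1}^{D} \int_{\delta_{-}\left(x_{0}^{i}\right)}^{\delta_{+}\left(x_{0}^{i}\right)} \mathcal{N}\left(x ; \mu_{\theta}^{i}\left(x_{1}, 1\right), \sigma_{1}^{2}\right) d x\] where,
\[\delta_{-}(x)= \begin{cases}-\infty & x=-1 \\ x-\frac{1}{255} & x>-1\end{cases}\]
 and
\[\delta_{+}(x)= \begin{cases}\infty & x=1 \\ x+\frac{1}{255} & x<1\end{cases}\]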

Given this equation for \(p_{\theta}\left(x_{0} \mid x_{1}\right)\), we can calculate the final term of \(L_{v l b}\) which is not formulated as a \(\mathrm{KL}\) Divergence:
\[L_{0}=-\log p_{\theta}\left(x_{0} \mid x_{1}\right)\]
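The bucket construction above can be sketched for a single pixel as follows. The Gaussian mass in each bucket is computed from the normal CDF, with the first and last buckets extended to infinity. The example mean and standard deviation, the small clamp before taking the logarithm, and the function names are assumptions for illustration.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def pixel_log_likelihood(pixel, mean, std):
    """Log-probability of an integer pixel in {0, ..., 255} under N(mean, std^2).

    The pixel is rescaled to [-1, 1] and assigned the Gaussian mass of the
    bucket [x - 1/255, x + 1/255]; the edge buckets extend to -inf and +inf.
    """
    x = pixel / 127.5 - 1.0
    upper = 1.0 if pixel == 255 else normal_cdf(x + 1.0 / 255.0, mean, std)
    lower = 0.0 if pixel == 0 else normal_cdf(x - 1.0 / 255.0, mean, std)
    return math.log(max(upper - lower, 1e-12))  # clamp to avoid log(0)


mean, std = 0.2, 0.1  # assumed mu_theta(x_1, 1) and sigma_1 for one pixel
probs = [math.exp(pixel_log_likelihood(v, mean, std)) for v in range(256)]
best_pixel = max(range(256), key=probs.__getitem__)
print(f"total probability: {sum(probs):.6f}")
print(f"most likely pixel: {best_pixel}")
```

Summing these per-pixel log-probabilities over all \(D\) coordinates and negating the result gives \(L_{0}\) for one image. Because the buckets tile the real line, the 256 probabilities sum to one up to the clamp.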
Final Objective
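 As mentioned in the last section, the authors of Denoising Diffusion Probabilistic Models found that predicting the noise component of an image at a given timestep produced the best results. Ultimately, they use the following objective:
\[L_{\text {simple }}(\theta):=\mathbb{E}_{t, x_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{\alpha}_{t}} x_{0}+\sqrt{1-\bar{\alpha}_{t}} \epsilon, t\right)\right\|^{2}\right]\]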
 The training and sampling algorithms for our diffusion model therefore can be succinctly captured in the below table (from source):
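As a rough, self-contained illustration of the two algorithms, the sketch below implements one Monte Carlo sample of the simplified training loss and the ancestral sampling loop for scalar data, with \(\sigma_t^2=\beta_t\). A neural network is deliberately left out: the hypothetical `oracle_eps` function is the exact noise predictor for a toy dataset containing a single value, and it stands in for a trained \(\epsilon_\theta\). The schedule values and every name in the code are assumptions.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas and cumulative alpha products; index t refers to step t."""
    betas = [0.0] + [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars = [1.0]
    for t in range(1, num_steps + 1):
        alpha_bars.append(alpha_bars[-1] * (1.0 - betas[t]))
    return betas, alpha_bars


def training_loss(eps_model, x0, betas, alpha_bars, rng):
    """One Monte Carlo sample of the simplified objective for a scalar x_0."""
    num_steps = len(betas) - 1
    t = rng.randint(1, num_steps)  # t ~ Uniform{1, ..., T}
    eps = rng.gauss(0.0, 1.0)      # eps ~ N(0, 1)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return (eps - eps_model(x_t, t)) ** 2


def sample(eps_model, betas, alpha_bars, rng):
    """Ancestral sampling from x_T ~ N(0, 1) down to x_0."""
    num_steps = len(betas) - 1
    x = rng.gauss(0.0, 1.0)
    for t in range(num_steps, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        noise_scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - noise_scale * eps_model(x, t)) / math.sqrt(1.0 - betas[t])
        x = mean + math.sqrt(betas[t]) * z
    return x


DATA_POINT = 0.5  # toy "dataset" containing a single scalar value
betas, alpha_bars = make_schedule(1000)


def oracle_eps(x_t, t):
    """Exact noise predictor for the one-point dataset (stand-in for a network)."""
    return (x_t - math.sqrt(alpha_bars[t]) * DATA_POINT) / math.sqrt(1.0 - alpha_bars[t])


rng = random.Random(0)
loss = training_loss(oracle_eps, DATA_POINT, betas, alpha_bars, rng)
generated = sample(oracle_eps, betas, alpha_bars, rng)
print(f"loss = {loss:.2e}, generated sample = {generated:.6f}")
```

With a perfect noise predictor the loss is zero up to rounding and the sampler returns the single training value. In a real setting `oracle_eps` would be replaced by a trained network and the loss would be averaged over minibatches and minimized by gradient descent.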
Diffusion Model Theory Summary

In this section we took a detailed dive into the theory of Diffusion models. It can be easy to get caught up in mathematical details, so we note the most important points within this section below in order to keep ourselves oriented from a bird's-eye perspective:
 Our diffusion model is parameterized as a Markov chain, meaning that our latent variables \(x_{1}, \ldots, x_{T}\) depend only on the previous (or following) timestep.
 The transition distributions in the Markov chain are Gaussian, where the forward process requires a variance schedule, and the reverse process parameters are learned.
 The diffusion process ensures that \(x_{T}\) is asymptotically distributed as an isotropic Gaussian for sufficiently large \(T\).
 In our case, the variance schedule was fixed, but it can be learned as well. For fixed schedules, following a geometric progression may afford better results than a linear progression. In either case, the variances are generally increasing with time in the series (i.e. \(\beta_{i}<\beta_{j}\) for \(i<j\) ).
 Diffusion models are highly flexible and allow for any architecture whose input and output dimensionality are the same to be used. Many implementations use UNet-like architectures.
 The training objective is to maximize the likelihood of the training data. This is manifested as tuning the model parameters to minimize the variational upper bound of the negative log likelihood of the data.
 Almost all terms in the objective function can be cast as KL Divergences as a result of our Markov assumption. These values become tractable to calculate given that we are using Gaussians, therefore removing the need to perform Monte Carlo approximation.
 Ultimately, using a simplified training objective to train a function which predicts the noise component of a given latent variable yields the best and most stable results.
 A discrete decoder is used to obtain log likelihoods across pixel values as the last step in the reverse diffusion process.

With this high-level overview of Diffusion models in our minds, let’s move on to see how to use a diffusion model in PyTorch.
Diffusion models in PyTorch
 While Diffusion models have not yet been democratized to the same degree as other older architectures/approaches in Machine Learning, there are still implementations available for use. The easiest way to use a diffusion model in PyTorch is to use the denoising-diffusion-pytorch package, which implements an image diffusion model like the one discussed in this article. To install the package, simply type the following command in the terminal:
pip install denoising_diffusion_pytorch
Minimal Example
 To train a model and generate images, we first import the necessary packages:
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion
 Next, we define our network architecture, in this case a UNet. The dim parameter specifies the number of feature maps before the first downsampling, and the dim_mults parameter provides multiplicands for this value and successive downsamplings:
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)
 Now that our network architecture is defined, we need to define the diffusion model itself. We pass in the UNet model that we just defined along with several parameters: the size of images to generate, the number of timesteps in the diffusion process, and a choice between the L1 and L2 norms.
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)
 Now that the diffusion model is defined, it’s time to train. We generate random data to train on, and then train the diffusion model in the usual fashion:
training_images = torch.randn(8, 3, 128, 128)
loss = diffusion(training_images)
loss.backward()
 Once the model is trained, we can finally generate images by using the sample() method of the diffusion object. Here we generate 4 images, which are only noise given that our training data was random:
sampled_images = diffusion.sample(batch_size = 4)
Training on Custom Data
 The denoising-diffusion-pytorch package also allows you to train a diffusion model on a specific dataset. Simply replace the 'path/to/your/images' string with the dataset directory path in the Trainer() object below, and change image_size to the appropriate value. After that, simply run the code to train the model, and then sample as before. Note that PyTorch must be compiled with CUDA enabled in order to use the Trainer class:
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()
trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,                # learning rate
    train_num_steps = 700000,       # total training steps
    gradient_accumulate_every = 2,  # gradient accumulation steps
    ema_decay = 0.995,              # exponential moving average decay
    amp = True                      # turn on mixed precision
)
trainer.train()
 Below you can see progressive denoising from multivariate Gaussian noise to MNIST digits akin to reverse diffusion:
HuggingFace Diffusers
 HuggingFace Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models.
 More precisely, HuggingFace Diffusers offers:
 State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code.
 Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference.
 Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system.
 Training examples to show how to train the most popular diffusion models.
Final Words
 Diffusion models are a conceptually simple and elegant approach to the problem of generating data. Their state-of-the-art results combined with non-adversarial training have propelled them to great heights, and further improvements can be expected in the coming years given their nascent status.
 In particular, Diffusion models have been found to be essential to the performance of cutting-edge models like DALL-E 2.
References
 Diffusion models Vs GANs: Which one to choose for Image Synthesis
 Diffusion models
 Introduction to Diffusion Models for Machine Learning
 What are Diffusion Models?
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledDiffusionModels,
title = {Diffusion Models},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}