Primers • Diffusion Models
- Background
- Overview
- Introduction
- Advantages
- Definitions
- Diffusion Models: The Theory
- Diffusion Models as Latent-Variable Generative Models
- Markovian Structure and Tractability
- Fixed Forward Process and Learned Reverse Process
- Likelihood-Based Training via Variational Inference
- Noise Prediction as a Canonical Parameterization
- Connection to Continuous-Time and Score-Based Models
- Discrete Data and Final Decoding
- Takeaways
- Diffusion Models: A Deep Dive
- Taxonomy of Diffusion Models
- Pixel-Space Diffusion Models
- Latent-Space Diffusion Models
- Continuous-Time Diffusion Models (Representation-Agnostic)
- Stochastic Differential Equation (SDE)-Based Diffusion Models
- Score-Based Generative Modeling (SGMs)
- Reverse-Time SDE and Sampling
- Sampling via Langevin Dynamics (Discrete Approximation)
- Probability Flow ODE (Deterministic Sampling)
- Flow Matching Models (Deterministic Continuous-Time Generative Models)
- Pros
- Cons
- Comparative Analysis
- Training
- Model Choices
- Network Architecture: U-Net and Diffusion Transformer (DiT)
- Comparison Between U-Net and Diffusion Transformer (DiT) Architectures
- Reverse Process of U-Net-Based Diffusion Models
- Reverse Process of DiT-Based Diffusion Models
- Final Objective
- Conditional Diffusion Models
- Classifier-Free Guidance
- Prompting Guidance
- Diffusion Models in PyTorch
- HuggingFace Diffusers
- Implementations
- Gallery
- FAQs
- At a high level, how do diffusion models work? What are some other models that are useful for image generation, and how do they compare to diffusion models?
- What is the difference between DDPM and DDIMs models?
- In diffusion models, there is a forward diffusion process and a reverse diffusion/denoising process. When do you use which during training and inference?
- What are the loss functions used in Diffusion Models?
- Integration with MSE
- What is the Denoising Score Matching Loss in Diffusion models? Provide equation and intuition.
- What does the “stable” in stable diffusion refer to?
- How do you condition a diffusion model to the textual input prompt?
- In the context of diffusion models, what role does cross attention play? How are the \(Q\), \(K\), and \(V\) abstractions modeled for diffusion models?
- How is randomness in the outputs induced in a diffusion model?
- How does the noise schedule work in diffusion models? What are some standard noise schedules?
- Choosing a Noise Schedule
- Recent Papers
- High-Resolution Image Synthesis with Latent Diffusion Models
- Diffusion Model Alignment Using Direct Preference Optimization
- Scalable Diffusion Models with Transformers
- DeepFloyd IF
- PIXART-\(\alpha\): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
- RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
- ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
- Imagen Video: High Definition Video Generation with Diffusion Models
- Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Dreamix: Video Diffusion Models are General Video Editors
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- Fine-tuning Diffusion Models
- Diffusion Model Alignment
- Further Reading
- The Illustrated Stable Diffusion
- Understanding Diffusion Models: A Unified Perspective
- The Annotated Diffusion Model
- Lilian Weng: What are Diffusion Models?
- Stable Diffusion - What, Why, How?
- How does Stable Diffusion work? – Latent Diffusion Models Explained
- Diffusion Explainer
- Jupyter notebook on the theoretical and implementation aspects of Score-based Generative Models (SGMs)
- References
- Citation
Background
-
Generative modeling is a central problem in machine learning, concerned with learning a probability distribution \(p(x)\) from which new, realistic data samples can be drawn. Over the past decade, three families of generative models have dominated the literature: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and normalizing flow–based models. Each of these paradigms offers distinct advantages while also exhibiting fundamental limitations.
-
GANs, introduced in Generative Adversarial Nets by Goodfellow et al. (2014), rely on an adversarial training framework in which a generator and discriminator compete in a minimax game. While GANs are capable of producing highly sharp and realistic samples, their training dynamics are notoriously unstable and sensitive to hyperparameters, often suffering from mode collapse and lack of diversity On the Convergence and Stability of GANs by Mescheder et al. (2018). Moreover, GANs do not provide an explicit likelihood, complicating principled evaluation and comparison.
-
VAEs, introduced in Auto-Encoding Variational Bayes by Kingma and Welling (2013), take a probabilistic approach by optimizing a variational lower bound on the data likelihood. VAEs are stable to train and provide an explicit generative density, but the reliance on surrogate objectives—such as Gaussian likelihood assumptions and KL regularization—often leads to overly smooth or blurry samples, particularly in image generation tasks.
-
Flow-based models, such as those introduced in Density Estimation using Real NVP by Dinh et al. (2016) and Glow: Generative Flow with Invertible 1×1 Convolutions by Kingma and Dhariwal (2018), address likelihood estimation directly via exact change-of-variables formulas. However, they require carefully designed invertible architectures with tractable Jacobians, which significantly constrains model design and increases implementation complexity.
Emergence of Diffusion Models
-
Diffusion models present a compelling alternative to these earlier generative paradigms. Inspired by ideas from non-equilibrium thermodynamics, diffusion models define a stochastic process that incrementally transforms data into noise through a fixed forward process, and then learn to reverse this process in order to generate samples. Unlike GANs, diffusion models do not rely on adversarial training, and unlike VAEs and flow-based models, they do not require restrictive architectural constraints or surrogate likelihood assumptions.
-
The foundational idea of diffusion-based generative modeling was introduced in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015). In this work, the authors proposed modeling data generation as the reversal of a gradual diffusion process, framing learning as approximating the reverse-time dynamics of a Markov chain that incrementally adds Gaussian noise.
-
Subsequent breakthroughs significantly improved the scalability and practicality of diffusion models. In Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019), the authors introduced score-based generative modeling, showing that learning the gradient of the log-density of noisy data distributions suffices for generation. Shortly thereafter, Denoising Diffusion Probabilistic Models by Ho et al. (2020) reformulated diffusion models using a simplified and highly stable training objective based on noise prediction, dramatically improving sample quality and ease of implementation.
-
A concise visual overview situating diffusion models among other generative approaches is provided in the diagram below from Lilian Weng’s blog post “What are Diffusion Models?”:

Impact and Practical Adoption
-
Diffusion models are conceptually simple yet remarkably powerful. Their training procedure is stable, does not require adversarial objectives, and scales effectively with model size and data. As a result, diffusion models have rapidly become the dominant paradigm for high-fidelity generative modeling across multiple modalities, including images, audio, and video.
-
They form the core of several landmark systems for conditional and unconditional generation. Notable examples include GLIDE GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models by Nichol et al. (2021), DALL·E 2 Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. (2022), Latent Diffusion Models High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022), Imagen Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding by Saharia et al. (2022), and Stable Diffusion, developed by Stability AI.
-
These systems demonstrate that diffusion models are capable of matching or surpassing prior state-of-the-art generative approaches in both sample quality and controllability, while maintaining a principled probabilistic foundation.
Outlook
- As a relatively recent paradigm, diffusion models remain an active area of research, with ongoing work exploring faster sampling methods, improved conditioning mechanisms, continuous-time formulations, and tighter theoretical guarantees. Their rapid adoption in both academic research and industrial-scale systems underscores their growing importance as a unifying framework for modern generative modeling.
Overview
-
The rapid ascent of diffusion models represents one of the most significant developments in generative modeling over the past several years. Beginning as a theoretically motivated alternative to adversarial and variational approaches, diffusion models have evolved into a highly practical and dominant paradigm for high-fidelity data generation across a wide range of modalities.
-
Diffusion models are a class of likelihood-based generative models that construct complex data distributions through a sequence of simple stochastic transformations. Rather than generating samples in a single step, diffusion models define a multi-step process in which data are gradually corrupted by noise and then reconstructed by reversing this corruption. This incremental formulation enables diffusion models to decompose a challenging global modeling problem into a series of tractable local denoising tasks.
-
A sequence of influential papers in the early 2020s demonstrated that diffusion models can not only rival but often outperform GANs in terms of sample quality, stability, and coverage of the data distribution. For instance, Diffusion Models Beat GANs on Image Synthesis by Dhariwal and Nichol (2021) showed that diffusion models achieve superior Fréchet Inception Distance (FID) scores compared to state-of-the-art GANs on large-scale image benchmarks, while also exhibiting more stable training dynamics.
-
More recently, diffusion models have formed the backbone of several widely publicized generative systems. A notable example is DALL·E 2, introduced in Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. (2022), which uses diffusion in a learned latent space to generate photorealistic images conditioned on natural language prompts. A high-level explanation of this system can be found in the blog post How DALL·E 2 Actually Works.

-
The success of these systems has sparked widespread interest among machine learning practitioners and researchers alike. Diffusion models are now routinely applied to problems in image synthesis, audio generation, video modeling, super-resolution, inpainting, and multimodal generation, often serving as the core generative component in large, modular pipelines.
-
From a conceptual standpoint, diffusion models are appealing because they combine several desirable properties:
- they are trained using simple regression-style objectives rather than adversarial losses,
- they admit clear probabilistic interpretations,
- and they scale reliably with model capacity and dataset size.
-
At the same time, diffusion models are flexible enough to incorporate a wide range of architectural choices, conditioning mechanisms, and sampling strategies. Modern implementations commonly integrate convolutional neural networks, attention mechanisms, transformers, and learned latent representations, while still adhering to the same underlying diffusion framework.
-
In this primer, we aim to demystify diffusion models by examining both their theoretical foundations and their practical implementation details. We begin by introducing the core principles that govern diffusion-based generative modeling, followed by a detailed exploration of the mathematical structure underlying diffusion processes. Building on this foundation, we then examine how diffusion models are instantiated in practice, including architectural design choices, training objectives, and sampling algorithms. Finally, we demonstrate how diffusion models can be implemented in PyTorch to generate images, providing concrete intuition for how these models operate end-to-end.
Introduction
- Diffusion probabilistic models—commonly referred to as diffusion models—are a class of generative models designed to learn complex data distributions and generate new samples that resemble the training data. As generative models, their purpose is fundamentally different from that of discriminative models: rather than predicting labels or making decisions about inputs, diffusion models aim to synthesize new data that are statistically similar to observed examples. For instance, a diffusion model trained on a dataset of animal images can generate novel images that appear to depict realistic animals, whereas a discriminative model would be tasked with classifying an image as containing a cat or a dog.

-
At a high level, diffusion models operate by defining two complementary stochastic processes:
- A forward diffusion process, which progressively corrupts data by adding Gaussian noise, and
- A reverse denoising process, which is learned by a neural network and gradually removes noise in order to reconstruct data samples.
-
This formulation casts generative modeling as the problem of learning to invert a simple, fixed noising process. Starting from pure noise, the learned reverse process iteratively transforms noise into structured data. This perspective was formalized in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015) and later refined and popularized through more practical formulations.
-
Conceptually, diffusion models can be understood as parameterized Markov chains trained to denoise data one step at a time. After training, generation proceeds by sampling an initial noise vector—often referred to as a latent tensor—and repeatedly applying the learned denoising transitions until a data sample is obtained. In this sense, diffusion models transform an unstructured collection of random numbers into a coherent output, such as an image, through a long sequence of small, incremental refinements.
-
Diffusion models are also closely related to several existing ideas in the generative modeling literature. They can be viewed as a form of latent variable model, in which the latent variables \(x_1, \ldots, x_T\) have the same dimensionality as the observed data \(x_0\). They share conceptual similarities with denoising autoencoders, which learn to reconstruct clean data from corrupted inputs, as discussed in A Connection Between Score Matching and Denoising Autoencoders by Vincent (2011). In addition, diffusion models are tightly connected to score-based generative modeling, where the goal is to estimate gradients of log-density functions rather than explicit likelihoods Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019).
-
The overall generation process is illustrated in the diagram below, adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020). The model begins from random noise and progressively refines the sample through repeated denoising steps.

-
More formally, diffusion models define a latent-variable formulation in which a fixed Markov chain maps a data sample \(x_0\) to a sequence of progressively noisier variables \(x_1, \ldots, x_T\). The joint distribution of these variables under the forward process is denoted
\[q(x_{1:T} \mid x_0)\]- where each \(x_t\) has the same dimensionality as the original data. In the context of image generation, this corresponds to repeatedly adding small amounts of Gaussian noise to an image until the final variable \(x_T\) is approximately indistinguishable from isotropic Gaussian noise.
-
This forward diffusion process is visualized below, with each step incrementally destroying structure in the data (figure adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020)):

-
The objective of training a diffusion model is to learn the reverse process, denoted
\[p_\theta(x_{t-1} \mid x_t)\]- which approximates the inverse of each forward noising step. By traversing this learned reverse chain from \(t = T\) down to \(t = 0\), the model can transform pure noise into a structured data sample. This reverse-time generation process is illustrated below (figure adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020)):

-
Under the hood, diffusion models rely on the Markov property, meaning that each state in the diffusion chain depends only on the immediately preceding (or following) state. A Markov chain is a stochastic process in which future states are conditionally independent of past states given the present state. This property allows diffusion models to decompose generation into a sequence of local transitions, each of which is relatively simple to model.
-
Key takeaway: diffusion models are constructed by first specifying a simple and tractable procedure for gradually turning data into noise, and then training a neural network to invert this procedure step-by-step. Each denoising step removes a small amount of noise and restores a small amount of structure. When this process is repeated sufficiently many times—starting from pure noise—the result is a coherent data sample. This deceptively simple idea underlies the remarkable effectiveness of diffusion models in modern generative modeling.
Advantages
- Diffusion models offer a combination of theoretical elegance, empirical performance, and practical robustness that has driven their rapid adoption across modern generative modeling applications. Their advantages span sample quality, training stability, scalability, and flexibility, positioning them as a compelling alternative to earlier generative paradigms such as GANs, VAEs, and flow-based models.
High-Fidelity Sample Quality
-
One of the most striking advantages of diffusion models is their ability to produce state-of-the-art sample quality, particularly in high-resolution image generation. Empirical evaluations have shown that diffusion models consistently achieve lower Fréchet Inception Distance (FID) scores than competing GAN-based approaches on standard benchmarks.
-
This result was demonstrated explicitly in Diffusion Models Beat GANs on Image Synthesis by Dhariwal and Nichol (2021), where diffusion models surpassed BigGAN-style architectures in both image fidelity and diversity. The figure below illustrates the qualitative improvements achieved by diffusion-based generators (figure adapted from the same source):

- Unlike GANs, which often trade off diversity for sharpness due to adversarial dynamics, diffusion models naturally balance these objectives by modeling the full data distribution through a likelihood-based framework.
Stable and Non-Adversarial Training
-
Diffusion models avoid the instability inherent to adversarial training. GANs require carefully balanced updates between a generator and discriminator, and even minor imbalances can lead to divergence or mode collapse On the Convergence and Stability of GANs by Mescheder et al. (2018). In contrast, diffusion models are trained using simple regression-style objectives, most commonly mean squared error losses that predict injected Gaussian noise.
-
This non-adversarial setup results in:
- predictable optimization behavior,
- reduced sensitivity to hyperparameters,
- and reliable convergence across a wide range of datasets and model scales.
-
As shown in Denoising Diffusion Probabilistic Models by Ho et al. (2020), diffusion training objectives can be derived directly from variational likelihood bounds, providing both empirical stability and probabilistic justification.
Explicit Probabilistic Interpretation
-
Diffusion models admit a clear probabilistic interpretation grounded in latent-variable modeling and Markov chains. The forward diffusion process is fixed and analytically tractable, while the reverse process is learned to approximate the true reverse-time dynamics.
-
This structure enables:
- principled likelihood estimation (or lower bounds thereof),
- theoretical analysis using tools from stochastic processes,
- and direct connections to score matching and stochastic differential equations.
-
In particular, the unification of diffusion models with continuous-time stochastic processes in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021) provides a rigorous mathematical framework that explains why diffusion-based approaches are effective and how different sampling methods relate to one another.
Scalability and Parallelization
-
Diffusion models scale exceptionally well with both model capacity and dataset size. Training can be parallelized efficiently across large batches and distributed systems because each training example involves independent noise corruption and denoising prediction.
-
Moreover, architectural choices such as U-Nets with attention or transformer-based backbones can be incorporated without altering the core diffusion objective. This scalability has enabled diffusion models to serve as the backbone of large-scale systems such as:
- Imagen Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding by Saharia et al. (2022),
- DALL·E 2 Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. (2022),
- and Stable Diffusion developed by Stability AI.
Flexibility Across Modalities and Conditioning Schemes
-
Diffusion models are highly adaptable and have been successfully applied to a wide range of data modalities, including images, audio, video, 3D data, and multimodal settings. Conditioning mechanisms—such as class labels, text embeddings, semantic maps, or other structured inputs—can be integrated naturally via concatenation, cross-attention, or feature-wise modulation.
-
This flexibility is exemplified by models such as:
- GLIDE GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models by Nichol et al. (2021),
- and Latent Diffusion Models High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022).
-
Because the diffusion objective remains unchanged, these conditioning strategies can be added without destabilizing training.
Strong Theoretical Foundations with Room for Innovation
-
Despite their empirical success, diffusion models are not merely heuristic methods. Their grounding in stochastic processes, variational inference, and score matching provides a strong theoretical foundation. At the same time, many aspects of diffusion modeling—such as optimal noise schedules, sampling algorithms, and architectural inductive biases—remain active areas of research.
-
This combination of theoretical clarity and open research questions makes diffusion models both practically effective and scientifically rich, with significant potential for further innovation.
Definitions
- This section introduces the core concepts and components that recur throughout diffusion-based generative modeling. These definitions establish a shared vocabulary for understanding diffusion models at both a theoretical and practical level.
Diffusion Models
-
Diffusion models are neural generative models that learn to approximate the reverse of a stochastic diffusion process. Concretely, a diffusion model parameterizes conditional distributions of the form
\[p_\theta(x_{t-1} \mid x_t)\]- where \(x_t\) denotes a noisy version of the data at diffusion timestep \(t\), and \(\theta\) denotes the learnable parameters of the model. The objective is to iteratively transform noisy inputs into cleaner ones until a final sample \(x_0\) is obtained.
-
Diffusion models are typically trained end-to-end to denoise corrupted inputs and produce continuous outputs such as images, audio waveforms, or video frames. Unlike GANs, diffusion models do not rely on adversarial objectives, and unlike flow-based models, they do not require invertible architectures.
-
From an architectural perspective, diffusion models can employ any neural network whose input and output dimensionalities match. In practice, U-Net–style architectures dominate due to their ability to preserve spatial structure while integrating global context via skip connections and attention mechanisms. Variants include conditional U-Nets, 3D U-Nets for video or volumetric data, and transformer-based U-Nets for large-scale multimodal generation.
-
A canonical visualization of the diffusion process is shown below (figure adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020)):

Forward and Reverse Processes
-
Diffusion models consist of two coupled stochastic processes:
- a forward (diffusion) process, which progressively adds noise to data, and
- a reverse (denoising) process, which is learned by the model.
-
The forward process is fixed and typically defined as a Markov chain with Gaussian transitions. The reverse process is parameterized by a neural network and trained to approximate the true reverse-time dynamics. This formulation was introduced in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015) and refined in later works such as Denoising Diffusion Probabilistic Models by Ho et al. (2020).
Schedulers
-
Schedulers define how noise is added and removed over time during both training and inference. Formally, a scheduler specifies:
- the noise variance schedule \(\{\beta_t\}_{t=1}^T\) or its continuous-time analogue,
- how to compute intermediate quantities such as \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\),
- and how to map model predictions to the previous timestep during sampling.
-
Schedulers are algorithmic components rather than neural networks, and they play a critical role in controlling sample quality, stability, and efficiency.
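As a rough illustration of what a scheduler precomputes, the sketch below derives these quantities for the linear \(\beta_t\) schedule reported in Denoising Diffusion Probabilistic Models by Ho et al. (2020); the variable names are illustrative and not tied to any particular library.

```python
import torch

# Minimal sketch of the quantities a DDPM-style scheduler precomputes.
# The linear schedule endpoints (1e-4 to 0.02 over T = 1000 steps) follow the
# defaults in Ho et al. (2020); other schedules only change how `betas` is filled.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                    # beta_t, t = 1..T
alphas = 1.0 - betas                                     # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)                # alpha_bar_t = prod_{s<=t} alpha_s

# Coefficients used when sampling x_t directly from x_0:
sqrt_alpha_bars = alpha_bars.sqrt()                      # multiplies x_0
sqrt_one_minus_alpha_bars = (1.0 - alpha_bars).sqrt()    # multiplies the noise
```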
-
Prominent schedulers and samplers include:
- DDPM: introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020),
- DDIM: introduced in Denoising Diffusion Implicit Models by Song et al. (2020),
- PNDM: proposed in Pseudo Numerical Methods for Diffusion Models on Manifolds by Liu et al. (2022),
- DEIS: introduced in Fast Sampling of Diffusion Models with Discrete Exponential Integrators by Zhang and Chen (2022).
-
The figure below, adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020), illustrates the interaction between training and sampling algorithms governed by the scheduler:

Sampling and Training Pipelines
-
In practical systems, diffusion models are rarely deployed in isolation. Instead, they are embedded within end-to-end diffusion pipelines that combine multiple components, such as:
- diffusion models operating at different resolutions,
- text or class encoders for conditional generation,
- super-resolution or refinement models,
- and post-processing stages.
-
These pipelines orchestrate training and inference across multiple models and noise schedules to achieve high-quality generation at scale.
-
Well-known examples include:
- GLIDE GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models by Nichol et al. (2021),
- Latent Diffusion Models High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022),
- Imagen Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding by Saharia et al. (2022),
- DALL·E 2 Hierarchical Text-Conditional Image Generation with CLIP Latents by Ramesh et al. (2022).
-
A high-level overview of such a pipeline is shown below (figure adapted from Imagen):

Takeaways
- Diffusion models learn to reverse a fixed noising process via neural denoising.
- Schedulers control the temporal dynamics of noise injection and removal.
- Sampling pipelines integrate diffusion models with encoders, decoders, and conditioning mechanisms to enable large-scale generation.
- These definitions provide the conceptual building blocks required to understand the mathematical theory of diffusion models, which we examine next.
Diffusion Models: The Theory
- This section develops the theoretical foundations of diffusion models at a conceptual level. Rather than reproducing detailed derivations, we focus on how diffusion models are structured, why they are mathematically well-founded, and how their design choices connect probabilistic modeling, stochastic processes, and modern neural networks. Formal derivations and exact equations are deferred to The Math Under-the-Hood section.
Diffusion Models as Latent-Variable Generative Models
-
Diffusion models belong to the class of latent-variable generative models, meaning that observed data are assumed to arise from a sequence of unobserved random variables. Unlike traditional latent-variable models such as Variational Autoencoders (VAEs), diffusion models introduce a high-dimensional latent trajectory in which every latent variable has the same dimensionality as the observed data.
-
The theoretical motivation for this construction was first introduced in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015), which framed generation as the reversal of a gradual entropy-increasing process. This idea was later made computationally practical and scalable in Denoising Diffusion Probabilistic Models by Ho et al. (2020).
-
At a high level, diffusion models define:
- A forward process that gradually destroys information in the data by injecting noise.
- A reverse process that learns to recover data by removing noise step by step.
-
This framing transforms generation into a sequence of local denoising problems, each of which is significantly easier than modeling the full data distribution in a single step.
Markovian Structure and Tractability
-
A defining theoretical assumption of diffusion models is the Markov property: each latent variable depends only on its immediate predecessor (in the forward process) or successor (in the reverse process). This choice has several important consequences:
- It enables a clean factorization of joint probability distributions.
- It allows likelihood-based training using variational inference.
- It ensures that learning and sampling can be performed with bounded memory and computation per timestep.
-
The Markov structure is not merely a modeling convenience; it is essential for making diffusion models analytically tractable and numerically stable. By restricting dependencies to adjacent timesteps, diffusion models avoid the intractable posterior dependencies that often plague deep latent-variable models.
Fixed Forward Process and Learned Reverse Process
-
From a theoretical perspective, one of the most elegant aspects of diffusion models is the asymmetry between the forward and reverse processes:
- The forward diffusion process is fixed and known, chosen by the model designer.
- The reverse diffusion process is unknown and learned, parameterized by a neural network.
-
This asymmetry is crucial. Because the forward process is analytically defined, it induces a known family of corrupted data distributions. The learning problem then reduces to approximating how these corrupted distributions should be inverted.
-
This idea was central to the reformulation of diffusion models in Denoising Diffusion Probabilistic Models by Ho et al. (2020), which showed that learning the reverse process can be cast as a series of regression problems with known targets.
Likelihood-Based Training via Variational Inference
-
Diffusion models are explicit likelihood models. Unlike GANs, which rely on implicit distributions and adversarial training, diffusion models optimize a well-defined objective derived from probability theory.
-
Training proceeds by maximizing a variational lower bound (ELBO) on the data log-likelihood. Conceptually, this objective measures how well the learned reverse process approximates the true reverse of the forward diffusion.
-
The theoretical importance of this formulation is threefold:
- It provides a principled objective grounded in statistical inference.
- It ensures that diffusion models are comparable using likelihood-based metrics.
- It allows training to decompose into a sum of independent per-timestep terms.
-
The use of Gaussian distributions in both forward and reverse processes ensures that all divergence terms appearing in the ELBO are analytically tractable, avoiding the need for high-variance Monte Carlo estimators.
Noise Prediction as a Canonical Parameterization
-
A key theoretical and practical insight introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020) is that the reverse process need not be parameterized directly in terms of probability distributions.
-
Instead, the model can be trained to predict the noise component that was added during the forward process. This reparameterization has deep theoretical implications:
- It implicitly trains the model to estimate the score function of noisy data distributions.
- It connects diffusion models to denoising score matching, as leveraged for generative modeling in Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019).
- It yields a simple mean-squared-error objective that is stable across noise levels.
-
This perspective reveals that diffusion models and score-based generative models are not separate paradigms, but rather different parameterizations of the same underlying probabilistic structure.
Connection to Continuous-Time and Score-Based Models
-
Although this section focuses on discrete-time diffusion models, the theory naturally extends to continuous-time formulations. In particular:
- Discrete-time diffusion models can be viewed as numerical discretizations of continuous stochastic processes.
- Learning to predict noise is equivalent to learning the score of a time-dependent distribution.
- Sampling procedures correspond to solving stochastic or deterministic differential equations.
-
These connections were formalized in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021), which unified DDPMs, DDIMs, and score-based models under a single SDE-based framework.
Discrete Data and Final Decoding
-
A final theoretical consideration concerns the generation of discrete-valued data, such as pixel intensities. While diffusion operates in continuous space, the final output must correspond to discrete observations.
-
Diffusion models address this by defining an explicit likelihood for the final denoising step, typically using discretized continuous distributions. This ensures that diffusion models remain fully probabilistic and that likelihood evaluation is well-defined even for discrete domains.
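As a concrete instance, Denoising Diffusion Probabilistic Models by Ho et al. (2020) models the final decoding step for image data scaled to \([-1, 1]\) as an independent discretized Gaussian over the \(D\) pixel coordinates:
\[p_\theta(x_0 \mid x_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} N\left(x; \mu_\theta^i(x_1, 1), \sigma_1^2\right) dx\]- where \(\delta_+(x) = x + \tfrac{1}{255}\) (or \(\infty\) at the boundary \(x = 1\)) and \(\delta_-(x) = x - \tfrac{1}{255}\) (or \(-\infty\) at \(x = -1\)), so that each continuous Gaussian mean is integrated over the width of one quantization bin.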
Takeaways
-
At a theoretical level, diffusion models are characterized by:
- A fixed, analytically tractable forward corruption process.
- A learned reverse denoising process with local dependencies.
- A likelihood-based variational training objective.
- A noise-prediction parameterization linked to score matching.
- A natural extension to continuous-time stochastic processes.
-
These properties collectively explain why diffusion models combine strong theoretical guarantees with exceptional empirical robustness, providing a solid foundation for the architectural and algorithmic design choices explored in subsequent sections.
Diffusion Models: A Deep Dive
- This section connects the theoretical formulation of diffusion models to their concrete operational behavior. We unpack how the forward and reverse processes interact during training and sampling, providing intuition for why diffusion models are effective and how their components work together in practice.
Overview
-
Diffusion models are a form of latent variable model, in which observed data are associated with a sequence of latent states that progressively increase in noise. Latent variable models aim to describe complex data distributions by introducing hidden variables that capture underlying structure in a continuous space, as discussed in Latent Variable Models by The AI Summer.
-
In diffusion models, however, the latent variables \(x_1, \ldots, x_T\) are not lower-dimensional abstractions of the data. Instead, they share the same dimensionality as the observed variable \(x_0\) and represent progressively noisier versions of it. The latent space in diffusion models therefore corresponds to different noise levels rather than semantic compression.
-
Diffusion models consist of two tightly coupled processes:
- A forward diffusion process \(q\) that gradually adds noise to data.
- A reverse denoising process \(p_\theta\) that learns to remove this noise step-by-step.
-
The forward process is fixed by design, while the reverse process is parameterized by a neural network and learned from data.
-
The forward diffusion process is illustrated below, where structured data are gradually transformed into noise through a sequence of small Gaussian perturbations:

Forward Diffusion Process
-
- The forward process defines a Markov chain that incrementally corrupts a clean data sample \(x_0\). At each timestep \(t\), Gaussian noise is added according to a predefined variance schedule \(\{\beta_t\}_{t=1}^T\).
-
Formally, the forward process is defined as:
\[q(x_t \mid x_{t-1}) =N\left( x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I \right)\]-
where \(\beta_t\) controls the magnitude of noise added at step \(t\). Repeating this process for sufficiently large \(T\) ensures that the final variable \(x_T\) is approximately distributed as an isotropic Gaussian:
\[q(x_T) \approx N(0, I)\]
-
-
A key practical insight, introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020), is that the forward process admits a closed-form solution for sampling \(x_t\) directly from \(x_0\):
\[x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim N(0, I)\]- with \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\). This property allows efficient training by randomly sampling timesteps without explicitly simulating the full forward chain.
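A minimal PyTorch sketch of this closed-form sampling step is shown below; it assumes a precomputed tensor of cumulative products \(\bar{\alpha}_t\) (here `alpha_bars`), and the function and argument names are illustrative.

```python
import torch

def q_sample(x0, t, alpha_bars, noise=None):
    """Draw x_t ~ q(x_t | x_0) in closed form (illustrative sketch).

    x0:         clean batch, shape (B, C, H, W)
    t:          integer timesteps, shape (B,)
    alpha_bars: precomputed cumulative products alpha_bar_t, shape (T,)
    """
    if noise is None:
        noise = torch.randn_like(x0)
    # Broadcast the per-example coefficients over channel and spatial dimensions.
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```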
Reverse Denoising Process
-
The reverse process is the core learned component of a diffusion model. Its purpose is to invert the forward diffusion dynamics by progressively removing noise.
-
Starting from pure noise \(x_T \sim N(0, I)\), the model applies a sequence of learned reverse transitions:
- Each reverse transition is parameterized as a Gaussian distribution:
\[p_\theta(x_{t-1} \mid x_t) = N\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)\]
-
In practice, the variance \(\Sigma_\theta\) is often fixed or chosen from a small set of options, while the mean \(\mu_\theta\) is predicted by a neural network conditioned on both the noisy input \(x_t\) and the timestep \(t\).
-
This reverse process is illustrated below, where noise is gradually transformed back into a structured sample:

Training Procedure and Intuition
-
Training a diffusion model involves teaching the neural network how to denoise inputs at all noise levels. The training loop proceeds as follows:
- Sample a clean data point \(x_0 \sim q(x_0)\).
- Sample a timestep \(t\) uniformly from \(\{1, \ldots, T\}\).
- Sample noise \(\epsilon \sim N(0, I)\).
- Construct a noisy input \(x_t\) using the closed-form forward process.
- Train the network to predict the noise \(\epsilon\) from \(x_t\) and \(t\).
-
This procedure is repeated across batches of data using stochastic gradient descent. Importantly, the network learns a local denoising rule at each timestep rather than a global mapping from noise to data.
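A minimal PyTorch sketch of this loop is given below; `model` is assumed to be any network \(\epsilon_\theta(x_t, t)\) whose output matches the input shape, and the helper names are illustrative rather than a library API.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bars):
    """One illustrative DDPM-style training step (predict the injected noise)."""
    T = alpha_bars.shape[0]
    B = x0.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)       # uniform timestep per example
    eps = torch.randn_like(x0)                             # injected Gaussian noise
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps         # closed-form q(x_t | x_0)
    eps_pred = model(x_t, t)                               # predict the noise
    return F.mse_loss(eps_pred, eps)                       # simple regression objective
```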
Noise Prediction and Score Estimation
- Modern diffusion models are almost always trained to predict the injected noise \(\epsilon\) rather than directly predicting the clean data \(x_0\) or the reverse-process mean. The corresponding loss function is:
\[L_{\text{simple}}(\theta) = \mathbb{E}_{x_0, t, \epsilon}\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right]\]
-
This formulation has several advantages:
- It yields stable gradients across timesteps.
- It avoids scale issues at high noise levels.
- It connects diffusion models to denoising score matching, as shown in Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019).
-
Intuitively, predicting noise is equivalent to estimating the direction in which a noisy sample should be moved to increase data likelihood.
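This equivalence can be written explicitly: since \(q(x_t \mid x_0)\) is Gaussian with mean \(\sqrt{\bar{\alpha}_t}\,x_0\) and covariance \((1-\bar{\alpha}_t)I\), its score is a rescaled version of the injected noise, so a trained noise predictor yields a score estimate up to a known scaling:
\[\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}} \quad\Longrightarrow\quad s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}\]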
Sampling and Generation
- After training, generation proceeds by initializing \(x_T \sim N(0, I)\) and repeatedly applying the learned reverse transitions:
\[x_{t-1} \sim p_\theta(x_{t-1} \mid x_t), \quad t = T, T-1, \ldots, 1\]
-
Each step removes a small amount of noise and restores structure. After \(T\) steps, the final output \(x_0\) is obtained.
-
While this procedure yields high-quality samples, it is computationally expensive due to the large number of sequential steps. This limitation motivated the development of accelerated samplers such as DDIM Denoising Diffusion Implicit Models by Song et al. (2020) and ODE-based solvers derived from continuous-time diffusion theory.
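A minimal PyTorch sketch of the ancestral sampling loop is shown below; it writes the reverse-process mean in terms of the noise prediction (following Denoising Diffusion Probabilistic Models by Ho et al. (2020)) and fixes the variance to \(\sigma_t^2 = \beta_t\). The function name and argument conventions are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, device="cpu"):
    """Illustrative DDPM ancestral sampling loop (not a library API)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x = torch.randn(shape, device=device)                    # start from x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        # Reverse-process mean written in terms of the predicted noise.
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t variant
        else:
            x = mean                                          # no noise at the final step
    return x
```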
Practical Insights
-
From a practical standpoint, diffusion models succeed because they:
- decompose generation into many simple denoising steps,
- train a single network to handle all noise levels,
- and leverage a fixed, well-behaved forward corruption process.
-
This design transforms a challenging generative modeling problem into a sequence of tractable regression tasks, explaining both the robustness and the scalability of diffusion-based approaches.
The Math Under-the-Hood
- At the core of diffusion models lies a probabilistic construction based on Markov chains, Gaussian perturbations, and variational inference. A diffusion model defines a structured latent-variable model in which data are progressively corrupted by noise through a fixed forward process, and then reconstructed through a learned reverse process. This formulation was first proposed in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015) and later refined into a practical and scalable framework in Denoising Diffusion Probabilistic Models by Ho et al. (2020).
Forward Diffusion Process (Noising)
-
The forward diffusion process is a fixed, non-learned stochastic process that gradually destroys structure in the data by adding Gaussian noise. Let \(x_0 \sim q(x_0)\) denote a data sample drawn from the unknown data distribution. The forward process defines a sequence of latent variables \(x_1, \ldots, x_T\) such that each variable is obtained by perturbing the previous one with a small amount of noise.
-
Formally, the forward process is defined as a Markov chain with Gaussian transition kernels:
\[q\left(x_{1:T} \mid x_0\right) := \prod_{t=1}^{T} q\left(x_t \mid x_{t-1}\right) := \prod_{t=1}^{T} N\left( x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I \right)\]-
where:
- \(\beta_t \in (0,1)\) is a variance schedule controlling the amount of noise added at timestep \(t\),
- \(I\) is the identity covariance matrix,
- and \(T\) is the total number of diffusion steps.
-
-
The variance schedule \(\{\beta_t\}_{t=1}^T\) is chosen such that the amount of injected noise increases monotonically over time. For sufficiently large \(T\) and a well-behaved schedule, the final latent variable \(x_T\) converges in distribution to an isotropic Gaussian:
\[q(x_T \mid x_0) \approx N(x_T; 0, I)\]
-
A crucial property of this construction, derived in Denoising Diffusion Probabilistic Models by Ho et al. (2020), is that the marginal distribution \(q(x_t \mid x_0)\) admits a closed-form expression:
\[x_t =\sqrt{\bar{\alpha}_t}x_0 +\sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim N(0, I),\]-
where:
\[\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\]
-
-
This result allows direct sampling of \(x_t\) from \(x_0\) at any timestep \(t\) without explicitly simulating the entire forward chain, which is critical for efficient training.
-
The joint structure of the forward diffusion process is visualized below (figure adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020)):

Reverse Diffusion Process (Denoising)
-
The generative power of diffusion models arises from learning the reverse diffusion process, which inverts the forward noising dynamics. While the forward process is analytically defined, the true reverse process \(q(x_{t-1} \mid x_t)\) is generally intractable. Diffusion models therefore learn a parametric approximation to this reverse process.
-
The learned generative model defines the joint distribution:
\[p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\]- where the prior over the final latent variable is fixed as:
\[p(x_T) = N(x_T; 0, I)\]
-
Each reverse transition is parameterized as a Gaussian:
\[p_\theta(x_{t-1} \mid x_t) =N\left( x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t) \right)\]-
with:
- \(\mu_\theta(x_t, t)\) denoting the predicted mean,
- \(\Sigma_\theta(x_t, t)\) denoting the predicted or fixed covariance,
- and \(\theta\) representing the parameters of a neural network conditioned on \(x_t\) and the timestep \(t\).
-
-
The Markov property plays a crucial role here: each reverse transition depends only on the current noisy state \(x_t\) and not on earlier or later latent variables. This conditional independence enables tractable likelihood bounds and efficient training.
-
The reverse diffusion chain is illustrated below (figure adapted from Denoising Diffusion Probabilistic Models by Ho et al. (2020)):

Variational Learning Perspective
- Training diffusion models is framed as variational inference. Specifically, the objective is to maximize a variational lower bound (ELBO) on the data log-likelihood, or equivalently to minimize the negative bound \(L_{\text{VLB}}\):
\[\log p_\theta(x_0) \geq -L_{\text{VLB}}, \qquad L_{\text{VLB}} := \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right]\]- Due to the Markov structure and Gaussian assumptions, this bound decomposes into a sum of Kullback–Leibler divergence terms between forward and reverse transition distributions, plus a reconstruction term at the final step:
\[L_{\text{VLB}} = \mathbb{E}_q\left[ D_{\text{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t=2}^{T} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right]\]
- Because both the forward and reverse transitions are Gaussian, all KL divergence terms admit closed-form expressions. This tractability distinguishes diffusion models from many other latent-variable models and avoids reliance on high-variance Monte Carlo estimators.
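The tractability stems from the fact that the forward-process posterior, conditioned on \(x_0\), is itself Gaussian with closed-form parameters (see Denoising Diffusion Probabilistic Models by Ho et al. (2020)), so each KL term compares two Gaussians:
\[q(x_{t-1} \mid x_t, x_0) = N\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\right), \quad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t, \quad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t\]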
Noise Prediction Parameterization
-
A key empirical insight introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020) is that training becomes significantly simpler and more stable when the model is parameterized to predict the noise \(\epsilon\) rather than the reverse-process mean directly.
-
Under this parameterization, the neural network \(\epsilon_\theta(x_t, t)\) is trained to approximate the noise used to generate \(x_t\):
\[\epsilon_\theta(x_t, t) \approx \epsilon, \quad \text{where} \quad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim N(0, I)\]
-
This yields the widely used simplified training objective:
\[L_{\text{simple}}(\theta) = \mathbb{E}_{x_0, t, \epsilon} \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right]\]- where \(t\) is sampled uniformly from \(\{1,\ldots,T\}\) and \(\epsilon \sim N(0, I)\).
-
This objective can be interpreted as a form of denoising score matching, establishing a direct connection between diffusion models and score-based generative modeling, as shown in Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019).
Takeaways
-
In summary, the mathematical foundation of diffusion models rests on:
- a fixed Gaussian forward diffusion process,
- a learned Gaussian reverse process parameterized by neural networks,
- a variational likelihood objective composed of tractable KL divergences,
- and a noise-prediction parameterization that yields stable and efficient training.
-
This combination of probabilistic rigor and practical simplicity explains why diffusion models are both theoretically well-grounded and empirically successful, and it sets the stage for understanding architectural choices and sampling algorithms in subsequent sections.
Taxonomy of Diffusion Models
-
Diffusion models are a class of generative models that transform data through a forward process of adding noise and a reverse process that learns to denoise it step-by-step to generate new data. At a high level, diffusion models can be categorized along two largely orthogonal axes:
- Time formulation: discrete-time versus continuous-time diffusion.
- Representation space: pixel space, latent space, or other learned representations.
-
In their most common and practically deployed form, diffusion models are discrete-time models, where noise is added and removed over a finite sequence of timesteps (\(t \in \{1,\dots,T\}\)). Within this class, diffusion is typically implemented either directly in pixel space or in a learned latent space.
-
More precisely, modern diffusion models usually learn a local denoising rule at each noise level, parameterized by a neural network that predicts one of the following equivalent quantities:
- the injected Gaussian noise (\(\epsilon\)),
- the original clean sample (\(x_0\)),
- or the score (\(\nabla_x \log p_t(x)\)) of the noisy data distribution.
- These parameterizations are mathematically interchangeable under Gaussian noise assumptions and correspond to different but equivalent views of the reverse diffusion dynamics.
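A minimal sketch of the algebra behind this interchangeability, using the closed-form relation \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\) introduced earlier; the helper names are illustrative and `alpha_bar_t` is assumed to be a broadcastable tensor.

```python
import torch

def eps_to_x0(x_t, eps_pred, alpha_bar_t):
    """Clean-sample prediction implied by a noise prediction."""
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()

def eps_to_score(eps_pred, alpha_bar_t):
    """Score estimate for the noisy marginal implied by a noise prediction."""
    return -eps_pred / (1.0 - alpha_bar_t).sqrt()
```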
-
Mathematically speaking, in discrete-time diffusion models, the reverse model is typically trained via a denoising objective that matches the model’s prediction (most commonly \(\epsilon_\theta(x_t, t)\)) to the true Gaussian noise used to construct \(x_t\) from \(x_0\). Generation then emerges by repeatedly applying the learned update rule across discrete timesteps, starting from \(t = T\) and proceeding down to \(t = 0\).
- This discrete-time framing has proven remarkably robust, as it allows complex data distributions to be learned via simple Gaussian perturbations and local denoising steps rather than direct density modeling. Canonical examples of this class include Denoising Diffusion Probabilistic Models (DDPMs) and their accelerated samplers such as Denoising Diffusion Implicit Models (DDIMs).
-
Continuous-time diffusion models generalize this perspective by describing the forward and reverse processes as Stochastic Differential Equations (SDEs) defined over a continuous time variable (\(t \in [0,1]\)). In this formulation, diffusion is no longer tied to a fixed number of discrete noise levels but instead evolves according to continuous stochastic dynamics.
- Importantly, this continuous-time view is representation-agnostic: the same SDE framework applies whether diffusion is performed in pixel space, latent space, or another learned representation. Discrete-time diffusion models such as DDPMs and DDIMs can be recovered as specific numerical discretizations of these continuous-time processes.
-
In practice, most large-scale systems rely on latent-space, discrete-time diffusion models trained with Denoising Diffusion Probabilistic Model (DDPM)–style objectives, especially in image, video, and multimodal generation, due to their favorable trade-off between sample quality and computational cost. In these systems, diffusion is applied in a learned latent space (typically produced by a Variational AutoEncoder), but the time formulation remains discrete.
-
Sampling in modern systems is most commonly performed using:
- Denoising Diffusion Implicit Model (DDIM)–like deterministic or partially deterministic discrete-time samplers, or
- probability-flow Ordinary Differential Equation (ODE) solvers derived from the continuous-time Stochastic Differential Equation (SDE) framework.
-
By contrast, pure pixel-space discrete-time DDPMs and standalone continuous-time Score-Based Generative Models (SGMs) based on Langevin dynamics are now primarily used for research, benchmarking, or specialized domains where fidelity, theoretical clarity, or likelihood estimation is prioritized over raw sampling speed.
Pixel-Space Diffusion Models
-
Pixel-space diffusion models operate directly on high-dimensional data representations such as image pixels or raw audio waveforms. In this setting, the diffusion process acts on the original data coordinates, and no intermediate learned representation is introduced. As a result, pixel-space diffusion models offer a clear probabilistic interpretation and can achieve very high sample fidelity.
-
From a hierarchical perspective, pixel-space diffusion models can be further divided according to their time formulation:
- Discrete-time pixel-space diffusion models, where noise is added and removed over a finite sequence of timesteps.
- Continuous-time pixel-space diffusion models, where diffusion is defined via stochastic differential equations and score matching.
- Historically and practically, discrete-time pixel-space diffusion models appeared first and form the conceptual foundation of the field.
-
While pixel-space diffusion enables precise modeling of fine-grained details, it also leads to substantial computational costs. The dimensionality of pixel data is extremely high, and both training and sampling require repeated neural network evaluations over many diffusion steps. These limitations motivated the development of latent-space diffusion models, which apply the same principles in a compressed representation.
-
Despite these costs, pixel-space diffusion models remain important for understanding the theoretical foundations of diffusion-based generative modeling and continue to be used in domains where the highest possible fidelity or exact likelihood computation is required.
Denoising Diffusion Probabilistic Models (DDPMs) / Discrete-Time Pixel-Space Diffusion Models
-
Denoising Diffusion Probabilistic Models (DDPMs) are the canonical formulation of discrete-time diffusion-based generative modeling in pixel space. They were introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020) and define a tractable likelihood-based framework for learning complex data distributions via a sequence of noising and denoising steps indexed by a finite timestep variable.
-
In the hierarchy of diffusion models, DDPMs occupy a central position:
- They are discrete-time rather than continuous-time models.
- They operate directly in pixel space, rather than a learned latent space.
- They explicitly parameterize the reverse diffusion transitions as conditional probability distributions.
-
DDPMs model generation as the reversal of a fixed Markovian diffusion process that gradually destroys structure in the data by injecting Gaussian noise. Learning focuses on approximating the reverse transitions, which enables sampling from the data distribution starting from pure noise.
-
Overall, DDPMs form the conceptual backbone of modern diffusion models and serve as the starting point for numerous extensions, including accelerated discrete-time samplers (such as DDIMs), continuous-time SDE formulations, and latent-space diffusion models.
Implementation Details
-
Forward (Diffusion / Noising) Process:
- The forward process is a fixed, non-learned Markov chain that progressively adds Gaussian noise to a data sample over \(T\) discrete timesteps. Given an original data point \(x_0\), the transition from timestep \(t-1\) to \(t\) is defined as
\[q(x_t \mid x_{t-1}) = N\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right)\]- where \(\{\beta_t\}_{t=1}^T\) is a predefined variance schedule controlling the amount of noise added at each step. As \(t\) increases, the signal-to-noise ratio decreases, and for sufficiently large \(T\), the distribution of \(x_T\) approaches a standard Gaussian.
-
A key property of this construction is that the marginal distribution \(q(x_t \mid x_0)\) admits a closed-form expression:
\[x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim N(0, I)\]- where \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\). This allows training to sample arbitrary timesteps directly without simulating the full forward chain.
-
Reverse (Denoising) Process:
-
The reverse process aims to invert the diffusion by learning a parameterized Markov chain that gradually removes noise. Each reverse transition is modeled as a Gaussian distribution:
\[p_\theta(x_{t-1} \mid x_t) = N\left(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\right)\]- where the mean \(\mu_\theta\) is predicted by a neural network (typically a U-Net conditioned on the timestep \(t\)), and the variance \(\sigma_t^2\) is either fixed or learned.
-
In practice, DDPMs are commonly parameterized to predict the noise \(\epsilon\) rather than the mean directly. This reparameterization simplifies optimization and improves empirical performance.
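Concretely, under the noise-prediction parameterization the reverse-process mean is recovered from \(\epsilon_\theta\) via the standard DDPM identity:
\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right)\]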
-
-
The following figure from the paper illustrates DDPMs as directed graphical models:

-
Training Objective:
- DDPMs are trained by maximizing a variational lower bound on the data log-likelihood. In practice, this objective simplifies to a denoising score-matching loss that minimizes the mean squared error between the true noise and the predicted noise:
\[L_{\mathrm{simple}}(\theta) = \mathbb{E}_{x_0, t, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]\]- where \(t\) is sampled uniformly from \(\{1, \ldots, T\}\) and \(\epsilon \sim N(0, I)\).
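The simplified objective translates almost directly into code. Below is a minimal sketch of one loss computation, assuming `model(x_t, t)` is any noise-prediction network and `alpha_bar` is the cumulative-product tensor from the sketch above; both names are illustrative:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar):
    """Simplified DDPM objective: MSE between the injected and the predicted noise."""
    alpha_bar = alpha_bar.to(x0.device)
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform over timesteps
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise            # closed-form forward sample
    return F.mse_loss(model(x_t, t), noise)
```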
-
Sampling:
- Sampling begins from pure Gaussian noise \(x_T \sim N(0, I)\) and applies the learned reverse transitions iteratively:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \quad z \sim N(0, I)\]- Each step incrementally denoises the sample until a final output \(x_0\) is produced. Although this procedure yields high-quality samples, it typically requires hundreds or thousands of sequential steps.
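For concreteness, the reverse chain can be implemented as a simple loop. The sketch below uses the common choice \(\sigma_t^2 = \beta_t\) and assumes `model(x, t)` predicts the noise; it is illustrative rather than a reference implementation:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, device="cpu"):
    """Ancestral DDPM sampling: start from pure noise and apply the reverse chain."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                  # predicted noise
        coef = float(betas[t]) / float((1 - alpha_bar[t]).sqrt())
        mean = (x - coef * eps) / float(alphas[t].sqrt())
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + float(betas[t].sqrt()) * noise                # sigma_t^2 = beta_t choice
    return x
```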
Pros
- Strong probabilistic foundation with an explicit likelihood formulation.
- Stable training and consistently high sample quality.
- Conceptually simple and broadly applicable across data modalities.
Cons
- Slow sampling due to the large number of required denoising steps.
- High computational cost for high-resolution data when operating in pixel space.
Denoising Diffusion Implicit Models (DDIMs)
-
Denoising Diffusion Implicit Models (DDIMs) are not a separate class of diffusion models but rather an alternative sampling procedure applied to models trained with the DDPM objective. Introduced in Denoising Diffusion Implicit Models by Song et al. (2020), DDIMs replace the stochastic reverse Markov chain used in DDPM sampling with deterministic or partially stochastic trajectories that traverse the same discrete noise levels. This reinterpretation enables substantially faster generation without retraining, and places DDIMs within a broader family of discrete-time samplers that interpolate between fully stochastic DDPM sampling and deterministic probability-flow dynamics.
-
In the corrected hierarchy, DDIMs should be understood as:
- operating in discrete time,
- reusing the same forward diffusion process and training objective as DDPMs,
- defining a non-stochastic or partially stochastic reverse process that follows a different trajectory through the same sequence of noise levels.
-
Conceptually, DDIMs demonstrate that the DDPM reverse process is only one member of a broader family of valid reverse processes consistent with the same forward diffusion. By selecting a deterministic member of this family, DDIMs enable substantially faster sampling without retraining the model.
-
DDIMs form a crucial conceptual and practical bridge between probabilistic discrete-time diffusion models and continuous-time probability-flow formulations derived from stochastic differential equations.
-
The following figure from the paper illustrates DDIMs as a graphical model for accelerated generation, where \(\tau = [1, 3]\):

Implementation Details
-
Forward Process (Identical to DDPMs):
- The forward diffusion process in DDIMs is exactly the same as in DDPMs. Gaussian noise is added to the data over \(T\) discrete timesteps according to:
\[q(x_t \mid x_{t-1}) = N\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)\]
where:
- \(x_{t-1}\) is the data sample at timestep \(t-1\)
- \(x_t\) is the noisy sample at timestep \(t\)
- \(\beta_t\) is the variance schedule coefficient
- \(I\) is the identity covariance matrix
- \(N(\cdot;\mu,\Sigma)\) denotes a multivariate Gaussian distribution
-
As in DDPMs, the marginal distribution admits a closed-form expression:
\[q(x_t \mid x_0) = N\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\right)\]
where:
- \(x_0\) is the clean data sample
- \[\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\]
- \[\alpha_t = 1 - \beta_t\]
-
Training Objective (Unchanged from DDPMs):
- DDIMs require no changes to training. The noise-prediction network is trained using the same DDPM objective:
\[L(\theta) = \mathbb{E}_{x_0, t, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]\]
where:
- \(\epsilon_\theta(x_t, t)\) is the neural network’s prediction of the injected noise
- \(t\) is sampled uniformly from \(\{1,\dots,T\}\)
- \[\epsilon \sim N(0, I)\]
-
Reverse (Implicit) Process:
- The defining feature of DDIMs is their implicit reverse process, which replaces stochastic sampling with a deterministic update rule. The reverse update is given by:
\[x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)\]
where:
- \(x_t\) is the sample at timestep \(t\)
- \(x_{t-1}\) is the reconstructed sample at timestep \(t-1\)
- \(\epsilon_\theta(x_t, t)\) is the predicted noise
- the fraction term estimates \(x_0\) from \(x_t\)
-
This update corresponds to setting the stochasticity parameter \(\eta = 0\). More generally, DDIMs introduce a parameter \(\eta\) that interpolates between deterministic DDIM sampling and stochastic DDPM sampling:
- \(\eta = 0\) \(\rightarrow\) deterministic DDIM
- \(\eta = 1\) \(\rightarrow\) stochastic DDPM
-
Sampling:
- Sampling begins from \(x_T \sim N(0, I)\) and proceeds by applying the deterministic update rule above along a decreasing sub-sequence of timesteps \(\tau \subset \{1, \ldots, T\}\).
- By skipping intermediate timesteps and following a non-Markovian trajectory, DDIMs can generate samples in tens of steps rather than hundreds or thousands.
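A minimal deterministic DDIM loop (\(\eta = 0\)) might look as follows, assuming `alpha_bar` holds \(\bar{\alpha}_t\) and `timesteps` is a decreasing sub-sequence of indices; the interface is illustrative:

```python
import math
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, timesteps, device="cpu"):
    """Deterministic DDIM sampling (eta = 0) over a decreasing sub-sequence of timesteps."""
    x = torch.randn(shape, device=device)
    for i, t in enumerate(timesteps):                            # e.g. [999, 899, ..., 0]
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        ab_t = float(alpha_bar[t])
        ab_prev = float(alpha_bar[timesteps[i + 1]]) if i + 1 < len(timesteps) else 1.0
        x0_pred = (x - math.sqrt(1 - ab_t) * eps) / math.sqrt(ab_t)   # estimate of x_0
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1 - ab_prev) * eps
    return x
```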
Pros
- Dramatically faster sampling than DDPMs.
- Deterministic trajectories enable reproducibility and interpolation.
- No retraining required—fully compatible with DDPM-trained models.
Cons
- Deterministic sampling can reduce sample diversity.
- Excessive timestep skipping can degrade sample quality.
- Still fundamentally discrete-time and tied to a predefined noise schedule.
Latent-Space Diffusion Models
-
Latent-space diffusion models apply the diffusion process in a learned, lower-dimensional representation space rather than directly in the original data space. From a hierarchical standpoint, latent-space diffusion is orthogonal to the time formulation: diffusion in latent space can be implemented using discrete-time objectives (e.g., DDPM-style) or derived from continuous-time SDE formulations.
-
Concretely, an encoder first maps data \(x\) from pixel space into a latent variable \(z\). Diffusion—whether discrete or continuous in time—is then defined over \(z\) using the same Gaussian noising and denoising principles as pixel-space models. Generation proceeds by denoising latent noise and decoding the resulting latent back into data space.
-
In practice, latent-space diffusion has become the dominant paradigm for large-scale generative modeling, particularly for high-resolution image, video, and multimodal generation. Most deployed systems combine:
- latent representations learned via a Variational AutoEncoder (VAE),
- discrete-time DDPM-style noise-prediction objectives, and
- accelerated samplers such as DDIM or probability-flow ODE solvers derived from continuous-time SDE theory.
-
By dramatically reducing the dimensionality of the diffusion space, latent diffusion achieves large gains in computational efficiency while preserving the expressive power and stability of diffusion-based generative modeling.
Latent-Space Diffusion Models (LDMs) / Variational Diffusion Models (VDMs)
-
Latent Diffusion Models (LDMs) apply diffusion processes not in the original data space but in a learned latent representation. This approach was popularized by High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022) and directly addresses the primary computational bottleneck of pixel-space diffusion: operating in extremely high-dimensional spaces.
-
While the term Variational Diffusion Models (VDMs) is sometimes used broadly, in modern usage it typically refers to diffusion models combined with a Variational AutoEncoder (VAE), where diffusion is performed in the VAE’s latent space rather than directly on pixels.
-
In the corrected hierarchy:
- LDMs are typically discrete-time diffusion models trained with DDPM-style objectives.
- The use of latent space is an implementation and representation choice, not a distinct diffusion family.
- Continuous-time SDE formulations can also be applied in latent space, yielding equivalent generative dynamics under appropriate discretization.
-
LDMs now form the backbone of many modern text-to-image and multimodal systems by combining efficient latent representations with powerful diffusion-based denoising networks.
Implementation Details
-
Latent Representation via a VAE:
- Latent diffusion models first train a VAE to compress data into a lower-dimensional latent space. Given a data sample \(x\), an encoder produces a latent representation \(z\):
\[z \sim q_\phi(z \mid x)\]
where:
- \(x\) denotes a data sample in pixel space
- \(z\) denotes the latent representation
- \(q_\phi(z \mid x)\) is the encoder distribution parameterized by \(\phi\)
- \(\phi\) represents the encoder parameters
-
A decoder reconstructs the data via:
\[\hat{x} \sim p_\theta(x \mid z)\]
where:
- \(p_\theta(x \mid z)\) is the decoder (likelihood model) parameterized by \(\theta\)
-
The VAE is trained by maximizing the evidence lower bound (ELBO):
\[L_{\mathrm{ELBO}}(\phi, \theta) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \mid\mid p(z)\right)\]
where:
- \(p(z)\) is the latent prior, typically \(N(0, I)\)
- \(D_{KL}(\cdot \mid\mid \cdot)\) denotes the Kullback–Leibler divergence
-
Diffusion in Latent Space (Discrete-Time Formulation):
- After VAE training, a diffusion process is defined over latent variables using a discrete-time forward process identical in form to pixel-space DDPMs:
\[q(z_t \mid z_{t-1}) = N\left(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I\right)\]
where:
- \(z_t\) denotes the latent variable at diffusion timestep \(t\)
- \(\beta_t\) is the noise variance schedule
-
The corresponding marginal distribution is:
\[q(z_t \mid z_0) = N\left(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1-\bar{\alpha}_t) I\right)\]
Training Objective (DDPM-Style Noise Prediction):
- A neural network is trained to reverse the latent diffusion process using the same objective as pixel-space DDPMs:
\[L(\theta) = \mathbb{E}_{z_0, t, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert^2\right]\]
where:
- \(\epsilon_\theta(z_t, t)\) is the predicted latent noise
- \(t\) is sampled uniformly from \(\{1,\dots,T\}\)
-
Because the latent space is much lower dimensional than pixel space, both training and sampling are substantially more efficient.
-
Sampling and Decoding:
- Sampling begins from latent noise \(z_T \sim N(0, I)\).
- Reverse diffusion produces a clean latent \(z_0\), which is decoded into pixel space via the decoder \(p_\theta(x \mid z_0)\).
- The decoder maps the denoised latent back into the data space, producing the final sample.
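Putting the pieces together, a latent-diffusion sampler runs the usual reverse chain on latents and only touches pixel space at the very end. The sketch below assumes a `vae` object exposing a `decode(z)` method and a `denoiser(z_t, t)` latent noise predictor; both interfaces are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def sample_and_decode(denoiser, vae, latent_shape, betas, device="cpu"):
    """Reverse diffusion in latent space followed by decoding to pixel space."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(latent_shape, device=device)                 # z_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((latent_shape[0],), t, device=device, dtype=torch.long)
        eps = denoiser(z, t_batch)                               # predicted latent noise
        coef = float(betas[t]) / float((1 - alpha_bar[t]).sqrt())
        mean = (z - coef * eps) / float(alphas[t].sqrt())
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + float(betas[t].sqrt()) * noise
    return vae.decode(z)                                         # map z_0 back to pixel space
```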
-
As shown in the illustration below from High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al., the latent representation undergoes \(T\) diffusion steps, after which a U-Net denoiser operates over the noisy latent. Conditioning on text or other modalities is typically implemented via concatenation or cross-attention.

Pros
- Dramatically reduces computational and memory costs compared to pixel-space diffusion.
- Enables efficient high-resolution and multimodal generation.
- Retains the expressive power and stability of discrete-time diffusion models.
Cons
- Overall sample quality depends on the quality of the learned latent representation.
- Introduces additional complexity due to the VAE training stage.
- Reconstruction errors from the VAE can limit ultimate fidelity.
Continuous-Time Diffusion Models (Representation-Agnostic)
-
Continuous-time diffusion models formulate generative modeling as the evolution of data under a continuous-time dynamical process, rather than as a finite sequence of discrete noise steps. In this setting, diffusion is parameterized by a continuous time variable \(t \in [0,1]\), and both the forward (noising) and reverse (generation) processes are described using Stochastic Differential Equations (SDEs).
-
This perspective was unified and formalized in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021), which showed that many seemingly distinct generative models—including DDPMs, DDIMs, and score-based models—can be understood as different discretizations or solver choices applied to the same underlying continuous-time formulation.
-
From a taxonomy standpoint, continuous-time diffusion models serve as a representation-agnostic and time-continuous generalization of discrete-time diffusion models:
- Pixel-space diffusion models correspond to applying these dynamics directly in data space.
- Latent-space diffusion models correspond to applying the same dynamics in a learned latent representation.
- Score-Based Generative Models (SGMs) correspond to learning the score function required to reverse the diffusion process.
-
Importantly, continuous-time diffusion does not prescribe:
- Where diffusion occurs (pixel space versus latent space), nor
- How the reverse process is parameterized (noise prediction, score prediction, or velocity prediction).
-
Instead, it provides a unifying mathematical framework in which these modeling choices can be rigorously related.
-
In modern systems, continuous-time diffusion primarily functions as a conceptual and theoretical backbone. Practical implementations often rely on discrete-time training objectives (e.g., DDPM-style losses) and fast samplers (e.g., DDIM or ODE solvers) that are derived from this framework.
Stochastic Differential Equation (SDE)-Based Diffusion Models
- SDE-based diffusion models define the forward noising process as a continuous-time stochastic process and the reverse generative process as its time reversal. This framework unifies Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Score-Based Generative Models (SGMs) within a single mathematical formulation, as shown in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021).
Forward Diffusion as an SDE
-
The forward diffusion process is defined by the SDE
\[dx=f(x,t)dt+g(t)dW_t\]-
where:
- \(x\) is the data state at time \(t\),
- \(f(x,t)\) is the drift term,
- \(g(t)\) controls the noise magnitude,
- \(W_t\) is a standard Wiener (Brownian motion) process.
-
-
A widely used instance is the variance-preserving (VP) SDE:
\[dx=-\frac{1}{2}\beta(t)x dt +\sqrt{\beta(t)}dW_t\]- where \(\beta(t)\) is a time-dependent noise schedule chosen so that the marginal distribution transitions smoothly from the data distribution at \(t=0\) to an isotropic Gaussian as \(t \to 1\).
-
Discrete-time DDPMs arise as Euler–Maruyama discretizations of this SDE, as shown in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021).
Score-Based Generative Modeling (SGMs)
-
Score-Based Generative Models are the score-learning instantiation of SDE-based diffusion models. Instead of parameterizing reverse transition kernels directly, SGMs learn the score function
\[s_\theta(x,t) \approx \nabla_x \log p_t(x)\]- which fully specifies the reverse-time dynamics.
-
The foundational formulation was introduced in Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019) and later generalized to continuous time via SDEs by Song et al. (2021).
Noise-Conditional Perturbation
-
SGMs define a continuum of noisy distributions by perturbing clean data:
\[x_\sigma =x_0 +\sigma \epsilon, \quad \epsilon \sim N(0,I)\]-
where \(\sigma\) is a continuous noise scale. A neural network \(s_\theta(x,\sigma)\) is trained to approximate the score of the noisy distribution:
\[s_\theta(x_\sigma,\sigma) \approx \nabla_{x_\sigma} \log p_\sigma(x_\sigma)\]
-
Training via Denoising Score Matching
-
Training is performed using denoising score matching, yielding the objective
\[L=\mathbb{E}_{x_0,\sigma,\epsilon} \left[ \lambda(\sigma) \left\lVert s_\theta(x_\sigma,\sigma) +\frac{\epsilon}{\sigma} \right\rVert^2 \right]\]- where \(\lambda(\sigma)\) is a noise-dependent weighting function. This objective trains the model to recover the score field across continuous noise levels.
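A sketch of this objective in PyTorch is shown below, using the common \(\lambda(\sigma) = \sigma^2\) weighting and log-uniform noise scales; the `score_model(x_sigma, sigma)` interface and the scale range are assumptions of this sketch:

```python
import math
import torch

def dsm_loss(score_model, x0, sigma_min=0.01, sigma_max=50.0):
    """Denoising score matching over continuous noise scales.

    The score of the perturbed distribution at x_sigma = x0 + sigma * eps is
    -eps / sigma, so the residual s_theta(x_sigma, sigma) + eps / sigma should be zero.
    """
    B = x0.shape[0]
    u = torch.rand(B, device=x0.device)                          # log-uniform noise scales
    sigma = torch.exp(u * (math.log(sigma_max) - math.log(sigma_min)) + math.log(sigma_min))
    sigma_ = sigma.view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_sigma = x0 + sigma_ * eps
    residual = score_model(x_sigma, sigma) + eps / sigma_        # ~0 if the score is exact
    weight = sigma ** 2                                          # lambda(sigma) = sigma^2
    return (weight * residual.flatten(1).pow(2).sum(dim=1)).mean()
```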
Reverse-Time SDE and Sampling
-
Given a learned score function, the reverse-time SDE is
\[dx =\left[f(x,t) -g(t)^2 s_\theta(x,t) \right]dt +g(t) d\bar{W}_t\]- where \(\bar{W}_t\) denotes reverse-time Brownian motion. Solving this SDE stochastically yields generative samples.
-
A deterministic alternative is the probability flow ODE:
\[\frac{dx}{dt} =f(x,t)-\frac{1}{2}g(t)^2 s_\theta(x,t)\]- which preserves the same marginal distributions and corresponds to DDIM-style sampling, as shown in Denoising Diffusion Implicit Models by Song et al. (2020).
Sampling via Langevin Dynamics (Discrete Approximation)
- Early SGMs implement stochastic sampling via annealed Langevin dynamics, a discrete approximation of the reverse SDE:
\[x_{i+1} = x_i + \frac{\alpha_i}{2}\, s_\theta(x_i, \sigma) + \sqrt{\alpha_i}\, z_i, \quad z_i \sim N(0, I)\]- where \(\alpha_i\) is a step size. As \(\sigma\) is gradually annealed from large to small values, sampling transitions from coarse structure formation to fine-grained detail refinement.
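An annealed Langevin sampler can be sketched as two nested loops over noise scales and inner updates. The step-size heuristic and the `score_model(x, sigma)` interface below are assumptions, not prescriptions from the paper:

```python
import torch

@torch.no_grad()
def annealed_langevin_sample(score_model, shape, sigmas, n_steps=100, step_lr=2e-5, device="cpu"):
    """Annealed Langevin dynamics over a decreasing sequence of noise scales."""
    sigmas = sigmas.to(device)                                   # sigmas: large -> small
    x = torch.randn(shape, device=device)
    for sigma in sigmas:
        # step size scaled by sigma^2 relative to the smallest scale (a common heuristic)
        alpha = step_lr * (sigma / sigmas[-1]) ** 2
        for _ in range(n_steps):
            z = torch.randn_like(x)
            grad = score_model(x, sigma.expand(shape[0]))        # learned score at this scale
            x = x + 0.5 * alpha * grad + alpha.sqrt() * z
    return x
```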
Probability Flow ODE (Deterministic Sampling)
- The same SDE defines an associated probability flow Ordinary Differential Equation (ODE):
\[\frac{dx}{dt} = f(x,t) - \frac{1}{2} g(t)^2 s_\theta(x,t)\]
Solving this ODE yields deterministic sampling trajectories that:
- are equivalent to DDIM-style sampling in discrete time,
- preserve the same marginal distributions as the stochastic SDE,
- and enable exact likelihood computation under mild regularity conditions.
Flow Matching Models (Deterministic Continuous-Time Generative Models)
-
Flow Matching models constitute a closely related but distinct paradigm within continuous-time generative modeling. Rather than deriving dynamics from an SDE or learning a score function, flow matching directly learns a deterministic vector field that transports samples from a simple base distribution to the data distribution.
-
This approach was introduced in Flow Matching for Generative Modeling by Lipman et al. (2022) and further developed in works such as Conditional Flow Matching by Tong et al. (2023).
Core Idea
-
Flow matching defines a time-dependent vector field \(v_\theta(x,t)\) such that samples evolve according to the ODE:
\[\frac{dx}{dt} =v_\theta(x,t)\]-
with boundary conditions:
- \(x(0) \sim p_{\text{base}}\) (e.g., Gaussian noise),
- \(x(1) \sim p_{\text{data}}\).
-
-
The model is trained to match the true velocity field of an interpolation between base and data distributions.
Flow Matching Objective
-
A common formulation minimizes a regression loss of the form:
\[L_{\text{FM}} =\mathbb{E}_{x_0,x_1,t} \left[ \left\lVert v_\theta(x_t,t) -\frac{d}{dt}x_t \right\rVert^2 \right]\]- where \(x_t\) lies on a predefined interpolation path between \(x_0\) and \(x_1\).
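With the common linear interpolation path \(x_t = (1-t)x_0 + t x_1\), the target velocity is simply \(x_1 - x_0\), and the objective reduces to a plain regression loss. A minimal sketch, assuming `velocity_model(x_t, t)` is the learned vector field and \(x_0\) is drawn from the Gaussian base distribution:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x1):
    """Flow matching with a linear interpolation path (one common choice).

    x0 ~ N(0, I) is the base sample, x1 is data; along x_t = (1 - t) x0 + t x1
    the target velocity is d x_t / dt = x1 - x0.
    """
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1
    target_velocity = x1 - x0
    pred = velocity_model(x_t, t.view(-1))
    return F.mse_loss(pred, target_velocity)
```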
Relationship to Diffusion and SGMs
- Flow matching can be seen as learning the probability flow ODE directly, without passing through score estimation.
- Unlike SDE-based diffusion, flow matching is fully deterministic and does not inject noise during sampling.
-
Unlike classical normalizing flows, it does not require tractable Jacobian determinants during training.
- As discussed in A Tutorial on Flow Matching by Lilian Weng (2023), flow matching offers a unifying and often simpler alternative to diffusion-based training while retaining continuous-time expressivity.
Pros
- Unified continuous-time view of diffusion, score-based, and ODE-based models
- Flexible trade-offs between stochastic and deterministic generation
- Representation-agnostic and theoretically principled
- Flow matching avoids stochastic sampling and score estimation
Cons
- Requires numerical ODE/SDE solvers and careful time discretization
- Continuous-time formulations are conceptually more abstract
- Flow matching currently has fewer large-scale empirical benchmarks than diffusion
Comparative Analysis
-
Modern generative models based on diffusion and related continuous-time dynamics are best understood through four closely related architectural paradigms, each emphasizing different trade-offs among fidelity, efficiency, determinism, and theoretical generality:
-
Pixel-space diffusion models operate directly on raw high-dimensional data (e.g., image pixels). They offer strong fidelity and interpretability but are computationally expensive and slow to sample from.
-
Latent-space diffusion models apply diffusion in a learned, lower-dimensional representation of the data. This paradigm dramatically improves scalability and efficiency and underpins most modern large-scale generative systems.
-
Continuous-time diffusion models provide a unifying theoretical framework based on stochastic differential equations (SDEs). This formulation connects discrete-time diffusion, score-based modeling, and ODE-based sampling within a single mathematical view.
-
Flow Matching models describe generative modeling as learning a deterministic continuous-time vector field that transports samples from a simple base distribution to the data distribution. While closely related to continuous-time diffusion, flow matching avoids stochastic noise injection and score estimation, offering an alternative deterministic paradigm.
-
-
These paradigms are not mutually exclusive; rather, they occupy different regions of a shared design space:
- Pixel-space vs. latent-space distinguishes where generation occurs.
- Stochastic vs. deterministic dynamics distinguishes how probability mass is transported.
- Discrete-time vs. continuous-time formulations distinguish how the generative process is parameterized and solved.
-
In practice, state-of-the-art systems combine elements from multiple paradigms. A typical modern pipeline uses latent representations for efficiency, DDPM-style discrete-time objectives for stable training, continuous-time theory (via SDEs or ODEs) for principled interpretation and solver design, and—increasingly—deterministic alternatives such as DDIM or flow-matching-style dynamics for fast sampling.
-
This hybrid perspective highlights that modern generative modeling is less about choosing a single model family and more about composing compatible design choices to balance quality, speed, and theoretical clarity.
Pixel-Space Diffusion Models
-
Pixel-space diffusion models apply the generative process directly to high-dimensional data, such as image pixels or raw audio waveforms. In these models, diffusion operates in the original data space, preserving fine-grained details and offering a clear probabilistic interpretation of the generation process.
-
From a design perspective, pixel-space diffusion specifies where the generative dynamics occur, but not how time is parameterized. In principle, pixel-space diffusion can be combined with either discrete-time or continuous-time formulations. Historically and practically, however, pixel-space diffusion models have been dominated by discrete-time stochastic formulations, which established the empirical and conceptual foundation of the field.
-
Pixel-space diffusion models are therefore most naturally organized around discrete-time diffusion, with continuous-time interpretations emerging later as theoretical generalizations rather than as primary implementation choices.
Denoising Diffusion Probabilistic Models (DDPMs)
-
Key idea: Denoising Diffusion Probabilistic Models (DDPMs), introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020), define a fixed discrete-time forward process that gradually adds Gaussian noise to data, along with a learned reverse Markov chain that removes noise step by step. Generation proceeds by iteratively reversing the diffusion process starting from pure Gaussian noise.
-
In the broader taxonomy, DDPMs:
- are discrete-time diffusion models,
- explicitly parameterize stochastic reverse transitions,
- and operate directly in pixel space.
-
DDPMs form the canonical reference point for diffusion-based generative modeling. Most later developments—including latent diffusion, DDIM sampling, and continuous-time SDE formulations—can be understood as extensions, reinterpretations, or accelerations of this core model.
-
Pros:
- Strong probabilistic grounding with an explicit likelihood
- Stable and well-understood training dynamics
- High-fidelity sample generation
-
Cons:
- Extremely slow sampling due to hundreds or thousands of sequential denoising steps
- Computationally expensive at high resolutions
Denoising Diffusion Implicit Models (DDIMs)
-
Key idea: Denoising Diffusion Implicit Models (DDIMs), proposed in Denoising Diffusion Implicit Models by Song et al. (2020), are sampling procedures rather than new model families. DDIMs reuse the same training objective and learned noise predictor as DDPMs, but replace the stochastic reverse process with deterministic or semi-deterministic trajectories that skip diffusion steps while preserving marginal distributions.
-
Conceptually, DDIMs move pixel-space diffusion toward deterministic continuous-time dynamics, foreshadowing later connections to probability-flow ODEs and flow-based generative perspectives.
-
Pros:
- Orders-of-magnitude faster sampling than DDPMs
- No retraining required
- Deterministic trajectories enable reproducibility and smooth interpolation
-
Cons:
- Aggressive timestep skipping can reduce diversity
- Still constrained by pixel-space computation costs
Relationship to Continuous-Time and Flow-Based Models
-
While DDPMs and DDIMs are formulated in discrete time, both admit continuous-time interpretations:
- DDPMs correspond to numerical discretizations of stochastic differential equations.
- DDIMs correspond to deterministic solvers of associated probability-flow ODEs.
-
Pixel-space Score-Based Generative Models and Flow Matching models, which operate directly on pixels but are formulated in continuous time, are therefore best understood as extensions of pixel-space diffusion into the continuous-time regime, rather than as entirely separate pixel-space categories.
-
This perspective highlights pixel-space diffusion as the historical and conceptual starting point from which modern continuous-time and deterministic generative models emerged.
Latent-Space Diffusion Models
-
Latent-space diffusion models apply the generative process in a learned, lower-dimensional representation space rather than directly in pixel space. This latent representation is typically obtained using an autoencoder—most commonly a Variational AutoEncoder (VAE)—which compresses high-dimensional data into a compact and semantically meaningful latent space.
-
From the architectural taxonomy, latent diffusion is a representation-level design choice, orthogonal to the time formulation of the generative process. As a result, latent-space diffusion models can, in principle, be combined with:
- discrete-time stochastic diffusion (e.g., DDPM-style objectives),
- continuous-time stochastic diffusion (via SDEs),
- or deterministic continuous-time dynamics (e.g., probability-flow ODEs or flow-matching-style vector fields).
In practice, however, the vast majority of latent diffusion systems rely on discrete-time DDPM-style training, paired with accelerated samplers derived from continuous-time theory.
Latent Diffusion Models (LDMs)
-
Key idea: Latent Diffusion Models (LDMs), introduced in High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022), decouple representation learning from generative modeling. Data are first encoded into a latent variable \(z\), diffusion is applied to \(z\) rather than to pixels, and the final sample is obtained by decoding the denoised latent back into data space.
-
In the broader taxonomy, LDMs:
- specify where generation occurs (latent space),
- typically use discrete-time diffusion objectives for training,
- and often rely on DDIM or ODE-based samplers for efficient inference.
-
This separation dramatically reduces computational cost while preserving the expressive power of diffusion models, enabling high-resolution and multimodal generation at scale.
Advantages of Latent-Space Diffusion
- Computational efficiency: Diffusion operates in a space that is orders of magnitude lower-dimensional than pixel space, significantly reducing memory usage and runtime.
- Scalability: Enables high-resolution image generation and large-scale multimodal systems that would be impractical in pixel space.
- Modularity: The autoencoder and diffusion model can be trained and improved independently.
Limitations and Trade-offs
- Fidelity bound by representation quality: The final sample quality is constrained by the reconstruction accuracy of the autoencoder.
- Additional modeling complexity: Training and maintaining an encoder–decoder pair introduces extra engineering and optimization challenges.
- Approximation error: Latent compression introduces an irreversible information bottleneck.
Relationship to Continuous-Time and Flow-Based Models
-
Latent-space diffusion models can be interpreted within the same continuous-time SDE framework as pixel-space models. Discrete-time latent diffusion objectives correspond to numerical discretizations of latent-space SDEs.
-
Moreover, recent work has explored deterministic latent-space generative dynamics, including probability-flow ODE solvers and flow-matching-style models, which learn continuous-time vector fields directly in latent space.
-
This makes latent diffusion a natural bridge between practical discrete-time diffusion models and deterministic continuous-time approaches, including flow matching, which can further reduce sampling cost by eliminating stochastic noise injection.
Continuous-Time Diffusion Models (Representation-Agnostic)
-
Continuous-time diffusion models describe generative modeling as the evolution of data under a continuous-time dynamical system, rather than a fixed sequence of discrete noise levels. In this paradigm, generation is formulated over a continuous time variable \(t \in [0,1]\), and the transformation from noise to data is governed by either stochastic differential equations (SDEs) or ordinary differential equations (ODEs).
-
From the architectural taxonomy, continuous-time diffusion models form a unifying theoretical layer that connects and generalizes discrete-time diffusion models. Importantly, this paradigm is representation-agnostic: the same continuous-time formulation applies whether generation is performed in pixel space, latent space, or another learned representation.
-
Within this framework:
- Discrete-time DDPMs arise as numerical discretizations of specific SDEs.
- DDIM sampling corresponds to solving a deterministic ODE (the probability flow ODE) associated with the same SDE.
- Score-Based Generative Models (SGMs) correspond to learning the score function required to solve reverse-time stochastic dynamics.
- Flow Matching models correspond to directly learning a deterministic continuous-time vector field that transports noise to data, bypassing explicit stochastic diffusion.
-
As a result, continuous-time diffusion should be understood not as a competing model family, but as a mathematical foundation that encompasses stochastic diffusion, deterministic diffusion, and flow-based continuous dynamics within a single conceptual framework.
SDE-Based Diffusion Models (Including Score-Based Generative Models)
-
Stochastic Differential Equation (SDE)-based diffusion models were formalized in Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021). This work showed that learning the score function of noisy data distributions is sufficient to define a valid generative process in continuous time.
-
In this formulation, the forward diffusion process is defined by an SDE of the form:
\[dx = f(x,t)\, dt + g(t)\, dW_t\]- where \(W_t\) denotes a standard Wiener process. The forward SDE progressively transforms data into noise, while generation corresponds to solving the associated reverse-time SDE, which depends explicitly on the learned score function:
\[dx = \left[f(x,t) - g(t)^2 s_\theta(x,t)\right] dt + g(t)\, d\bar{W}_t\]- where \(\bar{W}_t\) denotes reverse-time Brownian motion.
This perspective unifies several previously distinct approaches:
- DDPMs correspond to discretizations of variance-preserving SDEs.
- Langevin-based SGMs correspond to stochastic numerical solvers of the reverse-time SDE.
- DDIM-style samplers correspond to deterministic solvers of the associated probability flow ODE.
-
The key insight is that the score function is the central learned object. Once the score is known, both stochastic and deterministic sampling trajectories are fully specified.
Deterministic Dynamics and Probability Flow ODEs
- Every SDE used in diffusion modeling admits an associated probability flow ODE, which shares the same marginal distributions as the stochastic process but evolves deterministically:
\[\frac{dx}{dt} = f(x,t) - \frac{1}{2} g(t)^2 s_\theta(x,t)\]
Solving this ODE yields deterministic trajectories from noise to data that:
- recover DDIM-style sampling in discrete time,
- eliminate stochastic noise injection during sampling,
- and enable exact likelihood computation under mild regularity conditions.
-
This deterministic view of diffusion provides a conceptual bridge between stochastic diffusion models and fully deterministic continuous-time generators.
Flow Matching Models (Deterministic Continuous-Time Generative Modeling)
-
Flow Matching models represent a closely related but distinct approach to continuous-time generative modeling. Introduced in Flow Matching for Generative Modeling by Lipman et al. (2022), flow matching directly learns a deterministic vector field \(v_\theta(x,t)\) that transports samples from a simple base distribution (e.g., Gaussian noise) to the data distribution.
-
Unlike SDE-based diffusion models, flow matching:
- does not require stochastic forward diffusion,
- does not learn a score function,
- and does not rely on reverse-time stochastic dynamics.
-
Instead, flow matching trains a model to satisfy:
\[\frac{dx}{dt} =v_\theta(x,t)\]-
where \(v_\theta\) is optimized to match a target velocity field defined by pairs of noise and data samples. A common training objective minimizes the mean squared error between the predicted velocity and the target velocity:
\[L_{\text{FM}} =\mathbb{E}_{x_0, x_1, t} \left[ \left\lVert v_\theta(x_t, t) -v^\star(x_t, t) \right\rVert^2 \right]\]
-
-
Flow matching can be viewed as learning the probability flow ODE directly, without passing through an intermediate score or diffusion formulation. As such, it occupies the deterministic end of the continuous-time generative modeling spectrum.
Pros and Cons of Continuous-Time Approaches
-
Pros:
- Provides a unifying theoretical framework for DDPMs, DDIMs, SGMs, and flow matching
- Supports both stochastic (SDE-based) and deterministic (ODE-based) generation
- Representation-agnostic and compatible with pixel-space or latent-space modeling
- Enables flexible trade-offs between sample quality, diversity, and sampling speed
-
Cons:
- Requires numerical solvers and careful step-size or tolerance control
- More abstract than purely discrete-time formulations
- Flow matching models may lack the explicit probabilistic interpretation of diffusion-based likelihoods
Tabular Summary
-
Modern diffusion-based generative models are best understood as configurations within a shared design space, rather than as isolated model families. The primary axes along which these models differ are:
- Representation space: where generation occurs (pixel space vs. latent space)
- Time formulation: whether the generative process is discrete-time or continuous-time
- Generative dynamics: whether sampling is stochastic (noise-injecting) or deterministic
- Learned quantity: noise, score, or velocity field
-
The table below summarizes how the major approaches fit into this taxonomy, including Flow Matching as a distinct deterministic continuous-time paradigm.
| Model / Method | Representation Space | Time Formulation | Sampling Dynamics | Learned Object | Key Trade-offs |
|---|---|---|---|---|---|
| DDPM | Pixel space | Discrete-time | Stochastic (Markovian) | Noise $$\epsilon$$ | High fidelity, explicit likelihood, very slow sampling |
| DDIM | Pixel or latent space | Discrete-time | Deterministic / semi-stochastic | Noise $$\epsilon$$ | Fast sampling, possible loss of diversity |
| Latent Diffusion (LDM) | Latent space | Discrete-time (typically) | DDIM or ODE-based | Noise $$\epsilon$$ | Scalable and efficient, bounded by autoencoder fidelity |
| Score-Based Generative Models (SGMs) | Pixel or latent space | Continuous-time | Stochastic (reverse SDE / Langevin) | Score $$\nabla_x \log p_t(x)$$ | Theoretically principled, slow and solver-sensitive |
| SDE-Based Diffusion | Representation-agnostic | Continuous-time | Stochastic (SDE) or deterministic (ODE) | Score $$s_\theta(x,t)$$ | Unifying framework, higher conceptual complexity |
| Flow Matching | Pixel or latent space | Continuous-time | Deterministic (ODE) | Velocity field $$v_\theta(x,t)$$ | Very fast sampling, weaker probabilistic grounding |
-
Interpretation and Takeaways
- DDPMs define the original discrete-time stochastic diffusion formulation.
- DDIMs are accelerated samplers, not separate models.
- Latent diffusion modifies the representation, not the diffusion theory.
- SGMs are continuous-time, score-learning instantiations of diffusion.
- SDE-based diffusion provides the unifying mathematical framework.
- Flow Matching occupies the deterministic extreme of continuous-time generative modeling, learning transport dynamics directly rather than via noise or score estimation.
-
In practice, modern systems often blend these ideas: latent representations for scalability, DDPM-style objectives for stable training, DDIM or ODE solvers for fast inference, continuous-time theory for interpretation, and—emerging increasingly—flow-matching-style objectives for fully deterministic generation.
Training
- Training diffusion models is grounded in variational inference and aims to learn a parametric approximation to the reverse diffusion process that maximizes the likelihood of observed data. This section presents a high-level yet rigorous view of the training objective, clarifying how likelihood maximization, KL divergences, and simplified denoising objectives fit together, while avoiding overlap with the lower-level derivations covered elsewhere.
Likelihood Maximization and the Variational Objective
- A diffusion model defines a joint distribution over observed data \(\mathbf{x}_0\) and latent variables \(\mathbf{x}_{1:T}\) through a learned reverse process and a fixed forward noising process. Training seeks to maximize the marginal likelihood of the data:
\[p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}\]
Direct maximization of this likelihood is intractable due to the presence of latent variables. Instead, diffusion models optimize a variational lower bound (ELBO) on the log-likelihood, following the formulation introduced in Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Sohl-Dickstein et al. (2015) and refined in Denoising Diffusion Probabilistic Models by Ho et al. (2020).
-
In practice, training minimizes the negative ELBO, commonly referred to in the diffusion literature as the variational lower bound loss:
\[\mathbb{E}\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q \left[ -\log \frac{ p_\theta(\mathbf{x}_{0:T}) }{ q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) } \right] := L_{\mathrm{VLB}}\]-
where:
- \(p_\theta(\mathbf{x}_{0:T})\) is the learned joint distribution defined by the reverse process,
- \(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\) is the fixed forward (noising) process,
- \(L_{\mathrm{VLB}}\) is minimized during training.
-
-
Although \(L_{\mathrm{VLB}}\) is technically an upper bound on the negative log-likelihood, the terminology is preserved to remain consistent with the ELBO convention widely used in variational inference literature.
Role of KL Divergences in Diffusion Training
-
A defining advantage of diffusion models is that both the forward and reverse transition distributions are modeled as Gaussian distributions. This allows the variational objective to be decomposed into a sum of Kullback–Leibler (KL) divergence terms, each of which admits a closed-form analytical expression.
-
Expressing the objective in terms of KL divergences provides both theoretical clarity and practical efficiency, as it avoids high-variance Monte Carlo estimators and enables stable optimization.
Recap: KL Divergence
-
The Kullback–Leibler divergence is a fundamental quantity in information theory that measures how one probability distribution diverges from another reference distribution. For continuous distributions, it is defined as:
\[D_{\mathrm{KL}}(P \mid\mid Q) =\int_{-\infty}^{\infty} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx\]-
where:
- \(P\) and \(Q\) are probability distributions over a continuous variable \(x\),
- \(p(x)\) and \(q(x)\) denote their corresponding density functions,
- the logarithmic term compares the relative likelihood assigned by \(P\) and \(Q\) at each point \(x\).
-
-
The KL divergence has several important properties that are directly relevant to diffusion models:
- Non-negativity: \(D_{\mathrm{KL}}(P \mid\mid Q) \ge 0\), with equality if and only if \(P = Q\) almost everywhere.
- Asymmetry: \(D_{\mathrm{KL}}(P \mid\mid Q) \neq D_{\mathrm{KL}}(Q \mid\mid P)\) in general. This asymmetry reflects the fact that KL divergence measures the inefficiency of using \(Q\) to approximate \(P\), not a symmetric distance between them.
- Information-theoretic interpretation: KL divergence quantifies the expected excess code length incurred when encoding samples from \(P\) using a code optimized for \(Q\).
-
-
In diffusion models, this asymmetry is intentional and meaningful: the KL terms penalize discrepancies between the true posterior distributions induced by the forward diffusion process and the learned reverse-time transition distributions.
-
Because these distributions are Gaussian, the KL divergence between them can be computed exactly, which is a key reason diffusion models scale effectively to high-dimensional data.
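As an illustration, the KL divergence between two diagonal Gaussians has a closed form that diffusion training exploits. A small helper (illustrative; it operates on means and log-variances and sums over all non-batch dimensions) might look like:

```python
import torch

def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians.

    Per-dimension: 0.5 * (logvar2 - logvar1 + (var1 + (mu1 - mu2)^2) / var2 - 1)
    """
    kl = 0.5 * (
        logvar2 - logvar1
        + (logvar1.exp() + (mu1 - mu2) ** 2) / logvar2.exp()
        - 1.0
    )
    return kl.flatten(1).sum(dim=1)                              # one KL value per example
```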
-
The intuition behind KL divergence is illustrated below. The blue curve represents a varying distribution \(P\), the red curve is a reference distribution \(Q\), and the green curve shows the integrand of the KL expression. The total shaded area corresponds to the KL divergence value.

Decomposition of the Variational Lower Bound
-
Leveraging the Markov structure of the diffusion process, the variational loss can be decomposed into a sum of per-timestep terms:
\[L_{\mathrm{VLB}} =L_0 + \sum_{t=1}^{T-1} L_t + L_T\]-
where:
\[\begin{aligned} L_0 &= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1), \\ L_t &= D_{\mathrm{KL}} \big( q(\mathbf{x}_{t} \mid \mathbf{x}_{t+1}, \mathbf{x}_0) \mid\mid p_\theta(\mathbf{x}_{t} \mid \mathbf{x}_{t+1}) \big), \\ L_T &= D_{\mathrm{KL}} \big( q(\mathbf{x}_T \mid \mathbf{x}_0) \mid\mid p(\mathbf{x}_T) \big) \end{aligned}\]
-
-
This decomposition, derived in Denoising Diffusion Probabilistic Models by Ho et al. (2020), highlights that:
- all KL terms are analytically tractable,
- the final term \(L_T\) becomes constant when the noise schedule is fixed,
- training primarily focuses on aligning the learned reverse transitions with the true diffusion posteriors.
Simplified Training via Noise Prediction
-
While the full variational objective provides theoretical grounding, Ho et al. (2020) observed that it can be greatly simplified without sacrificing performance.
-
Instead of directly parameterizing reverse transition distributions, the model is trained to predict the noise added during the forward process. This yields a denoising objective closely related to denoising score matching, as formalized in Generative Modeling by Estimating Gradients of the Data Distribution by Song and Ermon (2019).
-
The resulting training objective is:
\[L_{\mathrm{simple}}(\theta) =\mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}} \left[ \left\lVert \boldsymbol{\epsilon} -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\rVert^2 \right]\]- This objective avoids explicit density estimation, yields stable gradients, and empirically matches or exceeds the performance of the full ELBO.
Interpretation and Practical Implications
-
From a training perspective, diffusion models combine:
- a principled likelihood-based foundation,
- a decomposition into closed-form KL divergences,
- and a simple, robust denoising objective used in practice.
-
This balance between theoretical rigor and empirical effectiveness is a central reason diffusion models have become the dominant paradigm in modern generative modeling, as discussed in A Diffusion Model Primer and Diffusion Models Beat GANs on Image Synthesis by Dhariwal and Nichol (2021).
Model Choices
-
With the training objective established, the practical implementation of a diffusion model requires several architectural and design decisions. These choices determine the model’s expressivity, computational efficiency, and sampling behavior. Importantly, diffusion models are unusually flexible: the probabilistic framework places minimal constraints on the neural architecture, requiring only that inputs and outputs share the same dimensionality.
-
This section outlines the major modeling decisions involved in building a diffusion system, focusing on variance scheduling, reverse-process parameterization, and network architecture, while situating these choices within the broader diffusion literature.
Forward Process Design and the Role of the Noise Schedule
-
The forward diffusion process is fixed and non-learned in most practical systems. Its primary design choice is the variance (noise) schedule, which controls how rapidly information is destroyed over time.
-
In discrete-time diffusion models, this schedule is defined by a sequence:
\[\{\beta_1, \beta_2, \ldots, \beta_T\}\]- where each \(\beta_t\) determines the amount of Gaussian noise added at timestep \(t\).
-
Early diffusion models used simple linear schedules, as proposed in Denoising Diffusion Probabilistic Models by Ho et al. (2020). Later work showed that alternative schedules—such as cosine schedules—improve sample quality and training stability, as demonstrated in Improved Denoising Diffusion Probabilistic Models by Nichol and Dhariwal (2021).
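Both schedules are easy to construct. The sketch below follows the linear schedule from Ho et al. (2020) and the \(\bar{\alpha}_t\)-based cosine schedule from Nichol and Dhariwal (2021); the default constants are the commonly used ones and should be treated as assumptions:

```python
import math
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule as used in the original DDPM setup."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule from Nichol & Dhariwal (2021), derived from alpha_bar."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()
```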
-
Because the forward process is fixed, the KL divergence term associated with the final timestep becomes constant with respect to model parameters. As a result, it does not influence gradient-based optimization and can be ignored during training.
-
In continuous-time formulations, the discrete schedule generalizes to a time-dependent noise rate \(\beta(t)\), preserving the same conceptual role within a stochastic differential equation framework (Score-Based Generative Modeling through Stochastic Differential Equations by Song et al. (2021)).
Reverse Process Parameterization
-
The reverse diffusion process is learned and defines the generative capacity of the model. Each reverse transition is parameterized as a Gaussian distribution whose parameters are predicted by a neural network.
-
In practice, most diffusion models adopt a restricted covariance parameterization, fixing the covariance to a diagonal matrix determined by the noise schedule. This simplifies optimization and yields stable training, as empirically validated in Denoising Diffusion Probabilistic Models by Ho et al. (2020).
-
Under this design, learning focuses primarily on predicting the mean of the reverse transition, which can be equivalently expressed through alternative parameterizations:
- Noise prediction (predicting \(\epsilon\)),
- Data prediction (predicting \(x_0\)),
- Velocity prediction (\(v\)-prediction), introduced in Progressive Distillation for Fast Sampling of Diffusion Models by Salimans and Ho (2022).
-
Although mathematically equivalent under Gaussian assumptions, these parameterizations differ in numerical stability and empirical performance. Noise prediction remains the most widely used due to its simplicity and robustness.
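For reference, the parameterizations are related by simple algebra given \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon\). The helpers below sketch two of the conversions, taking the velocity target as \(v = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1-\bar{\alpha}_t}\, x_0\); the function names are illustrative:

```python
def eps_to_x0(x_t, eps, alpha_bar_t):
    """Recover the x0-prediction implied by a noise prediction."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

def velocity_target(x0, eps, alpha_bar_t):
    """Velocity target v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return alpha_bar_t ** 0.5 * eps - (1 - alpha_bar_t) ** 0.5 * x0
```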
Neural Network Architectures
-
Diffusion models impose only one structural requirement on the neural network: input and output dimensionalities must match. This flexibility allows a wide range of architectures to be used.
-
In practice, U-Net–style architectures dominate diffusion modeling due to their ability to capture multi-scale spatial structure and propagate information across resolutions. This design choice was popularized in Denoising Diffusion Probabilistic Models by Ho et al. (2020) and refined in subsequent work.
-
Key architectural features commonly employed include:
- multi-resolution convolutional blocks,
- skip connections for stable gradient flow,
- explicit timestep or noise-level embeddings,
- attention layers for long-range dependency modeling.
-
In latent diffusion models, the same architectural principles apply, but diffusion operates on compressed latent representations, as introduced in High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022).
Conditioning and Guidance Mechanisms
-
Many practical diffusion systems are conditional generative models, incorporating auxiliary information such as class labels or text embeddings.
-
Conditioning is typically implemented by injecting additional embeddings into the network via concatenation, cross-attention, or feature-wise modulation. This approach underlies text-to-image systems such as:
- GLIDE by Nichol et al. (2022),
- Imagen by Saharia et al. (2022),
- Stable Diffusion by Rombach et al. (2022).
-
Classifier-free guidance, introduced in Classifier-Free Diffusion Guidance by Ho and Salimans (2022), further improves controllability by interpolating between conditional and unconditional predictions during sampling.
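At sampling time, classifier-free guidance amounts to a weighted combination of two forward passes through the same denoiser. A hedged sketch, assuming the model accepts a conditioning input and treats `None` as the unconditional branch (real systems often use a learned null embedding instead):

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    eps_cond = model(x_t, t, cond)                               # conditional prediction
    eps_uncond = model(x_t, t, None)                             # unconditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```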
Design Trade-offs
-
The design space of diffusion models is characterized by a clear separation of concerns:
- Forward process: fixed, analytically tractable, defined by a noise schedule.
- Reverse process: learned, parameterized by neural networks.
- Architecture: flexible, with U-Nets as the dominant choice.
- Parameterization: multiple equivalent formulations with different practical properties.
-
This modularity is a key strength of diffusion models, enabling rapid experimentation, theoretical analysis, and integration with emerging frameworks such as continuous-time diffusion and flow-matching models.
Network Architecture: U-Net and Diffusion Transformer (DiT)
-
Diffusion models are a class of generative models that learn to produce high-fidelity data by simulating a noising (forward) and denoising (reverse) process. At a high level, these models include a neural network denoiser that takes in a noisy input at timestep \(t\) and predicts the noise that was added. Because the network must output a tensor of the same shape and spatial resolution as the input image, architectural choices that preserve spatial dimensions and capture both local and global structure are vital.
-
In practice, two dominant architectures have emerged for this denoising network:
- U-Net-based diffusion networks, which rely on spatial convolutions and hierarchical encoding/decoding with skip connections.
- Diffusion Transformers (DiTs), which replace convolutions with attention modules to capture long-range dependencies.
-
Both architectures aim to model the same denoising function:
\[\epsilon_{\theta}(\mathbf{x}_t, t) \approx \epsilon\]- where \(\mathbf{x}_t\) is the noisy image, \(\epsilon\) is the true noise added at step \(t\), and \(\epsilon_{\theta}\) is the network’s prediction.
-
This section describes each architecture in detail, including structural insights and how they are implemented in the context of diffusion models.
U-Net-Based Diffusion Models
-
The canonical implementation for image diffusion models uses U-Net as the backbone denoising network due to its flexibility in processing spatial information across scales. This architecture was adopted in the seminal paper on diffusion models, Denoising Diffusion Probabilistic Models by Jonathan Ho et al. (2020), which demonstrated state-of-the-art image generation performance on benchmarks like CIFAR-10 and LSUN.
-
Key architectural elements of U-Net:
-
Encoder–Decoder Structure: U-Net consists of a sequence of downsampling (encoder) layers that progressively reduce spatial resolution and increase feature abstraction, followed by a symmetric sequence of upsampling (decoder) layers that reconstruct the original resolution. This structure facilitates effective multiscale feature extraction.
-
Skip Connections: Direct connections between corresponding encoder and decoder layers pass fine-grained feature maps forward. This mitigates information loss during downsampling and enables precise reconstruction during upsampling. This bypassing of bottleneck information is critical for high-quality image reconstruction.
-
Bottleneck Layer: The central bottleneck compresses the learned representation and helps the model focus on essential features that are robust to noise, serving as a compact summary of the input.
-
Residual Blocks & Time Embedding: Modern U-Net variants used in diffusion models often incorporate ResNet-like blocks with time and class conditioning embedded into the network via sinusoidal or learned embeddings, enabling the model to reason about temporal noise levels.
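The timestep conditioning mentioned above is typically injected via sinusoidal embeddings added to intermediate features. A minimal sketch of such an embedding (the embedding dimension is assumed even, and the exact frequency convention varies between implementations):

```python
import math
import torch

def timestep_embedding(t, dim, max_period=10000):
    """Sinusoidal timestep embedding in the style used by diffusion U-Nets.

    t:   integer timesteps, shape (B,)
    dim: embedding dimension (assumed even)
    """
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, device=t.device) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)           # (B, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=1)  # (B, dim)
```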
-
Because the denoiser must predict noise at each diffusion step, the output of the U-Net must have the same dimensions as its input:
\[\epsilon_\theta(\mathbf{x}_t, t) \in \mathbb{R}^{H \times W \times C} \quad \text{for} \quad \mathbf{x}_t \in \mathbb{R}^{H \times W \times C}\]- This is a key structural constraint that drives the choice toward architectures that preserve spatial size.
-
-
Loss Function in U-Net Diffusion:
-
In DDPM training, the network is trained with a simple mean-squared error (MSE) loss between the predicted noise and actual noise:
\[L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_0,\epsilon}\left[\lVert \epsilon - \epsilon_{\theta}(\mathbf{x}_t,t)\rVert^2\right]\]-
where:
\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\]- with \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\) the cumulative product of the noise schedule, which determines how much noise has been added by step \(t\).
-
-
This loss encourages the U-Net to predict the true Gaussian noise component added at each diffusion step.
-
Diffusion Transformer (DiT)
-
While U-Nets excel at capturing local spatial structure, self-attention mechanisms introduced by transformer models can capture long-range dependencies across the image. Scalable Diffusion Models with Transformers by William Peebles et al. (2022) introduced DiTs, a family of architectures that replaces the U-Net backbone with transformer blocks to process images in a tokenized or patchified manner.
-
Key features of DiTs include:
-
Patchify and Tokenize: Images are split into patches (similar to Vision Transformers), flattened, and embedded into a sequence of tokens that can be processed by transformer layers.
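Patchification itself is a pure reshaping operation. A minimal sketch, assuming square patches that evenly divide the spatial dimensions:

```python
import torch

def patchify(x, patch_size=16):
    """Split images into flattened patch tokens, as in ViT/DiT-style backbones.

    x: (B, C, H, W) with H and W divisible by patch_size
    returns: (B, N, C * patch_size**2) with N = (H // patch_size) * (W // patch_size)
    """
    B, C, H, W = x.shape
    p = patch_size
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)                              # (B, H/p, W/p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)
```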
-
Multi-Head Self-Attention: Each transformer layer employs multi-head attention to aggregate context across all tokens, enabling the model to inherently reason about interactions between distant spatial locations.
-
Feed-Forward Modules: After attention, feed-forward networks (FFNs) apply nonlinear transformations to the attention outputs, enhancing representational capacity.
-
Conditioning for Time and Labels: Additional learned positional embeddings and timestep embeddings are concatenated or added to patch tokens to condition the transformer on the diffusion step and optional class labels.
-
-
DiTs generalize the idea that diffusion models can be implemented with attention-based backbones, showing competitive or superior performance to convolutional U-Nets on high-resolution benchmarks, especially when models are scaled up.
-
Loss Function in DiTs:
-
The objective for DiTs remains analogous to U-Net diffusion:
\[L_{\text{DiT}}(\theta)=\mathbb{E}_{t,\mathbf{x}_0,\epsilon} \left[\lVert \epsilon - \epsilon_{\theta}(\mathbf{x}_t,t)\rVert^2\right]\]- where \(\epsilon_{\theta}\) is now parameterized by a transformer architecture trained to predict noise from a sequence of patch tokens.
-
Comparison Between U-Net and Diffusion Transformer (DiT) Architectures
-
While both U-Net-based models and DiTs are trained to approximate the same denoising function in diffusion models, they differ substantially in their inductive biases, computational characteristics, and scaling behavior. These differences have important implications for model performance, training efficiency, and applicability across data modalities.
-
At a high level, both architectures aim to learn:
\[\epsilon_{\theta}(\mathbf{x}_t, t) \approx \epsilon\]- but the way spatial and contextual information is processed differs significantly.
Inductive Bias and Representation Learning
-
U-Net architectures introduce a strong spatial inductive bias through convolutional operations. Convolutions assume locality and translation invariance, which aligns well with the statistics of natural images. This makes U-Nets particularly effective at modeling fine-grained texture and local structure, even with limited data.
-
This inductive bias was leveraged in Denoising Diffusion Probabilistic Models by Ho et al. (2020), where a convolutional U-Net achieved high-quality image generation without requiring massive datasets or model sizes.
-
In contrast, Diffusion Transformers remove most spatial assumptions and instead rely on self-attention to learn relationships directly from data. Each token can attend to every other token, allowing the model to capture global dependencies explicitly. This design follows the philosophy of transformers introduced in Attention Is All You Need by Vaswani et al. (2017), and adapted to diffusion models in Scalable Diffusion Models with Transformers by Peebles et al. (2022).
-
Formally, self-attention computes:
\[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]- allowing DiTs to model long-range spatial correlations that convolutional filters must approximate hierarchically.
Model Complexity and Parameter Scaling
-
A major practical difference lies in how computational complexity scales with input size.
- U-Net Complexity:
- Convolutional layers scale approximately linearly with the number of pixels:
\[O\left(H \cdot W \cdot C^2 \cdot k^2\right)\]- where \(H\), \(W\), and \(C\) denote image height, width, and channels, and \(k\) is the kernel size. This makes U-Nets efficient for high-resolution images.
- DiT Complexity:
- Self-attention scales quadratically with the number of tokens:
\[\mathcal{O}\left(N^{2} \cdot d\right)\]
- where \(N\) is the number of image patches and \(d\) is the embedding dimension. As image resolution increases, this cost grows rapidly, motivating patch-based representations and large-scale compute.
- Despite this cost, Peebles et al. (2022) show that DiTs scale more predictably with model size and dataset size, similar to large language models, often surpassing U-Nets when trained at sufficient scale.
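- As a rough illustration of these scaling trends, the back-of-the-envelope sketch below (with a hypothetical channel count, patch size, and embedding dimension) compares how the per-layer cost of a convolution and of global self-attention grows with resolution:
# Rough, illustrative per-layer cost comparison (hypothetical sizes, not a benchmark)
def conv_cost(H, W, C, k=3):
    # one convolutional layer: O(H * W * C^2 * k^2)
    return H * W * C * C * k * k

def attention_cost(H, W, patch=8, d=512):
    # one self-attention layer over N = (H / patch) * (W / patch) tokens: O(N^2 * d)
    N = (H // patch) * (W // patch)
    return N * N * d

for res in (64, 128, 256, 512):
    print(f"{res}px  conv ~{conv_cost(res, res, 256):.1e}  attention ~{attention_cost(res, res):.1e}")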
Training Stability and Optimization
-
U-Net-based diffusion models are generally easier to train due to:
- Well-understood convolutional behavior
- Stable gradient flow from skip connections
- Fewer hyperparameters tied to sequence modeling
-
As a result, they are often preferred in low-resource or rapid-iteration settings, such as early research prototyping or smaller datasets.
-
DiT models, on the other hand, require more careful optimization strategies. Training typically involves:
- Large batch sizes
- Learning-rate warmup schedules
- Gradient clipping
- Weight initialization tuned for transformers
-
These techniques mirror best practices from transformer training, as discussed in Scalable Diffusion Models with Transformers by Peebles et al. (2022).
Flexibility Across Modalities
-
A key advantage of DiTs is their architectural generality. Because transformers operate on sequences rather than grids, DiTs can be adapted to:
- Images
- Videos
- Point clouds
- Multimodal representations
-
U-Nets, by contrast, are most naturally suited for grid-structured data such as images and volumetric inputs.
-
This flexibility aligns DiTs with a broader trend toward foundation diffusion models, where a single architecture can be reused across domains, similar to the role of transformers in NLP and multimodal learning.
Trade-offs
-
U-Net diffusion models excel at:
- Efficient high-resolution image generation
- Stable training with limited data
- Strong spatial inductive bias
-
DiTs excel at:
- Modeling global dependencies
- Scaling with data and model size
- Generalization across modalities
-
Both architectures optimize essentially the same diffusion loss:
\[L(\theta) = \mathbb{E}_{t,\mathbf{x}_0,\epsilon} \left[\lVert \epsilon - \epsilon_{\theta}(\mathbf{x}_t,t)\rVert^2\right]\]- but differ in how \(\epsilon_{\theta}\) is parameterized and learned.
Reverse Process of U-Net-Based Diffusion Models
- The reverse (denoising) process is the core generative mechanism in diffusion models. Starting from pure Gaussian noise, the model iteratively applies learned conditional distributions to gradually recover a clean image. This formulation is introduced and formalized in Denoising Diffusion Probabilistic Models by Ho et al. (2020), which defines the reverse process as a learned approximation to the true posterior of the forward noising process.
Markovian Reverse Diffusion Process
-
The forward process is defined as a fixed Markov chain that adds Gaussian noise at each timestep:
\[q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)\]- where \(\beta_t \in (0,1)\) is a variance schedule.
-
The reverse process is parameterized as another Markov chain:
\[p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t,t), \Sigma_\theta(\mathbf{x}_t,t)\right)\]- where the mean \(\mu_\theta\) is learned using a U-Net denoiser, and the variance \(\Sigma_\theta\) is often fixed or learned depending on the variant. This parameterization follows the variational framework of diffusion models and is analyzed in further depth in Variational Diffusion Models by Kingma et al. (2021).
Noise Prediction Parameterization
-
Instead of directly predicting \(\mu_\theta\), Denoising Diffusion Probabilistic Models by Ho et al. (2020) showed that predicting the noise \(\epsilon\) yields superior empirical performance. Given:
\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon\]- the network is trained to predict \(\epsilon_\theta(\mathbf{x}_t,t)\), and the mean is reconstructed as:
\[\mu_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(\mathbf{x}_t,t) \right)\]
This reparameterization simplifies training and leads to the widely used noise-prediction objective.
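- As a small illustration (a sketch using the notation above; the full sampling loop appears in the PyTorch walkthrough later in this primer), the reverse mean can be computed from the predicted noise as:
import torch

def reverse_mean(x_t, eps_pred, alpha_t, alpha_bar_t, beta_t):
    # mu_theta(x_t, t) = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred)
    # alpha_t, alpha_bar_t, beta_t are scalar tensors for the current timestep t
    return (x_t - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)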
Discrete Likelihood at the Final Reverse Step
-
At the final timestep \(t=1\), the model must map a continuous Gaussian distribution back to a discrete image space, since images consist of integer-valued pixels. This discretization step is crucial for computing the exact likelihood term in the variational lower bound.
-
Following Denoising Diffusion Probabilistic Models by Ho et al. (2020), the reverse distribution for the final step is defined as:
\[p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) =\prod_{i=1}^{D} p_\theta(x_0^i \mid x_1^i)\]- where \(D\) is the total number of pixels (including channels).
-
Each pixel is modeled using a univariate Gaussian:
\[\mathcal{N}\left(x;\mu_\theta^i(\mathbf{x}_1,1),\sigma_1^2\right)\]- which arises from the diagonal covariance assumption \(\Sigma_\theta(\mathbf{x}_1,1) = \sigma_1^2 \mathbf{I}\).
Bucket-Based Discretization of Pixel Values
- Images are assumed to take integer values in \(\{0,\dots,255\}\), linearly scaled to \([-1,1]\). For a scaled pixel value \(x_0^i\), probability mass is assigned by integrating the Gaussian over the bucket centered at that value:
\[p_\theta(x_0^i \mid x_1^i) = \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\mu_\theta^i(\mathbf{x}_1,1),\sigma_1^2\right) dx\]
- where the bucket edges are:
\[\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}\]
- The full likelihood becomes:
\[p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\mu_\theta^i(\mathbf{x}_1,1),\sigma_1^2\right) dx\]
Visualization of Discrete Likelihood Buckets
- The following figure illustrates the discretization process. The red curve shows the Gaussian distribution for a pixel at timestep \(t=1\), while the shaded regions represent the probability mass assigned to discrete pixel values at \(t=0\).

- The first and last buckets extend to \(-\infty\) and \(+\infty\), ensuring that the total probability mass sums to one.
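- A minimal sketch of this bucket integration (assuming pixel values scaled to \([-1, 1]\); tensor shapes and the clamping constant are illustrative choices):
import torch
from torch.distributions import Normal

def discretized_gaussian_log_likelihood(x0, mu, sigma):
    # x0: clean image scaled to [-1, 1] on a 256-level grid; mu, sigma: per-pixel Gaussian parameters
    dist = Normal(mu, sigma)
    # the first and last buckets extend to -inf and +inf so the probability mass sums to one
    upper = torch.where(x0 > 0.999, torch.full_like(x0, float('inf')), x0 + 1.0 / 255)
    lower = torch.where(x0 < -0.999, torch.full_like(x0, float('-inf')), x0 - 1.0 / 255)
    bucket_prob = dist.cdf(upper) - dist.cdf(lower)   # integrate the Gaussian over each pixel's bucket
    return torch.log(bucket_prob.clamp(min=1e-12))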
Contribution to the Variational Lower Bound
- This discrete likelihood term forms the only component of the variational lower bound that is not a KL divergence:
\[L_0 = -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\]
-
The full training objective, known as the variational lower bound (VLB), is derived in Denoising Diffusion Probabilistic Models by Ho et al. (2020) and later refined in Improved Denoising Diffusion Probabilistic Models by Nichol et al. (2021).
-
Reverse Process of DiT-Based Diffusion Models
- Diffusion Transformer (DiT) models follow the same probabilistic reverse diffusion framework as U-Net-based models, but differ fundamentally in how the denoising function is parameterized. Rather than operating directly on spatial feature maps with convolutions, DiTs use transformer blocks to model global interactions between image regions via self-attention. This approach was introduced in Scalable Diffusion Models with Transformers by Peebles et al. (2022).
Shared Probabilistic Formulation
-
As with all diffusion models, the reverse process is defined as a Markov chain:
\[p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) =\mathcal{N}\left( \mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t,t), \Sigma_\theta(\mathbf{x}_t,t) \right)\]-
where the mean \(\mu_\theta\) is implicitly defined via the model’s noise prediction:
\[\epsilon_\theta(\mathbf{x}_t,t)\]
-
-
The forward process remains unchanged:
\[\mathbf{x}_t =\sqrt{\bar{\alpha}_t}\mathbf{x}_0 +\sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})\]- as introduced in Denoising Diffusion Probabilistic Models by Ho et al. (2020). The distinction lies entirely in how \(\epsilon_\theta\) is implemented.
Image Tokenization and Latent Representation
-
In DiT models, images are first divided into non-overlapping patches, analogous to Vision Transformers. Given an image \(\mathbf{x}_t \in \mathbb{R}^{H \times W \times C}\), it is split into \(N\) patches of size \(P \times P\), producing a sequence:
\[\mathbf{z}_t \in \mathbb{R}^{N \times d}\]- where \(d\) is the embedding dimension after linear projection.
-
Each patch embedding is augmented with:
- Positional embeddings
- Timestep embeddings
- Optional class embeddings (for class-conditional generation)
-
This design closely follows transformer conditioning strategies introduced in Attention Is All You Need by Vaswani et al. (2017) and adapted for diffusion in Peebles et al. (2022).
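- A minimal patch-embedding sketch is shown below (dimensions such as the patch size and embedding width are illustrative assumptions, not the DiT reference implementation):
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_channels=3, patch_size=8, embed_dim=512):
        super().__init__()
        # a strided convolution both extracts non-overlapping P x P patches and projects them to d dims
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, d, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, d) with N = (H/P) * (W/P)

tokens = PatchEmbed()(torch.randn(2, 3, 64, 64))
print(tokens.shape)  # torch.Size([2, 64, 512])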
Transformer-Based Denoising Dynamics
-
The core of the reverse process consists of stacked transformer blocks. Each block applies multi-head self-attention followed by a feed-forward network:
\[\mathbf{H}^{(l+1)} =\text{FFN}\left( \text{MHA}\left(\mathbf{H}^{(l)}\right) \right)\]- where multi-head attention is defined as:
\[\text{MHA}(\mathbf{H}) = \text{Concat}\left(\text{head}_1,\dots,\text{head}_h\right) W^O\]
- with each head computed as:
\[\text{head}_i =\text{softmax}\left( \frac{QW_i^Q (KW_i^K)^\top}{\sqrt{d_k}} \right) VW_i^V\]
-
Through attention, each patch can condition its denoising prediction on all other patches, allowing DiTs to capture global image structure more directly than convolutional architectures. This property becomes especially important at high resolutions, as shown empirically in Scalable Diffusion Models with Transformers by Peebles et al. (2022).
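- A minimal sketch of one such block (pre-norm residual layout, using PyTorch's built-in multi-head attention; DiT's adaptive layer-norm conditioning is omitted for brevity, and the dimensions are illustrative):
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, h):
        # every patch token attends to every other token (global receptive field)
        x = self.norm1(h)
        attn_out, _ = self.attn(x, x, x)
        h = h + attn_out
        return h + self.ffn(self.norm2(h))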
Noise Prediction and Reconstruction
- After passing through the transformer layers, the token sequence is projected back into patch space and unpatchified to reconstruct a full-resolution noise estimate \(\epsilon_\theta(\mathbf{x}_t,t) \in \mathbb{R}^{H \times W \times C}\).
As in U-Net-based models, this predicted noise is used to compute the reverse mean:
\[\mu_\theta(\mathbf{x}_t,t) =\frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t -\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t,t) \right)\]- ensuring that DiTs remain fully compatible with standard diffusion samplers such as DDPM and DDIM, as described in Denoising Diffusion Implicit Models by Song et al. (2020).
Training Objective
- Despite the architectural differences, DiT models are trained with the same simplified noise-prediction loss as U-Net diffusion models:
\[L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_0,\epsilon} \left[\lVert \epsilon - \epsilon_{\theta}(\mathbf{x}_t,t)\rVert^2\right]\]
- This objective arises from minimizing a variational bound on the negative log-likelihood, as shown in Denoising Diffusion Probabilistic Models by Ho et al. (2020) and further analyzed in Improved Denoising Diffusion Probabilistic Models by Nichol et al. (2021).
Sampling Behavior and Scaling Properties
-
One of the key findings of Peebles et al. (2022) is that DiTs exhibit smoother and more predictable scaling behavior than U-Nets as model size and dataset size increase. This mirrors trends observed in large language models and suggests that attention-based diffusion architectures may be better suited for foundation-scale generative modeling.
-
However, the quadratic cost of attention imposes practical constraints on resolution and patch size, motivating hybrid approaches and latent-space diffusion methods explored in High-Resolution Image Synthesis with Latent Diffusion Models by Rombach et al. (2022).
Final Objective
- A central contribution of Denoising Diffusion Probabilistic Models by Ho et al. (2020) is the observation that diffusion model training can be significantly simplified by predicting the noise component added at each timestep, rather than directly predicting the clean image or the reverse-process mean. This insight leads to a remarkably simple and stable training objective that underpins most modern diffusion models.
From Variational Lower Bound to Simplified Loss
-
Diffusion models are trained by maximizing a VLB on the data log-likelihood:
\[\log p_\theta(\mathbf{x}_0) \ge -L_{\mathrm{VLB}}\]-
where the VLB decomposes into a sum of KL divergence terms across timesteps and a final reconstruction term:
\[L_{\mathrm{VLB}} =\mathbb{E}_q\left[L_0 +\sum_{t=2}^{T} L_t +L_T \right]\]
-
-
Most terms in this decomposition take the form:
\[L_t =\mathrm{KL} \big( q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \mid\mid p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) \big)\]- as derived in Denoising Diffusion Probabilistic Models by Ho et al. (2020) and refined in Improved Denoising Diffusion Probabilistic Models by Nichol et al. (2021).
Noise Prediction Parameterization
-
Rather than learning \(\mu_\theta(\mathbf{x}_t,t)\) directly, Denoising Diffusion Probabilistic Models by Ho et al. (2020) reparameterize the reverse-process mean in terms of predicted noise. Using the forward process:
\[\mathbf{x}_t =\sqrt{\bar{\alpha}_t}\mathbf{x}_0 +\sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})\]- the model is trained to predict \(\epsilon\) via \(\epsilon_\theta(\mathbf{x}_t,t)\).
-
Under this parameterization, minimizing the KL terms in the VLB becomes (up to constants and weighting factors) equivalent to minimizing a simple mean-squared error loss on the noise:
\[\mathbb{E}_{t,\mathbf{x}_0,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(\mathbf{x}_t,t)\rVert^2\right]\]
- This simplification is one of the key reasons diffusion models are stable and easy to train in practice.
The Simple Objective
- Putting everything together, the final training objective used in DDPMs is:
\[L_{\text{simple}}(\theta) = \mathbb{E}_{t,\mathbf{x}_0,\epsilon} \left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\rVert^2\right]\]
This objective:
- Avoids explicit KL divergence computation
- Does not require modeling pixel likelihoods at intermediate timesteps
- Works uniformly across U-Net and DiT architectures
-
The same loss is used in later diffusion variants, including Denoising Diffusion Implicit Models by Song et al. (2020) and Scalable Diffusion Models with Transformers by Peebles et al. (2022).
Training and Sampling Algorithms
-
Training alternates between:
- Sampling a clean image \(\mathbf{x}_0\)
- Sampling a timestep \(t \sim \text{Uniform}(\{1,\dots,T\})\)
- Adding noise to obtain \(\mathbf{x}_t\)
- Updating \(\theta\) to minimize \(L_{\text{simple}}\)
-
Sampling reverses this process, starting from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})\) and iteratively applying:
\[\mathbf{x}_{t-1} \sim p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\]- until a clean image is produced.
-
The full procedure is summarized in the following table, reproduced from Denoising Diffusion Probabilistic Models by Ho et al. (2020):

Why This Objective Works
-
Empirically, Denoising Diffusion Probabilistic Models by Ho et al. (2020) show that:
- Noise prediction yields better perceptual quality than predicting \(\mathbf{x}_0\)
- The simplified objective correlates well with likelihood
- The approach generalizes across architectures and datasets
-
This noise-prediction objective has since become the standard loss function for diffusion models, forming the foundation for virtually all modern diffusion-based generative systems.
Key Takeaways
- In summary, U-Net-based diffusion models are the most prevalent type of diffusion models, particularly effective for image-related tasks due to their spatially structured convolutional architecture. They are simpler to train and computationally more efficient. The reverse process in U-Net-based models involves many transformations under continuous conditional Gaussian distributions and concludes with an independent discrete decoder to determine pixel values.
- On the other hand, DiTs leverage the power of transformers to handle a variety of data types, capturing long-range dependencies through attention mechanisms. They utilize a series of denoising steps with transformer blocks, employing self-attention to effectively model interactions within the data. However, DiT models are more complex and resource-intensive. The reverse process in DiT-based models involves iterative denoising steps that use transformer blocks to progressively refine the noisy input.
- The choice between these models depends on the specific requirements of the task, the nature of the data, and the available computational resources.
Conditional Diffusion Models
- Conditional Diffusion Models (CDMs) are an extension of diffusion probabilistic models, where the generation process is conditioned on auxiliary information. This conditioning allows for more structured and controlled synthesis, enabling models to produce outputs that adhere to specific constraints or descriptions.
- Conditioning in diffusion models can be applied using various inputs, such as text (e.g., CLIP embeddings, transformers) or visual data (e.g., images, segmentation maps, depth maps). These inputs influence both the theoretical underpinnings and practical implementations of the models, enhancing their ability to generate outputs aligned with user-defined specifications.
- Early diffusion models relied on simple concatenation techniques for conditioning. However, modern architectures have adopted more sophisticated methods like cross-attention mechanisms, which significantly improve guidance effectiveness. Additionally, techniques such as classifier-free guidance and feature modulation further refine controllability, allowing models to better interpret conditioning signals. These advancements make CDMs powerful tools for diverse tasks, including text-to-image synthesis and guided image manipulation.
Conditioning Mechanisms
-
Diffusion models, which iteratively denoise a Gaussian noise sample to generate an image, can be conditioned by modifying either the forward diffusion process, the reverse process, or both. Below are the primary methods used for conditioning:
- Concatenation: Directly concatenating conditioning information to the input (e.g., concatenating a text embedding or image feature map to the input image tensor). This was widely used in earlier models such as SR3 (Saharia et al., 2021) and Palette (Saharia et al., 2022).
- Cross-Attention: Using a transformer-based cross-attention mechanism to modulate the noise prediction process. This is commonly used in modern models like Imagen (Saharia et al., 2022) and Stable Diffusion (Rombach et al., 2022).
- Adaptive Normalization (AdaGN, AdaIN): Using conditioning information to modulate the mean and variance of intermediate activations.
- Classifier Guidance: Using an external classifier to guide the reverse diffusion process.
- Score-Based Guidance: Modifying the score function based on conditioning information.
-
Below, we describe how these approaches work mathematically and their implementations.
Text Conditioning in Diffusion Models
- Text conditioning in diffusion models typically involves leveraging text encoders such as CLIP, T5, or BERT to obtain a text embedding, which is then integrated into the diffusion model’s denoising network.
Encoding Textual Information
-
A text encoder extracts a fixed-length embedding from an input text description. Suppose the input text is denoted as \(T\); the text encoder \(E_{text}\) then produces an embedding:
\[z_T = E_{text}(T) \in \mathbb{R}^{d_{text}}\]- where \(z_T\) is the resulting embedding vector.
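- As a concrete sketch (assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 text encoder used by Stable Diffusion v1; in practice, cross-attention consumes the full per-token sequence rather than a single pooled vector):
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    z_T = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768): one embedding per token
print(z_T.shape)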
Concatenation vs. Cross-Attention Conditioning
- Earlier models such as SR3 (Saharia et al., 2021) and Palette (Saharia et al., 2022) used direct concatenation of conditioning inputs with noise latents. However, modern models like Stable Diffusion and Imagen rely on cross-attention for more expressive conditioning.
Cross-Attention
-
A common method for integrating \(z_T\) into the U-Net-based denoiser is via cross-attention. If \(f_l\) represents the feature map at layer \(l\) of the U-Net, attention-modulated features are computed as:
\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d}} \right) V\]- where:
- \(Q = W_Q f_l\) is computed from the U-Net feature map, while \(K = W_K z_T\) and \(V = W_V z_T\) are computed from the text embedding \(z_T\), with \(W_Q\), \(W_K\), \(W_V\) learned projection matrices.
This allows the model to attend to relevant text features while generating an image.
Implementation Details (PyTorch)
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim, context_dim):
        super().__init__()
        # queries come from the image features; keys/values come from the conditioning context (e.g., text tokens)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, context):
        # x: (batch, num_image_tokens, dim); context: (batch, num_context_tokens, context_dim)
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        attn = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
        attn = attn.softmax(dim=-1)
        out = torch.einsum('b i j, b j d -> b i d', attn, v)
        return out
Visual Conditioning in Diffusion Models
- Visual conditioning can be applied using images, segmentation maps, edge maps, or depth maps as conditioning inputs.
Concatenation-Based Conditioning
- A simple way to condition on an image is by concatenating it with the noise input at each timestep:
\(x_t' = \text{concat}(x_t, C)\)
- where \(x_t\) is the noisy image and \(C\) is the conditioning image. This method was prevalent in early models like SR3.
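- A minimal sketch of channel-wise concatenation (shapes are hypothetical; the denoiser's first convolution must then accept the combined channel count):
import torch

x_t = torch.randn(4, 3, 64, 64)    # noisy image x_t (hypothetical batch and resolution)
cond = torch.randn(4, 3, 64, 64)   # conditioning image C, resized to match x_t
x_t_cond = torch.cat([x_t, cond], dim=1)   # (4, 6, 64, 64): channel-wise concatenation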
Feature Map Injection via Cross-Attention
- More advanced methods use feature injection via cross-attention, as seen in Stable Diffusion and Imagen. Instead of concatenation, this method extracts a feature representation from a pretrained encoder \(E_{img}\):
\[z_C = E_{img}(C)\]
- and injects these features at various U-Net layers via FiLM (Feature-wise Linear Modulation):
\[\hat{f}_l = \gamma(z_C) \odot f_l + \beta(z_C)\]
- where \(\gamma\) and \(\beta\) are learned projections of the conditioning features.
Implementation Details (PyTorch)
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, in_channels, conditioning_dim):
        super().__init__()
        # predict a per-channel scale (gamma) and shift (beta) from the conditioning vector
        self.gamma = nn.Linear(conditioning_dim, in_channels)
        self.beta = nn.Linear(conditioning_dim, in_channels)

    def forward(self, x, conditioning):
        # x: (batch, channels, height, width); conditioning: (batch, conditioning_dim)
        gamma = self.gamma(conditioning).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(conditioning).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta
Classifier-Free Guidance
Background: Why Are External Classifiers Needed for Text-to-Image Synthesis Using Diffusion Models?
- Diffusion models, when used for text-to-image synthesis, produce high-quality and coherent images from textual descriptions. However, early implementations of diffusion-based text-to-image models often struggled to align generated images precisely with their corresponding textual descriptions. One method to improve this alignment is classifier guidance, where an external classifier is used to steer the diffusion process. The introduction of an external classifier provided an initial improvement in text-to-image synthesis by guiding diffusion models towards more accurate outputs.
The Need for External Classifiers
- Conditional Control: Early diffusion models generated images by iteratively refining a noise vector but lacked a robust mechanism to ensure strict adherence to the input text description.
- Gradient-Based Steering: External classifiers enabled gradient-based guidance by evaluating intermediate diffusion steps and providing directional corrections to better match the conditioning input.
- Enhancing Specificity: Without an external classifier, models sometimes produced images that, while visually plausible, did not accurately capture the semantics of the input text. The classifier provided a corrective mechanism to reinforce textual consistency.
- Limitations of Pure Unconditional Diffusion Models: Unconditional diffusion models trained without any conditioning struggled to generate diverse yet accurate samples aligned with a given input prompt. External classifiers were introduced to bridge this gap by explicitly providing additional constraints during inference.
Key Papers Introducing External Classifiers for Text-to-Image Synthesis Using Diffusion Models
-
Several papers introduced and explored the use of external classifiers for guiding text-to-image synthesis in diffusion models:
- Dhariwal and Nichol (2021): “Diffusion Models Beat GANs on Image Synthesis”
- This paper introduced classifier guidance as a mechanism to improve the fidelity and control of image generation in diffusion models.
- The approach leveraged an external classifier trained to predict image labels, which was then used to modify the sampling process by influencing the reverse diffusion steps.
- Mathematically, the classifier-based guidance modifies the score function as: \(\nabla_x \log p(y \mid x) \approx \frac{\partial f_y(x)}{\partial x},\) where \(f_y(x)\) represents the classifier’s output logits for class \(y\) given an image \(x\).
- Ho et al. (2021): “Classifier-Free Diffusion Guidance”
- This work proposed classifier-free guidance as an alternative to classifier-based guidance, enabling the model to learn both conditioned and unconditioned paths internally without requiring an external classifier.
- It showed that classifier-free guidance could achieve competitive or superior results compared to classifier-based methods while reducing architectural complexity.
- Ramesh et al. (2022): “Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL·E 2)”
- This paper incorporated a CLIP-based approach to improve text-to-image alignment without directly using an external classifier.
- Instead of an explicit classifier, a pretrained CLIP model was used to guide the image generation by matching textual and visual embeddings.
How Classifier-Free Guidance Works
- Compared to using external classifiers, classifier-free guidance has since emerged as a more efficient and flexible alternative, eliminating the need for additional classifiers while maintaining or exceeding the performance of classifier-based methods. Put simply, classifier-free guidance provides an alternative to external classifier-based guidance by training the model to handle both conditioned and unconditioned paths internally.
- By incorporating a dual-path training approach and an adjustable guidance scale, classifier-free guidance enhances fidelity, efficiency, and control in text-to-image synthesis, making it a preferred choice in modern generative models.
Dual Training Path
- Conditioned and Unconditioned Paths: During training, the model learns two distinct paths:
- A conditioned path, where the model is trained to generate outputs aligned with a given text description.
- An unconditioned path, where the model generates outputs without any guidance.
- Random Conditioning Dropout: To encourage robustness, the model is trained with random conditioning dropout, where a fraction of inputs are deliberately trained without text guidance.
- Self-Guidance Mechanism: By learning both paths simultaneously, the model can interpolate between conditioned and unconditioned generations, allowing it to effectively control guidance strength during inference.
Equations
- Training Objective:
-
The model learns two score functions:
\[\epsilon_\theta(x_t, y) \text{ and } \epsilon_\theta(x_t, \varnothing)\]- where:
- \(x_t\) is the noised image at time step \(t\),
- \(y\) represents the conditioning input (e.g., text prompt), and
- \(\varnothing\) represents the unconditioned input.
- Guidance During Inference:
-
Classifier-free guidance is implemented as:
\[\tilde{\epsilon}_\theta(x_t, y) = (1 + \gamma) \epsilon_\theta(x_t, y) - \gamma \epsilon_\theta(x_t, \varnothing)\]- where \(\gamma\) is the guidance scale controlling adherence to the conditioning input.
-
- Effect of Guidance Scale:
- When \(\gamma = 0\), the model behaves as an unconditional generator.
- When \(\gamma\) is increased, the generated output aligns more closely with the text condition.
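- A minimal sketch of one guided prediction (assuming an \(\epsilon\)-prediction model; `model`, `cond_emb`, and `null_emb` are placeholders, and in practice the conditional and unconditional forward passes are usually batched together):
def guided_eps(model, x_t, t, cond_emb, null_emb, gamma=7.5):
    # conditional and unconditional noise predictions: epsilon_theta(x_t, y) and epsilon_theta(x_t, null)
    eps_cond = model(x_t, t, cond_emb)
    eps_uncond = model(x_t, t, null_emb)
    # combine them as in the guidance equation above: (1 + gamma) * eps_cond - gamma * eps_uncond
    return (1 + gamma) * eps_cond - gamma * eps_uncond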
Benefits of Classifier-Free Guidance
- Eliminates the Need for an External Classifier
- Traditional classifier-based guidance requires a separately trained classifier, adding complexity to both training and inference.
- Classifier-free guidance removes this dependency, simplifying the overall architecture while maintaining strong performance.
- Improved Sample Quality
- External classifiers introduce additional noise and potential misalignment between the classifier and the generative model.
- Classifier-free guidance directly integrates the conditioning within the diffusion process, leading to more natural and coherent outputs.
- Reduced Computational Cost
- Training and utilizing an external classifier increases the computational burden.
- Classifier-free guidance eliminates the need for additional model components, streamlining both training and inference.
- Enhanced Generalization and Robustness
- Classifier-based methods can be prone to adversarial vulnerabilities and overfitting to specific datasets.
- Classifier-free approaches allow the diffusion model to generalize better across different conditioning signals and input variations.
- Flexibility and Real-Time Control
- Classifier-free guidance allows for dynamic adjustment of the guidance scale \(\gamma\) at inference time, providing fine-tuned control over generation quality and diversity.
- Users can experiment with different \(\gamma\) values without retraining the model, unlike classifier-based methods where the external classifier’s influence is fixed.
Prompting Guidance
- Crafting effective prompts is crucial for generating high-quality and relevant outputs using diffusion models. This guide is divided into two main sections: (i) Prompting for Text-to-Image models, and (ii) Prompting for Text-to-Video models.
Prompting Text-to-Image Models
- Text-to-image models, such as Stable Diffusion, DALL-E, and Imagen, translate textual descriptions into visual outputs. The success of a prompt depends on how well it describes the desired image in a structured, caption-like format. Below are the key considerations and techniques for crafting effective prompts for text-to-image generation.
Key Prompting Guidelines
- Phrase Your Prompt as an Image Caption:
- Avoid conversational language or commands. Instead, describe the desired image with concise, clear details as you would in an image caption.
- Example: “Realistic photo of a snowy mountain range under a clear blue sky, with sunlight casting long shadows.”
- Structure Your Prompt Using the Formula:
- [Subject] in [Environment], [Optional Pose/Position], [Optional Lighting], [Optional Camera Position/Framing], [Optional Style/Medium].
- Example: “A golden retriever playing in a grassy park during sunset, photorealistic, warm lighting.”
- Character Limit:
- Prompts must not exceed 1024 characters. Place less important details near the end.
- Avoid Negation Words:
- Do not use words like “no,” “not,” or “without.” For example, the prompt “a fruit basket with no bananas” may result in bananas being included. Instead, use negative prompts:
- Example:
Prompt: A fruit basket with apples and oranges.
Negative Prompt: Bananas.
- Refinement Techniques:
- Use a consistent seed value to test prompt variations, iterating with small changes to understand how each affects the output.
- Once satisfied with a prompt, generate variations by running the same prompt with different seed values.
Example Prompts for Text-to-Image Models
| Use Case | Prompt | Negative Prompt |
|---|---|---|
| Stock Photo | "Realistic editorial photo of a teacher standing at a blackboard with a warm smile." | "Crossed arms." |
| Story Illustration | "Whimsical storybook illustration: a knight in armor kneeling before a glowing sword." | "Cartoonish style." |
| Cinematic Landscape | "Drone view of a dark river winding through a stark Icelandic landscape, cinematic quality." | None |
Prompting Text-to-Video Models
- Text-to-video models extend text-to-image capabilities to temporal domains, generating coherent sequences of frames based on textual prompts. These models use additional techniques, such as temporal embeddings, to capture motion and transitions over time.
Key Prompting Guidelines
- Phrase Your Prompt as a Video Summary:
- Describe the video sequence as if summarizing its content, focusing on the subject, action, and environment.
- Example: “A time-lapse of a sunflower blooming in a sunny garden. Vibrant colors, cinematic lighting.”
- Include Camera Movement for Dynamic Outputs:
- Add camera movement descriptions (e.g., dolly shot, aerial view) at the start or end of the prompt for optimal results.
- Example: “Arc shot of a basketball spinning on a finger in slow motion. Cinematic, sharp focus, 4K resolution.”
- Character Limit:
- Like text-to-image prompts, video prompts must not exceed 1024 characters.
- Avoid Negation Words:
- Use negative prompts to exclude unwanted elements, similar to text-to-image generation.
- Refinement Techniques:
- Experiment with different camera movements, action descriptions, or lighting effects to improve output consistency and realism.
Camera Movements
- In video prompts, describing camera motion adds dynamic perspectives to the generated sequence. Below is a reference table of common camera movements and their suggested keywords:
| Camera Movement | Suggested Keywords | Definition |
|---|---|---|
| Aerial Shot | aerial shot, drone shot, first-person view (FPV) | A shot taken from above, often from a drone or aircraft. |
| Arc Shot | arc shot, 360-degree shot, orbit shot | Camera moves in a circular path around a central point/object. |
| Clockwise Rotation | camera rotates clockwise, clockwise rolling shot | Camera rotates in a clockwise direction. |
| Dolly In | dolly in, camera moves forward | Camera moves forward. |
Example Prompts for Text-to-Video Models
| Use Case | Prompt | Negative Prompt |
|---|---|---|
| Food Advertisement | "Cinematic dolly shot of a juicy cheeseburger with melting cheese, fries, and a cola on a diner table." | "Messy table." |
| Product Showcase | "Arc shot of a luxury wristwatch on a glass display, under studio lighting, with a blurred background." | "Low resolution." |
| Nature Scene | "Aerial shot of a waterfall cascading through a dense forest. Soft lighting, 4K resolution." | None |
Key Takeaways
Text-to-Image
- For text-to-image tasks, focus on describing the subject and its environment with optional details like lighting, style, and camera position. Use clear, concise descriptions structured like image captions.
Text-to-Video
-
For text-to-video tasks, describe the sequence as a whole, including subject actions, camera movements, and temporal transitions. Camera motion plays a critical role in adding dynamic elements to the video output.
-
Both types of prompting require careful attention to phrasing and refinement to achieve optimal results. By iterating and experimenting with different seeds and negative prompts, you can generate visually stunning and contextually accurate outputs tailored to your needs.
Diffusion Models in PyTorch
Implementing the original paper
- Let’s go over the original Denoising Diffusion Probabilistic Models (DDPMs) paper by Ho et al., 2020 and implement it step by step, following Phil Wang’s implementation and The Annotated Diffusion by Hugging Face, both of which are based on the original implementation.

Pre-requisites: Setup and Importing Libraries
- Let’s start with the setup and importing all the required libraries:
from IPython.display import Image
Image(filename='assets/78_annotated-diffusion/ddpm_paper.png')
!pip install -q -U einops datasets matplotlib tqdm
import math
from inspect import isfunction
from functools import partial
%matplotlib inline
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from einops import rearrange
import torch
from torch import nn, einsum
import torch.nn.functional as F
Helper functions
- Now let’s implement the neural network we have looked at earlier. First we start with a few helper functions.
- Most notably, we define the `Residual` class, which adds the input of a given function to its output; in other words, it wraps a residual connection around that function.
def exists(x):
return x is not None
def default(val, d):
if exists(val):
return val
return d() if isfunction(d) else d
class Residual(nn.Module):
def __init__(self, fn):
super().__init__()
self.fn = fn
def forward(self, x, *args, **kwargs):
return self.fn(x, *args, **kwargs) + x
def Upsample(dim):
return nn.ConvTranspose2d(dim, dim, 4, 2, 1)
def Downsample(dim):
return nn.Conv2d(dim, dim, 4, 2, 1)
- Note: the parameters of the neural network are shared across time (noise level).
- Thus, for the neural network to keep track of which time step (noise level) it is on, the authors used sinusoidal position embeddings to encode \(t\).
- The `SinusoidalPositionEmbeddings` class, defined below, takes a tensor of shape `(batch_size, 1)` as input (the noise levels of a batch).
- It turns this input into a tensor of shape `(batch_size, dim)`, where `dim` is the dimensionality of the position embeddings.
class SinusoidalPositionEmbeddings(nn.Module):
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, time):
device = time.device
half_dim = self.dim // 2
embeddings = math.log(10000) / (half_dim - 1)
embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
embeddings = time[:, None] * embeddings[None, :]
embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
return embeddings
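- As a quick illustrative check (not part of the original walkthrough), embedding a batch of three timesteps yields a `(3, 32)` tensor:
time_emb = SinusoidalPositionEmbeddings(dim=32)(torch.tensor([1., 10., 100.]))
print(time_emb.shape)  # torch.Size([3, 32])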
Model Core: ResNet or ConvNeXT
- Now we will look at the core of our U-Net model. The original DDPM authors employed a Wide ResNet block (Zagoruyko et al., 2016); however, Phil Wang has also added support for a ConvNeXT block (Liu et al., 2022).
- You are free to use either one in your final U-Net architecture; both are provided below:
class Block(nn.Module):
def __init__(self, dim, dim_out, groups = 8):
super().__init__()
self.proj = nn.Conv2d(dim, dim_out, 3, padding = 1)
self.norm = nn.GroupNorm(groups, dim_out)
self.act = nn.SiLU()
def forward(self, x, scale_shift = None):
x = self.proj(x)
x = self.norm(x)
if exists(scale_shift):
scale, shift = scale_shift
x = x * (scale + 1) + shift
x = self.act(x)
return x
class ResnetBlock(nn.Module):
"""https://arxiv.org/abs/1512.03385"""
def __init__(self, dim, dim_out, *, time_emb_dim=None, groups=8):
super().__init__()
self.mlp = (
nn.Sequential(nn.SiLU(), nn.Linear(time_emb_dim, dim_out))
if exists(time_emb_dim)
else None
)
self.block1 = Block(dim, dim_out, groups=groups)
self.block2 = Block(dim_out, dim_out, groups=groups)
self.res_conv = nn.Conv2d(dim, dim_out, 1) if dim != dim_out else nn.Identity()
def forward(self, x, time_emb=None):
h = self.block1(x)
if exists(self.mlp) and exists(time_emb):
time_emb = self.mlp(time_emb)
h = rearrange(time_emb, "b c -> b c 1 1") + h
h = self.block2(h)
return h + self.res_conv(x)
class ConvNextBlock(nn.Module):
"""https://arxiv.org/abs/2201.03545"""
def __init__(self, dim, dim_out, *, time_emb_dim=None, mult=2, norm=True):
super().__init__()
self.mlp = (
nn.Sequential(nn.GELU(), nn.Linear(time_emb_dim, dim))
if exists(time_emb_dim)
else None
)
self.ds_conv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
self.net = nn.Sequential(
nn.GroupNorm(1, dim) if norm else nn.Identity(),
nn.Conv2d(dim, dim_out * mult, 3, padding=1),
nn.GELU(),
nn.GroupNorm(1, dim_out * mult),
nn.Conv2d(dim_out * mult, dim_out, 3, padding=1),
)
self.res_conv = nn.Conv2d(dim, dim_out, 1) if dim != dim_out else nn.Identity()
def forward(self, x, time_emb=None):
h = self.ds_conv(x)
if exists(self.mlp) and exists(time_emb):
condition = self.mlp(time_emb)
h = h + rearrange(condition, "b c -> b c 1 1")
h = self.net(h)
return h + self.res_conv(x)
Attention
- Next, we will look into defining the attention module which was added between the convolutional blocks in DDPM.
- Phil Wang added two variants of attention: the regular multi-head self-attention from the original Transformer paper (Vaswani et al., 2017), and a linear attention variant (Shen et al., 2018).
- The linear attention variant’s time and memory requirements scale linearly in the sequence length, as opposed to quadratically for regular attention.
class Attention(nn.Module):
def __init__(self, dim, heads=4, dim_head=32):
super().__init__()
self.scale = dim_head**-0.5
self.heads = heads
hidden_dim = dim_head * heads
self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
self.to_out = nn.Conv2d(hidden_dim, dim, 1)
def forward(self, x):
b, c, h, w = x.shape
qkv = self.to_qkv(x).chunk(3, dim=1)
q, k, v = map(
lambda t: rearrange(t, "b (h c) x y -> b h c (x y)", h=self.heads), qkv
)
q = q * self.scale
sim = einsum("b h d i, b h d j -> b h i j", q, k)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
attn = sim.softmax(dim=-1)
out = einsum("b h i j, b h d j -> b h i d", attn, v)
out = rearrange(out, "b h (x y) d -> b (h d) x y", x=h, y=w)
return self.to_out(out)
class LinearAttention(nn.Module):
def __init__(self, dim, heads=4, dim_head=32):
super().__init__()
self.scale = dim_head**-0.5
self.heads = heads
hidden_dim = dim_head * heads
self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
self.to_out = nn.Sequential(nn.Conv2d(hidden_dim, dim, 1),
nn.GroupNorm(1, dim))
def forward(self, x):
b, c, h, w = x.shape
qkv = self.to_qkv(x).chunk(3, dim=1)
q, k, v = map(
lambda t: rearrange(t, "b (h c) x y -> b h c (x y)", h=self.heads), qkv
)
q = q.softmax(dim=-2)
k = k.softmax(dim=-1)
q = q * self.scale
context = torch.einsum("b h d n, b h e n -> b h d e", k, v)
out = torch.einsum("b h d e, b h d n -> b h e n", context, q)
out = rearrange(out, "b h c (x y) -> b (h c) x y", h=self.heads, x=h, y=w)
return self.to_out(out)
- DDPM then interleaves the convolutional/attention layers of the U-Net with group normalization.
- Below, the `PreNorm` class applies group normalization before the attention layer.
- Note: there has been debate about whether normalization is better applied before or after attention in Transformers.
class PreNorm(nn.Module):
def __init__(self, dim, fn):
super().__init__()
self.fn = fn
self.norm = nn.GroupNorm(1, dim)
def forward(self, x):
x = self.norm(x)
return self.fn(x)
Overall network
- Now that we have all the building blocks of the neural network (ResNet/ConvNeXT blocks, attention, positional embeddings, group norm), let’s define the entire neural network.
- The task of this neural network is to take in a batch of noisy images and their noise levels and then to output the noise added to the input.
- The network takes a batch of noisy images of shape `(batch_size, num_channels, height, width)` and a batch of noise levels of shape `(batch_size, 1)` as input, and returns a tensor of shape `(batch_size, num_channels, height, width)`.
- The network is built up as follows: (source)
- first, a convolutional layer is applied on the batch of noisy images, and position embeddings are computed for the noise levels
- next, a sequence of downsampling stages are applied.
- Each downsampling stage consists of two ResNet/ConvNeXT blocks + groupnorm + attention + residual connection + a downsample operation
- at the middle of the network, again ResNet or ConvNeXT blocks are applied, interleaved with attention
- next, a sequence of upsampling stages are applied.
- Each upsampling stage consists of two ResNet/ConvNeXT blocks + groupnorm + attention + residual connection + an upsample operation
- finally, a ResNet/ConvNeXT block followed by a convolutional layer is applied.
class Unet(nn.Module):
def __init__(
self,
dim,
init_dim=None,
out_dim=None,
dim_mults=(1, 2, 4, 8),
channels=3,
with_time_emb=True,
resnet_block_groups=8,
use_convnext=True,
convnext_mult=2,
):
super().__init__()
# determine dimensions
self.channels = channels
init_dim = default(init_dim, dim // 3 * 2)
self.init_conv = nn.Conv2d(channels, init_dim, 7, padding=3)
dims = [init_dim, *map(lambda m: dim * m, dim_mults)]
in_out = list(zip(dims[:-1], dims[1:]))
if use_convnext:
block_klass = partial(ConvNextBlock, mult=convnext_mult)
else:
block_klass = partial(ResnetBlock, groups=resnet_block_groups)
# time embeddings
if with_time_emb:
time_dim = dim * 4
self.time_mlp = nn.Sequential(
SinusoidalPositionEmbeddings(dim),
nn.Linear(dim, time_dim),
nn.GELU(),
nn.Linear(time_dim, time_dim),
)
else:
time_dim = None
self.time_mlp = None
# layers
self.downs = nn.ModuleList([])
self.ups = nn.ModuleList([])
num_resolutions = len(in_out)
for ind, (dim_in, dim_out) in enumerate(in_out):
is_last = ind >= (num_resolutions - 1)
self.downs.append(
nn.ModuleList(
[
block_klass(dim_in, dim_out, time_emb_dim=time_dim),
block_klass(dim_out, dim_out, time_emb_dim=time_dim),
Residual(PreNorm(dim_out, LinearAttention(dim_out))),
Downsample(dim_out) if not is_last else nn.Identity(),
]
)
)
mid_dim = dims[-1]
self.mid_block1 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim)
self.mid_attn = Residual(PreNorm(mid_dim, Attention(mid_dim)))
self.mid_block2 = block_klass(mid_dim, mid_dim, time_emb_dim=time_dim)
for ind, (dim_in, dim_out) in enumerate(reversed(in_out[1:])):
is_last = ind >= (num_resolutions - 1)
self.ups.append(
nn.ModuleList(
[
block_klass(dim_out * 2, dim_in, time_emb_dim=time_dim),
block_klass(dim_in, dim_in, time_emb_dim=time_dim),
Residual(PreNorm(dim_in, LinearAttention(dim_in))),
Upsample(dim_in) if not is_last else nn.Identity(),
]
)
)
out_dim = default(out_dim, channels)
self.final_conv = nn.Sequential(
block_klass(dim, dim), nn.Conv2d(dim, out_dim, 1)
)
def forward(self, x, time):
x = self.init_conv(x)
t = self.time_mlp(time) if exists(self.time_mlp) else None
h = []
# downsample
for block1, block2, attn, downsample in self.downs:
x = block1(x, t)
x = block2(x, t)
x = attn(x)
h.append(x)
x = downsample(x)
# bottleneck
x = self.mid_block1(x, t)
x = self.mid_attn(x)
x = self.mid_block2(x, t)
# upsample
for block1, block2, attn, upsample in self.ups:
x = torch.cat((x, h.pop()), dim=1)
x = block1(x, t)
x = block2(x, t)
x = attn(x)
x = upsample(x)
return self.final_conv(x)
- Note: by default, the noise predictor uses ConvNeXT blocks (as `use_convnext` is set to `True`) and position embeddings are added (as `with_time_emb` is set to `True`).
Forward diffusion
- Now let’s take a look at the forward diffusion process. Remember that the forward diffusion process gradually adds noise to an image over a number of time steps \(T\).
def cosine_beta_schedule(timesteps, s=0.008):
"""
cosine schedule as proposed in https://arxiv.org/abs/2102.09672
"""
steps = timesteps + 1
x = torch.linspace(0, timesteps, steps)
alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return torch.clip(betas, 0.0001, 0.9999)
def linear_beta_schedule(timesteps):
beta_start = 0.0001
beta_end = 0.02
return torch.linspace(beta_start, beta_end, timesteps)
def quadratic_beta_schedule(timesteps):
beta_start = 0.0001
beta_end = 0.02
return torch.linspace(beta_start**0.5, beta_end**0.5, timesteps) ** 2
def sigmoid_beta_schedule(timesteps):
beta_start = 0.0001
beta_end = 0.02
betas = torch.linspace(-6, 6, timesteps)
return torch.sigmoid(betas) * (beta_end - beta_start) + beta_start
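- As an optional sanity check (illustrative, not part of the original walkthrough), we can compare how quickly each schedule destroys signal by printing the cumulative product \(\bar{\alpha}_t\) at a couple of timesteps:
# Quick comparison of how much signal survives under each schedule (alphas_cumprod = \bar{\alpha}_t)
T = 200
for name, schedule in [("linear", linear_beta_schedule), ("cosine", cosine_beta_schedule),
                       ("quadratic", quadratic_beta_schedule), ("sigmoid", sigmoid_beta_schedule)]:
    betas = schedule(T)
    alphas_cumprod = torch.cumprod(1. - betas, dim=0)
    print(f"{name:>9}: alpha_bar at t=T/2 is {alphas_cumprod[T // 2].item():.4f}, at t=T is {alphas_cumprod[-1].item():.6f}")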
- To start with, let’s use the linear schedule for \(T=200\) time steps and define the quantities derived from \(\beta_t\) that we will need, such as the cumulative products \(\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)\).
- Each of the variables below is a 1-dimensional tensor, storing one value per timestep from \(1\) to \(T\).
- Importantly, we also define an `extract` function, which allows us to look up the values of these tensors at the appropriate index \(t\) for a batch of timestep indices. (source)
timesteps = 200
# define beta schedule
betas = linear_beta_schedule(timesteps=timesteps)
# define alphas
alphas = 1. - betas
alphas_cumprod = torch.cumprod(alphas, axis=0)
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
# calculations for diffusion q(x_t | x_{t-1}) and others
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - alphas_cumprod)
# calculations for posterior q(x_{t-1} | x_t, x_0)
posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
def extract(a, t, x_shape):
batch_size = t.shape[0]
out = a.gather(-1, t.cpu())
return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)
- Now let’s take an image and illustrate how noise is added to it at each time step of the diffusion process, operating on PyTorch tensors:
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image

- We first normalize images by dividing by 255 (so that they are in the `[0, 1]` range), and then rescale them to the `[-1, 1]` range.
from torchvision.transforms import Compose, ToTensor, Lambda, ToPILImage, CenterCrop, Resize
image_size = 128
transform = Compose([
Resize(image_size),
CenterCrop(image_size),
    ToTensor(),  # turn into a torch tensor of shape CHW and scale values to [0, 1]
Lambda(lambda t: (t * 2) - 1),
])
x_start = transform(image).unsqueeze(0)
x_start.shape
Output:
----------------------------------------------------------------------------------------------------
torch.Size([1, 3, 128, 128])
- We also define the reverse transform, which takes in a PyTorch tensor containing values in
[-1, 1]and turn them back into a PIL image:
import numpy as np
reverse_transform = Compose([
Lambda(lambda t: (t + 1) / 2),
Lambda(lambda t: t.permute(1, 2, 0)), # CHW to HWC
Lambda(lambda t: t * 255.),
Lambda(lambda t: t.numpy().astype(np.uint8)),
ToPILImage(),
])
- Let’s run an example and see what it produces:
reverse_transform(x_start.squeeze())

- We can now define the forward diffusion process as in the paper:
# forward diffusion (using the nice property)
def q_sample(x_start, t, noise=None):
if noise is None:
noise = torch.randn_like(x_start)
sqrt_alphas_cumprod_t = extract(sqrt_alphas_cumprod, t, x_start.shape)
sqrt_one_minus_alphas_cumprod_t = extract(
sqrt_one_minus_alphas_cumprod, t, x_start.shape
)
return sqrt_alphas_cumprod_t * x_start + sqrt_one_minus_alphas_cumprod_t * noise
- Let’s test it on a particular time step and see the image it produces:
def get_noisy_image(x_start, t):
# add noise
x_noisy = q_sample(x_start, t=t)
# turn back into PIL image
noisy_image = reverse_transform(x_noisy.squeeze())
return noisy_image
# take time step
t = torch.tensor([40])
get_noisy_image(x_start, t)

- We can see the image is getting more noisy. Now let’s zoom out a bit and visualize this for various time steps:
import matplotlib.pyplot as plt
# use seed for reproducibility
torch.manual_seed(0)
# source: https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py
def plot(imgs, with_orig=False, row_title=None, **imshow_kwargs):
if not isinstance(imgs[0], list):
# Make a 2d grid even if there's just 1 row
imgs = [imgs]
num_rows = len(imgs)
num_cols = len(imgs[0]) + with_orig
fig, axs = plt.subplots(figsize=(200,200), nrows=num_rows, ncols=num_cols, squeeze=False)
for row_idx, row in enumerate(imgs):
row = [image] + row if with_orig else row
for col_idx, img in enumerate(row):
ax = axs[row_idx, col_idx]
ax.imshow(np.asarray(img), **imshow_kwargs)
ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
if with_orig:
axs[0, 0].set(title='Original image')
axs[0, 0].title.set_size(8)
if row_title is not None:
for row_idx in range(num_rows):
axs[row_idx, 0].set(ylabel=row_title[row_idx])
plt.tight_layout()
plot([get_noisy_image(x_start, torch.tensor([t])) for t in [0, 50, 100, 150, 199]])

- As we can see above, the image becomes progressively noisier as it moves through the forward diffusion process.
- Thus, we can now move on to defining our loss function. The `denoise_model` will be the U-Net defined above. We’ll employ the Huber loss between the true and the predicted noise.
def p_losses(denoise_model, x_start, t, noise=None, loss_type="l1"):
if noise is None:
noise = torch.randn_like(x_start)
x_noisy = q_sample(x_start=x_start, t=t, noise=noise)
predicted_noise = denoise_model(x_noisy, t)
if loss_type == 'l1':
loss = F.l1_loss(noise, predicted_noise)
elif loss_type == 'l2':
loss = F.mse_loss(noise, predicted_noise)
elif loss_type == "huber":
loss = F.smooth_l1_loss(noise, predicted_noise)
else:
raise NotImplementedError()
return loss
Dataset
- Let’s now look at loading our dataset. A quick note: all images in the dataset need to be resized to the same resolution.
- Hugging Face’s fashion_mnist dataset, which we will use in this example, already does that for us: all images have the same resolution of \(28 \times 28\).
from datasets import load_dataset
# load dataset from the hub
dataset = load_dataset("fashion_mnist")
image_size = 28
channels = 1
batch_size = 128
- Now, we will define a function `transforms` which we’ll apply on-the-fly to the entire dataset.
- The function just applies some basic image preprocessing: random horizontal flips, rescaling, and finally mapping values into the `[-1, 1]` range.
from torchvision import transforms
from torch.utils.data import DataLoader
# define image transformations (e.g. using torchvision)
transform = Compose([
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Lambda(lambda t: (t * 2) - 1)
])
# define function
def transforms(examples):
examples["pixel_values"] = [transform(image.convert("L")) for image in examples["image"]]
del examples["image"]
return examples
transformed_dataset = dataset.with_transform(transforms).remove_columns("label")
# create dataloader
dataloader = DataLoader(transformed_dataset["train"], batch_size=batch_size, shuffle=True)
batch = next(iter(dataloader))
print(batch.keys())
Output:
----------------------------------------------------------------------------------------------------
dict_keys(['pixel_values'])
Sampling during training
- The paper also talks about sampling from the model during training in order to track progress.
- Ideally, generating new images from a diffusion model happens by reversing the diffusion process:
- We start from \(T\), where we sample pure noise from a Gaussian distribution
- Then we use our neural network to gradually de-noise it using the conditional probability it has learned, continuing until we end up at time step \(t = 0\).
- We can derive a slightly less noisy image \(\mathbf{x}_{t-1}\) by plugging the reparameterization of the mean, computed with our noise predictor, into the reverse transition \(p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\).
- Remember that the variance is known ahead of time.
- After all of this, ideally, we end up with an image that looks like it came from the real data distribution.
- Let’s look at the code for that below:
@torch.no_grad()
def p_sample(model, x, t, t_index):
betas_t = extract(betas, t, x.shape)
sqrt_one_minus_alphas_cumprod_t = extract(
sqrt_one_minus_alphas_cumprod, t, x.shape
)
sqrt_recip_alphas_t = extract(sqrt_recip_alphas, t, x.shape)
# Equation 11 in the paper
# Use our model (noise predictor) to predict the mean
model_mean = sqrt_recip_alphas_t * (
x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t
)
if t_index == 0:
return model_mean
else:
posterior_variance_t = extract(posterior_variance, t, x.shape)
noise = torch.randn_like(x)
# Algorithm 2 line 4:
return model_mean + torch.sqrt(posterior_variance_t) * noise
# Algorithm 2 (including returning all images)
@torch.no_grad()
def p_sample_loop(model, shape):
device = next(model.parameters()).device
b = shape[0]
# start from pure noise (for each example in the batch)
img = torch.randn(shape, device=device)
imgs = []
for i in tqdm(reversed(range(0, timesteps)), desc='sampling loop time step', total=timesteps):
img = p_sample(model, img, torch.full((b,), i, device=device, dtype=torch.long), i)
imgs.append(img.cpu().numpy())
return imgs
@torch.no_grad()
def sample(model, image_size, batch_size=16, channels=3):
return p_sample_loop(model, shape=(batch_size, channels, image_size, image_size))
- Now, let’s get to some training! We will train the model with PyTorch and occasionally save a few image samples using the `sample` function from above.
from pathlib import Path
def num_to_groups(num, divisor):
groups = num // divisor
remainder = num % divisor
arr = [divisor] * groups
if remainder > 0:
arr.append(remainder)
return arr
results_folder = Path("./results")
results_folder.mkdir(exist_ok = True)
save_and_sample_every = 1000
- Below, we define the model, and move it to the GPU along with defining Adam, a standard optimizer.
from torch.optim import Adam
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Unet(
dim=image_size,
channels=channels,
dim_mults=(1, 2, 4,)
)
model.to(device)
optimizer = Adam(model.parameters(), lr=1e-3)
Training
- Now let’s start the training process:
from torchvision.utils import save_image
epochs = 5
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()

        batch_size = batch["pixel_values"].shape[0]
        batch = batch["pixel_values"].to(device)

        # Algorithm 1 line 3: sample t uniformly for every example in the batch
        t = torch.randint(0, timesteps, (batch_size,), device=device).long()

        loss = p_losses(model, batch, t, loss_type="huber")

        if step % 100 == 0:
            print("Loss:", loss.item())

        loss.backward()
        optimizer.step()

        # save generated images
        if step != 0 and step % save_and_sample_every == 0:
            milestone = step // save_and_sample_every
            batches = num_to_groups(4, batch_size)
            all_images_list = list(map(lambda n: sample(model, image_size=image_size, batch_size=n, channels=channels), batches))
            # sample() returns one array per timestep; keep only the final denoised images
            all_images = torch.cat([torch.from_numpy(images[-1]) for images in all_images_list], dim=0)
            all_images = (all_images + 1) * 0.5
            save_image(all_images, str(results_folder / f'sample-{milestone}.png'), nrow=6)
Output:
----------------------------------------------------------------------------------------------------
Loss: 0.46477368474006653
Loss: 0.12143351882696152
Loss: 0.08106148988008499
Loss: 0.0801810547709465
Loss: 0.06122320517897606
Loss: 0.06310459971427917
Loss: 0.05681884288787842
Loss: 0.05729678273200989
Loss: 0.05497899278998375
Loss: 0.04439849033951759
Loss: 0.05415581166744232
Loss: 0.06020551547408104
Loss: 0.046830907464027405
Loss: 0.051029372960329056
Loss: 0.0478244312107563
Loss: 0.046767622232437134
Loss: 0.04305662214756012
Loss: 0.05216279625892639
Loss: 0.04748568311333656
Loss: 0.05107741802930832
Loss: 0.04588869959115982
Loss: 0.043014321476221085
Loss: 0.046371955424547195
Loss: 0.04952816292643547
Loss: 0.04472338408231735
- And finally, let's look at inference, i.e., sampling, using the `sample` function we defined above.
import matplotlib.pyplot as plt

# sample 64 images
samples = sample(model, image_size=image_size, batch_size=64, channels=channels)

# show a random one
random_index = 5
plt.imshow(samples[-1][random_index].reshape(image_size, image_size, channels), cmap="gray")

- Seems like the model is capable of generating a nice T-shirt! Keep in mind that the dataset we trained on is pretty low-resolution (28x28).
Creating a GIF
- Lastly, in order to see the progression of the de-noising process, we can create a GIF:
import matplotlib.animation as animation

random_index = 53

fig = plt.figure()
ims = []
for i in range(timesteps):
    im = plt.imshow(samples[i][random_index].reshape(image_size, image_size, channels), cmap="gray", animated=True)
    ims.append([im])

animate = animation.ArtistAnimation(fig, ims, interval=50, blit=True, repeat_delay=1000)
animate.save('diffusion.gif')
plt.show()

- Hopefully this was beneficial in clarifying the diffusion model concepts!
- Furthermore, it is highly recommended to look at Hugging Face's Training with Diffusers notebook to see how to leverage their Diffusers library to train a simple model.
- And, for inference, they also provide this notebook where you can see the images being generated.
denoising-diffusion-pytorch package
- While diffusion models have not yet been democratized to the same degree as older architectures/approaches in machine learning, there are still implementations available for use. The easiest way to use a diffusion model in PyTorch is the `denoising-diffusion-pytorch` package, which implements an image diffusion model like the one discussed in this article. To install the package, simply type the following command in the terminal:
pip install denoising_diffusion_pytorch
Minimal Example
- To train a model and generate images, we first import the necessary packages:
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion
- Next, we define our network architecture, in this case a U-Net. The `dim` parameter specifies the number of feature maps before the first down-sampling, and the `dim_mults` parameter provides multiplicands for this value and successive down-samplings:
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)
- Now that our network architecture is defined, we need to define the diffusion model itself. We pass in the U-Net model that we just defined along with several parameters - the size of images to generate, the number of timesteps in the diffusion process, and a choice between the L1 and L2 norms.
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)
- Now that the diffusion model is defined, it’s time to train. We generate random data to train on, and then train the diffusion model in the usual fashion:
training_images = torch.randn(8, 3, 128, 128)
loss = diffusion(training_images)
loss.backward()
- Once the model is trained, we can finally generate images by using the `sample()` method of the `diffusion` object. Here we generate 4 images, which are only noise given that our training data was random:
sampled_images = diffusion.sample(batch_size = 4)

Training on Custom Data
- The `denoising-diffusion-pytorch` package also allows you to train a diffusion model on a specific dataset. Simply replace the `'path/to/your/images'` string with the dataset directory path in the `Trainer()` object below, and change `image_size` to the appropriate value. After that, simply run the code to train the model, and then sample as before. Note that PyTorch must be compiled with CUDA enabled in order to use the `Trainer` class:
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,
    train_num_steps = 700000,        # total training steps
    gradient_accumulate_every = 2,   # gradient accumulation steps
    ema_decay = 0.995,               # exponential moving average decay
    amp = True                       # turn on mixed precision
)

trainer.train()
- Below you can see progressive denoising from multivariate Gaussian noise to MNIST digits akin to reverse diffusion:

HuggingFace Diffusers
- HuggingFace diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models.

- More precisely, HuggingFace Diffusers offers:
- State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code.
- Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference.
- Multiple types of models, such as UNet, that can be used as building blocks in an end-to-end diffusion system.
- Training examples to show how to train the most popular diffusion models.
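- As a quick illustration of how little code a Diffusers pipeline requires, here is a minimal sketch of generating an image with a pretrained unconditional DDPM pipeline; the `google/ddpm-cat-256` checkpoint is an illustrative choice, and any other pretrained pipeline on the Hub could be substituted.

```python
# Minimal sketch of the HuggingFace Diffusers pipeline API
# (the checkpoint name is an illustrative assumption).
import torch
from diffusers import DDPMPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained unconditional diffusion pipeline (U-Net + noise scheduler bundled together).
pipeline = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to(device)

# Run the full reverse-diffusion loop and grab the decoded PIL image.
image = pipeline(num_inference_steps=1000).images[0]
image.save("ddpm_sample.png")
```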
Implementations
Stable Diffusion
- Stable Diffusion (blog) is a state-of-the-art text-to-image model that generates images from text. It makes its high-performance models available to the public at large to use here.
- Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database, which is the largest freely accessible multi-modal dataset.
- Let’s now look at how it works with the illustrations below by Jay Alammar.

- Stable Diffusion is quite versatile because it can be used in a variety of ways.
- In the image we see above, we can see that it can take text as input and output a generated image. This is the primary use case, however, it is not the only one.

- As we can see from the image above, another use case of Stable Diffusion is with image and text as input, and it will output a generated image again. This is called img2img.
- It’s able to be so versatile because Stable Diffusion is not one monolith model, but a system made up of several components and models.
- To be specific, Stable Diffusion is made up of a:
- 1) Text Understanding component
- 2) Image Generation component

- The text understanding component is actually the text encoder used within CLIP.
- As we can see represented in the image below, Stable Diffusion takes the input text within the Text Understanding component and returns a vector representing each token in the text.
- This information is then passed over to the Image Generator component which internally is composed of 2 components as well.

- Now, referring to the image below, let’s look at the two components within the Image Generator component.
- Image Information Creator:
- This is the ‘secret sauce’ of Stable Diffusion as it runs for a number of steps refining the information that should go in the image that will become the model’s output.
- Image Decoder:
- This component takes the processed information and paints the picture.

- Let’s zoom out for a second and look at the higher level components we have so far all working together for the image generation task:

- All the 3 components above are actually individual neural networks working together, specifically, they are:
- CLIPText: Used to encode the text
- U-Net + scheduler: Used to gradually process image information (latent diffusion)
- Autoencoder Decoder: Paints the final image

- Above we can see the steps that Stable Diffusion takes to generate its images.
- Lastly, let's zoom into the image decoder and get a better understanding of its inner workings. Remember, the image decoder is one of the two components that make up the image generator.

- The random vector that seeds this process is pure random noise.
- Stable Diffusion obtains its speed from the fact that the processing happens in the latent space, which requires far fewer computations than the pixel space.
Dream Studio
- Dream Studio is Stable Diffusion’s AI Art Web App Tool.
- DreamStudio is a new suite of generative media tools engineered to grant everyone the power of limitless imagination and the effortless ease of visual expression through a combination of natural language processing and revolutionary input controls for accelerated creativity.
Midjourney
- Midjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species.
- Midjourney has not made its architecture details publicly available, but it is reasonable to assume it leverages diffusion models in some fashion.
- While DALL-E 2 creates more realistic images, Midjourney shines in adapting real art styles to create an image of any combination of things your heart desires.

DALL-E 2
- DALL-E 2, created by OpenAI, uses diffusion models to create its images.
- DALL-E 2 can make realistic edits to existing images from a natural language caption.
- It can add and remove elements while taking shadows, reflections, and textures into consideration.
- DALL-E 2 has learned the relationship between images and the text used to describe them.
- It uses diffusion, which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.
- OpenAI has limited the ability for DALL-E 2 to generate violent, hate, or adult images.
- By removing the most explicit content from the training data, OpenAI has minimized DALL-E 2’s exposure to these concepts.
- They have also used advanced techniques to prevent photorealistic generations of real individuals’ faces, including those of public figures.
- Among the most important building blocks in the DALL-E 2 architecture is CLIP, which functions as the main bridge between text and images.
Related: CLIP (Contrastive Language-Image Pre-Training)
- While CLIP does not use a diffusion model, it is essential to understand DALL-E 2 so let’s do a quick recap of CLIP’s architecture.
- CLIP is a neural network trained on a variety of (image, text) pairs.
- It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.
- CLIP is a multi-modal vision and language model.
- It can be used for image-text similarity and for zero-shot image classification.
- CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text features.
- Both the text and visual features are then projected to a latent space with identical dimensions. The dot product between the projected image and text features is then used as a similarity score.
- CLIP enables us to take textual phrases and understand how they map onto images.
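- For intuition, here is a minimal sketch of CLIP-style image-text scoring using the HuggingFace `transformers` implementation; the `openai/clip-vit-base-patch32` checkpoint, the example image URL, and the candidate captions are illustrative assumptions.

```python
# Minimal sketch: score a set of captions against one image with CLIP
# (checkpoint, image URL, and captions are illustrative assumptions).
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a diffusion model"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image and each caption.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```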
Gallery
- Showcasing a few images generated via Diffusion Models along with their text prompts given:
- A Corgi puppy painted like the Mona Lisa:

- Beyonce sitting at a desk and coding:

- Snow in Hawaii:

- Sun coming in from a big window with curtains and casting a shadow on the rest of the room, artistic style:

- The Taj Mahal painted in Starry Night by Vincent Van Gogh:

FAQs
At a high level, how do diffusion models work? What are some other models that are useful for image generation, and how do they compare to diffusion models?
High-Level Overview of Diffusion Models
- Diffusion models are a type of generative model that creates data by gradually denoising a sample from a noise distribution. The process involves two main phases: a forward diffusion process that corrupts the data by adding noise, and a reverse denoising process that learns to remove the noise step-by-step to recover the original data. Here’s a high-level breakdown:
Forward Diffusion Process
- Start with a Data Sample: Begin with a real data sample, such as an image.
- Add Noise Incrementally: Over a series of steps \(t\), progressively add Gaussian noise to the sample. The amount of noise added at each step is controlled by a noise schedule \(\beta_t\). \(x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon, \quad \epsilon \sim N(0, I)\)
- Result in Noisy Data: By the final step, the data is almost completely transformed into pure noise.
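- As a concrete illustration of the update above, here is a minimal sketch of one forward noising step; the tensor shapes and the linear `betas` schedule are illustrative assumptions.

```python
# Minimal sketch of a single forward diffusion step:
# x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps, with alpha_t = 1 - beta_t.
import torch

timesteps = 1000
betas = torch.linspace(1e-4, 0.02, timesteps)   # illustrative linear noise schedule
alphas = 1.0 - betas

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    eps = torch.randn_like(x_prev)              # fresh Gaussian noise at this step
    return torch.sqrt(alphas[t]) * x_prev + torch.sqrt(1.0 - alphas[t]) * eps

x = torch.randn(1, 3, 32, 32)                   # stand-in for a normalized image
for t in range(timesteps):
    x = forward_step(x, t)                      # after many steps, x is close to pure noise
```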
Reverse Denoising Process
- Start with Noise: Begin with a sample of pure noise.
- Learn to Remove Noise: A neural network is trained to predict and remove the added noise at each step, effectively denoising the sample. \(p_\theta(x_{t-1} \mid x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\)
- Recover Original Data: Iteratively apply the denoising steps to transform the noise back into a data sample that resembles the original data distribution.
Other Models for Image Generation
- Several other models are commonly used for image generation, each with unique characteristics and methodologies. Here are some notable ones:
Generative Adversarial Networks (GANs)
- How They Work:
- Two Networks: Consist of a generator and a discriminator network that compete against each other.
- Generator: Creates fake images from random noise.
- Discriminator: Tries to distinguish between real images and fake images produced by the generator.
- Adversarial Training: The generator improves to produce more realistic images as the discriminator gets better at distinguishing them.
- Comparison to Diffusion Models:
- Training Stability: GANs can be harder to train and may suffer from issues like mode collapse.
- Speed: Typically faster in generation since GANs do not require iterative denoising steps.
- Quality: Can produce high-quality images, but may lack diversity in the generated samples compared to diffusion models.
Variational Autoencoders (VAEs)
- How They Work:
- Encoder-Decoder Architecture: Consist of an encoder that maps input data to a latent space and a decoder that reconstructs the data from the latent space.
- Latent Space Sampling: Imposes a probabilistic structure on the latent space, encouraging smooth transitions and interpolation.
- Variational Inference: Uses a loss function that includes a reconstruction term and a regularization term (Kullback-Leibler divergence).
- Comparison to Diffusion Models:
- Latent Space Representation: VAEs provide an explicit latent space representation, which can be useful for tasks like interpolation and manipulation.
- Sample Quality: VAEs typically produce lower-quality images compared to GANs and diffusion models.
- Training Stability: Generally more stable and easier to train than GANs.
Autoregressive Models
- How They Work:
- Sequential Generation: Generate images pixel-by-pixel or patch-by-patch in a sequential manner.
- Conditional Dependencies: Each pixel or patch is conditioned on the previously generated ones.
- Examples: PixelRNN, PixelCNN.
- Comparison to Diffusion Models:
- Generation Time: Autoregressive models can be slow due to sequential nature.
- Quality: Can produce high-quality images with strong dependencies between pixels.
- Flexibility: Can naturally model complex dependencies but are computationally intensive.
Flow-based Models
- How They Work:
- Invertible Transformations: Use a series of invertible transformations to map data to a latent space and vice versa.
- Exact Likelihood: Allow exact computation of the data likelihood, making them powerful for density estimation.
- Examples: RealNVP, Glow.
- Comparison to Diffusion Models:
- Efficiency: Flow-based models can be efficient in both training and sampling due to invertible nature.
- Quality: Produce high-quality images but may require more complex architectures for challenging datasets.
- Interpretability: Provide explicit likelihood estimates and interpretable latent spaces.
Summary
- Diffusion Models: Offer a robust and principled approach to image generation with a focus on iterative denoising. They provide high-quality samples but can be slower due to the iterative nature.
- GANs: Known for producing very high-quality images quickly but can be challenging to train due to adversarial dynamics.
- VAEs: Provide stable training and useful latent space representations, though often at the cost of sample quality.
- Autoregressive Models: Capable of modeling complex dependencies with high-quality outputs, but slow due to sequential generation.
- Flow-based Models: Efficient and interpretable with exact likelihood estimation, balancing quality and computational requirements.
- In summary, each model type has its strengths and weaknesses, making them suitable for different applications and preferences in the trade-off between quality, efficiency, and ease of training.
What is the difference between DDPM and DDIMs models?
- Denoising Diffusion Probabilistic Models (DDPM) and Denoising Diffusion Implicit Models (DDIMs) are both types of diffusion models used for generative tasks, but they differ in their approach to the reverse diffusion process, which leads to differences in their efficiency and the quality of generated samples. Here’s a detailed explanation of both models and their differences:
DDPM
- DDPMs are a class of generative models that create data by reversing a Markovian diffusion process. The diffusion process gradually adds noise to the data in several steps until it becomes nearly pure Gaussian noise. The model then learns to reverse this process, step by step, to generate new data samples.
Key Characteristics
- Forward Process:
- The forward diffusion process adds Gaussian noise to data over \(T\) timesteps.
- Each step is defined as: \(q(x_t \mid x_{t-1}) = N(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)\)
- \(\beta_t\) is the noise schedule, typically increasing linearly or following another schedule over time.
- Reverse Process:
- The reverse process is learned using a neural network to approximate the conditional probabilities: \(p_\theta(x_{t-1} \mid x_t) = N(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\)
- The mean \(\mu_\theta\) and variance \(\Sigma_\theta\) are predicted by the neural network.
- Training:
- The model is trained to minimize the variational bound on the data likelihood, which involves matching the reverse process to the true posterior of the forward process.
- Sampling:
- Sampling involves running the reverse process starting from Gaussian noise \(x_T\), iteratively refining it to produce a sample \(x_0\).
Advantages and Disadvantages
- Advantages:
- Generates high-quality samples.
- Well-grounded in probabilistic principles, leading to stable training.
- Disadvantages:
- The reverse process is slow because it involves many iterative steps.
- Each step requires a neural network forward pass, making the sampling process computationally expensive.
DDIMs
- DDIMs are a variation of diffusion models that introduce a non-Markovian forward process, which allows for a more efficient reverse process. The key idea is to find a deterministic mapping that approximates the same data distribution as the original Markovian process used in DDPMs.
Key Characteristics
- Forward Process:
- The forward process in DDIMs can be viewed as a non-Markovian process that achieves the same goal of perturbing data into noise.
- Instead of a strict Markov chain, DDIMs introduce a sequence of latent variables that allow skipping steps while preserving the ability to reverse the process.
- Reverse Process:
- The reverse process becomes deterministic, significantly speeding up the sampling process.
- The reverse step is defined by a deterministic mapping, approximating the reverse diffusion without needing as many steps as DDPMs.
- This is achieved through a reparameterization that relates the noise-added data at different timesteps directly.
- Training:
- Training is similar to DDPMs but leverages the deterministic nature of the reverse process for improved efficiency.
- Sampling:
- Sampling in DDIMs can be done with fewer steps while still producing high-quality samples.
- The deterministic reverse process can potentially offer more control over the generation process, enabling finer adjustments to the generated data.
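- To make the deterministic update concrete, a single DDIM step (with \(\eta = 0\), i.e., no injected noise) can be sketched roughly as follows; the noise-prediction network `eps_model` and the precomputed cumulative products `alphas_cumprod` are assumed to come from a trained DDPM-style model.

```python
# Rough sketch of one deterministic DDIM update (eta = 0); eps_model and alphas_cumprod
# are assumed to come from a trained DDPM-style model and its noise schedule.
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alphas_cumprod):
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long))
    # Predict x_0 from the current noisy sample and the predicted noise.
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    # Move deterministically to the previous (less noisy) timestep: no fresh noise is added,
    # which is what allows DDIMs to skip timesteps during sampling.
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
```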
Advantages and Disadvantages
- Advantages:
- Faster sampling compared to DDPMs due to the deterministic reverse process.
- Fewer sampling steps needed while maintaining or even improving sample quality.
- Disadvantages:
- The theoretical underpinnings are less straightforward compared to the probabilistic foundations of DDPMs.
- Potentially less flexibility in certain applications where stochasticity in the reverse process is beneficial.
Key Differences
- Process Type:
- DDPM: Markovian forward process with a stochastic reverse process.
- DDIMs: Non-Markovian forward process with a deterministic reverse process.
- Sampling Efficiency:
- DDPM: Requires many reverse steps, making it computationally expensive.
- DDIMs: Achieves faster sampling with fewer steps.
- Reverse Process:
- DDPM: Stochastic reverse process, which involves sampling from a learned Gaussian distribution at each step.
- DDIMs: Deterministic reverse process, which directly maps noisy data to clean data without stochastic sampling.
- Complexity and Flexibility:
- DDPM: More flexible in representing complex distributions due to the stochastic nature of the reverse process.
- DDIMs: More efficient and potentially more controllable but may be less flexible in certain scenarios.
- In summary, while both DDPM and DDIMs are powerful diffusion-based generative models, DDIMs offer a more efficient sampling process by employing a deterministic reverse process, leading to faster generation of samples without compromising quality. DDPMs, on the other hand, are grounded in a robust probabilistic framework, making them more flexible but slower in practice.
In diffusion models, there is a forward diffusion process and a reverse diffusion/denoising process. When do you use which during training and inference?
- In diffusion models, which are a class of generative models, the forward diffusion process and the denoising process play distinct roles during training and inference. Understanding when and how these processes are used is key to grasping how diffusion models work.
- Forward Diffusion Process
- During Training:
- Noise Addition: In the forward diffusion process, noise is gradually added to the data over several steps or iterations. This process transforms the original data into a pure noise distribution through a predefined sequence of steps.
- Training Objective: The model is trained to predict the noise that was added at each step. Essentially, it learns to reverse the diffusion process.
- During Inference:
- Not Directly Used: The forward diffusion process is not explicitly used during inference. However, the knowledge gained during training (about how noise is added) is implicitly used to guide the denoising process.
- Denoising Process
- During Training:
- Learning to Reverse Noise: The model learns to denoise the data, i.e., to reverse the forward diffusion process. It does this by predicting the noise that was added at each step during the forward diffusion and then subtracting this noise.
- Parameter Optimization: The parameters of the model are optimized to make accurate predictions of the added noise, thereby learning to gradually denoise the data back to its original form.
- During Inference:
- Data Generation: The denoising process is the key to generating new data. Starting from pure noise, the model iteratively denoises this input, using the reverse of the forward process, to generate a sample.
- Iterative Refinement: At each step, the model predicts the noise to remove, effectively refining the sample from random noise into a coherent output.
- Summary
- Training Phase: Both the forward diffusion (adding noise) and the denoising (removing noise) processes are actively used. The model learns how to reverse the gradual corruption of the data (caused by adding noise) by being trained to predict and remove the noise at each step.
- Inference Phase: Only the denoising process is used, where the model starts with noise and iteratively applies the learned denoising steps to generate a sample. The forward process is not explicitly run during inference, but its principles underpin the reverse process.
- In essence, the forward diffusion process is crucial for training the model to understand and reverse the noise addition, while the denoising process is used both in training (to learn this reversal) and in inference (to generate new data).
What are the loss functions used in Diffusion Models?
- Here are some common loss functions used in diffusion models:
Mean Squared Error (MSE)
- Description: The Mean Squared Error loss function measures the average of the squares of the errors between the predicted and actual values.
- Mathematical Formulation:
\[L_{\text{MSE}} = \mathbb{E}_{x_0, \epsilon, t} \left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2 \right]\]
- Use in Diffusion Models: MSE is commonly used in the denoising score matching variant of diffusion models. Here, the model \(\epsilon_\theta\) learns to predict the noise added to the original data at each diffusion step. The loss function measures how well the model can predict this noise.
Denoising Score Matching (DSM)
- Description: Denoising Score Matching (DSM) is a loss function used to train models to estimate the score function, which is the gradient of the log probability density function of the data. The model learns to predict the noise added to the data, effectively denoising it.
- Mathematical Formulation:
\[L_{\text{DSM}} =\mathbb{E}_{p_{\text{data}}(x),p_{\sigma}(\tilde{x}\mid x)} \left[ \left\lVert s_\theta(\tilde{x}, \sigma) -\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}\mid x) \right\rVert^2 \right]\]
- where \(s_\theta(\tilde{x}, \sigma)\) is the score function parameterized by the model, \(\tilde{x}\) is the noisy version of the data \(x\), and \(p_{\sigma}(\tilde{x} \mid x)\) is the noise distribution.
- Use in Diffusion Models: DSM is particularly used in Score-Based Generative Models (SGMs), which are a type of diffusion model. The model learns to predict the score (gradient of the log probability) of the data distribution at different noise levels.
Integration with MSE
- Denoising Score Matching can be viewed as a specific case of MSE where the target is the gradient of the log probability density function. In practice, DSM can be integrated into an MSE framework:
- Mathematical Formulation:
\[L_{\text{MSE-DSM}} =\mathbb{E}_{p_{\text{data}}(x),p_{\sigma}(\tilde{x}\mid x)} \left[ \left\lVert s_\theta(\tilde{x}, \sigma) +\frac{\tilde{x} - x}{\sigma^2} \right\rVert^2 \right]\]
- where the target \(-\frac{\tilde{x} - x}{\sigma^2}\) is the gradient of the log probability of the Gaussian noise distribution, making this a practical implementation of DSM within an MSE framework.
Kullback-Leibler Divergence (KL Divergence)
- Description: KL Divergence is a measure of how one probability distribution diverges from a second, expected probability distribution.
- Mathematical Formulation:
\[L_{\text{KL}}(P || Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}\]
- Use in Diffusion Models: In variational diffusion models, KL Divergence is used to regularize the latent space distribution to match a prior distribution (often a standard normal distribution). This helps in ensuring that the learned latent space is structured and follows the desired distribution.
Negative Log-Likelihood (NLL)
- Description: Negative Log-Likelihood measures the likelihood of the data under the model, with a higher likelihood indicating a better model.
- Mathematical Formulation:
\[L_{\text{NLL}} = - \log P(x)\]
- Use in Diffusion Models: In continuous diffusion models, NLL can be used to maximize the likelihood of the data given the reverse diffusion process. This involves computing the likelihood of generating the data from the noise distribution.
Evidence Lower Bound (ELBO)
- Description: ELBO is used in variational inference to approximate the true posterior distribution by maximizing a lower bound on the evidence.
- Mathematical Formulation:
\[L_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) || p(z))\]
- Use in Diffusion Models: In variational diffusion models, ELBO is used to optimize the generative process by balancing the reconstruction accuracy and the regularization of the latent space. The first term ensures that the model can reconstruct the input data, while the second term ensures that the latent space follows the desired distribution.
Hybrid Loss
- Description: Hybrid loss combines multiple loss functions to leverage their strengths and mitigate their weaknesses.
- Mathematical Formulation:
\[L_{\text{hybrid}} = \lambda_1 L_{\text{MSE}} + \lambda_2 L_{\text{KL}}\]
- Use in Diffusion Models: Hybrid loss functions can be designed to combine MSE for denoising with KL Divergence to regularize the latent space (weighted by \(\lambda_1\) and \(\lambda_2\)), allowing for both effective denoising and structured latent space learning.
Cross-Entropy Loss
- Description: Cross-Entropy Loss measures the difference between two probability distributions and is often used for classification tasks.
- Mathematical Formulation:
\[L_{\text{CE}} = - \sum_{x \in X} P(x) \log Q(x)\]
- Use in Diffusion Models: In discrete diffusion models, cross-entropy loss can be used when the model is learning to predict categorical distributions at each diffusion step, such as predicting discrete tokens in a text generation task.
Variational Bound Loss
- Description: This loss is a combination of reconstruction loss and a KL divergence term, ensuring that the model generates samples that are close to the true data distribution.
- Mathematical Formulation:
\[L_{\text{VB}} = \mathbb{E}_q\left[ D_{\text{KL}}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t > 1} D_{\text{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right]\]
- Use in Diffusion Models: Variational bound loss is often used in continuous-time diffusion models to learn the reverse diffusion process by approximating the true posterior distribution.
Common Loss Functions in Diffusion Models
- Mean Squared Error (MSE): Measures the average of the squares of the errors between the predicted and actual values. Practically used as Denoising Score Matching (DSM) which trains models to estimate the score function, which is the gradient of the log probability density function of the data.
- Kullback-Leibler Divergence (KL Divergence): Measures how one probability distribution diverges from a second, expected probability distribution.
- Negative Log-Likelihood (NLL): Measures the likelihood of the data under the model.
- Evidence Lower Bound (ELBO): Used in variational inference to approximate the true posterior distribution.
- Hybrid Loss: Combines multiple loss functions to leverage their strengths.
- Cross-Entropy Loss: Measures the difference between two probability distributions, used for classification tasks.
- Variational Bound Loss: Combines reconstruction loss and a KL divergence term.
What is the Denoising Score Matching Loss in Diffusion models? Provide equation and intuition.
- The Denoising Score Matching Loss is a critical component in the training of diffusion models, a class of generative models. This loss function is designed to train the model to effectively reverse a diffusion process, which gradually adds noise to the data over a series of steps.
- Denoising Score Matching Loss: Equation and Intuition
- Background:
- In diffusion models, the data is incrementally noised over a sequence of steps. The reverse process, which the model learns, involves denoising or reversing this noise addition to recreate the original data from noise.
- Equation:
- The denoising score matching loss at a particular timestep \(t\) can be formulated as:
\[L(\theta)=\mathbb{E}_{x_0, \epsilon \sim N(0, I), t}\left[\left\lVert s_\theta\left(x_t, t\right)-\nabla_{x_t} \log p_{t \mid 0}\left(x_t \mid x_0\right)\right\rVert^2\right]\]
- where \(x_0\) is the original data, \(x_t\) is the noised data at timestep \(t\), and \(\epsilon\) is the added Gaussian noise.
- \(s_\theta\left(x_t, t\right)\) is the score (gradient of the log probability) predicted by the model with parameters \(\theta\).
- \(\nabla_{x_t} \log p_{t \mid 0}\left(x_t \mid x_0\right)\) is the true score, which is the gradient of the log probability of the noised data \(x_t\) conditioned on the original data \(x_0\).
- Intuition:
- The loss function encourages the model to predict the gradient of the log probability of the noised data with respect to the data itself. Essentially, it’s training the model to estimate how to reverse the diffusion process at each step.
- By minimizing this loss, the model learns to approximate the reverse of the noising process, thereby learning to generate data starting from noise.
- This process effectively teaches the model the denoising direction at each step of the noised data, guiding it on how to gradually remove noise and reconstruct the original data.
- Importance in Training: The denoising score matching loss is crucial for training diffusion models to generate high-quality samples. It ensures that the model learns a detailed and accurate reverse mapping of the diffusion process, capturing the complex data distribution.
- Advantages: This approach allows diffusion models to generate samples that are often of higher quality and more diverse compared to other generative models, as it carefully guides the generative process through the learned noise reversal.
- In summary, the denoising score matching loss in diffusion models is fundamental in training these models to effectively reverse the process of gradual noise addition, enabling the generation of high-quality data samples from a noise distribution. This loss function is key to the model’s ability to learn the intricate details of the data distribution and the precise dynamics of the denoising process.
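- In practice, for the Gaussian forward process used in DDPMs, the true score is \(\nabla_{x_t} \log p_{t \mid 0}(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar{\alpha}_t}\), so denoising score matching reduces (up to a per-timestep weighting) to the familiar noise-prediction MSE. Below is a minimal sketch of this practical form; the noise-prediction network `eps_model` and the precomputed `alphas_cumprod` are assumptions standing in for a real model and schedule.

```python
# Sketch of the practical denoising-score-matching objective for a DDPM-style model;
# eps_model and alphas_cumprod are assumed to be defined elsewhere.
import torch
import torch.nn.functional as F

def dsm_loss(eps_model, x0, alphas_cumprod):
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    # Closed-form forward noising: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    # Predicting eps is equivalent (up to scaling) to predicting the score -eps / sqrt(1 - a_bar).
    return F.mse_loss(eps_model(x_t, t), eps)
```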
What does the “stable” in stable diffusion refer to?
- The “stability” in Stable Diffusion refers to maintaining image content in the latent space throughout the diffusion process. In diffusion models, the image is transformed from the pixel space to the “latent space” – a lower-dimensional, abstract representation of the image. Here are the differences between the two:
- Pixel Space:
- This refers to the space in which the data (such as images) is represented in its raw form – as pixels.
- Each dimension corresponds to a pixel value, so an image of size 100x100 would have a pixel space of 10,000 dimensions.
- Pixel space representations are direct and intuitive but can be very high-dimensional and sparse for complex data like images.
- Latent Space:
- Latent space is a lower-dimensional space where data is represented in a more compressed and abstract form.
- Generative models, like Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), encode high-dimensional data (from pixel space) into this lower-dimensional latent space.
- The latent representation captures the essential features or characteristics of the data, allowing for more efficient processing and manipulation.
- Operations and transformations are often performed in latent space because they can be more meaningful and computationally efficient. For example, interpolating between two points in latent space can result in a smooth transition between two images when decoded back to pixel space.
- The “Stable” in Stable Diffusion refers to the fact that the forward and reverse diffusion processes occur in a low-dimensional latent space rather than a high-dimensional pixel space, leading to stability during diffusion. If the latent space becomes unstable and loses image content too quickly, the generated pixel-space images will be poor.
- Stable diffusion uses techniques to keep the latent space more stable throughout the diffusion process:
- The denoising model tries to remove noise while preserving latent space content at each step.
- Regularization prevents the denoising model from changing too drastically between steps.
- Careful noise scheduling maintains stability in early diffusion steps.
- This stable latent space leads to higher quality pixel generations. At the end, Stable Diffusion transforms the image from latent space back to the pixel space.
How do you condition a diffusion model to the textual input prompt?
- Conditioning a diffusion model on a textual input prompt is a key technique in generating content that aligns closely with textual descriptions, particularly useful in applications such as text-to-image generation. This process involves several steps and components to effectively integrate text-based conditioning into the generative process of diffusion models like DALL-E, Imagen, or similar systems. Here’s a detailed explanation of how it works:
Text Encoding
- Text Encoder: The first step involves encoding the textual prompt into a continuous vector representation that the model can utilize. This is typically done using a transformer-based text encoder, such as those found in language models (BERT, GPT, etc.). The encoder translates the text prompt into a high-dimensional space, capturing semantic and syntactic nuances of the input text.
- Embeddings: The output of the text encoder is a set of embeddings or feature vectors that represent different parts or aspects of the text. These embeddings serve as the basis for conditioning the diffusion process.
Integrating Text Embeddings into the Diffusion Process
- Conditioning Layer: In the architecture of the diffusion model, there are typically one or more layers specifically designed to integrate the text embeddings with the image generation process. This is usually done through mechanisms such as cross-attention, where the evolving image features (queries) attend to the text embeddings (keys and values) at various stages of the diffusion process.
- Guidance Mechanisms: Techniques like classifier-free guidance can be employed, where the model is trained to generate both conditioned (on text) and unconditioned (no text) samples. During inference, the model uses a guidance scale to adjust the strength of the conditioning, amplifying the influence of the text on the generated images (a minimal sketch of this combination step is shown below).
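- To make the classifier-free guidance idea concrete, here is a rough sketch of how the two noise predictions are combined at each sampling step; the network signature `eps_model(x_t, t, emb)`, the empty-prompt embedding `null_emb`, and the guidance scale value are illustrative assumptions.

```python
# Rough sketch of classifier-free guidance at a single denoising step; eps_model,
# text_emb, and null_emb (the embedding of an empty prompt) are assumed to exist.
import torch

@torch.no_grad()
def guided_eps(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    eps_cond = eps_model(x_t, t, text_emb)    # prediction conditioned on the prompt
    eps_uncond = eps_model(x_t, t, null_emb)  # unconditional prediction (empty prompt)
    # Push the prediction away from the unconditional direction and toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```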
Reverse Diffusion with Textual Guidance
- Starting from Noise: The diffusion model typically starts with a sample drawn from a Gaussian distribution (i.e., a noisy image) and progressively denoises it through a series of steps.
- Conditional Denoising Steps: During each step of the reverse diffusion process, the model consults the text embeddings to adjust the denoising trajectory. This is done by calculating how the current state of the image needs to be altered to better reflect the textual prompt, using the gradients of the loss function that measures the difference between the current image and the target condition.
- Iterative Refinement: With each step, the model refines the image, increasingly aligning it with the conditioning text. This involves repeatedly applying the learned conditional distributions to reduce noise and enhance details that correspond to the text.
Sampling and Optimization
- Dynamic Adjustments: Throughout the reverse diffusion process, parameters such as the guidance scale can be adjusted to increase or decrease the influence of the text embeddings, allowing for dynamic control over the fidelity and creativity of the generated outputs.
- Optimization Techniques: Advanced sampling techniques like Langevin dynamics or ancestral sampling may be used to navigate the probability distributions effectively, ensuring high-quality generation that closely matches the conditioning text.
Evaluation and Fine-Tuning
- Quality and Relevance Checks: The outputs are typically evaluated for both quality (visual, aesthetic) and relevance (accuracy in reflecting the text prompt). Feedback from these evaluations can be used to fine-tune the text encoder, conditioning layers, or other components of the model.
- User Interaction: In practical applications, users might interact with the model by tweaking the text prompt or adjusting control parameters to iteratively refine the output until it meets their requirements.
- In summary, conditioning diffusion models on textual input requires a sophisticated interplay of text encoding, model architecture adaptations, and careful management of the generative process. This complexity allows the models to produce remarkably accurate visual representations from textual descriptions, enabling a wide range of applications from art generation to functional design assistance.
In the context of diffusion models, what role does cross attention play? How are the \(Q\), \(K\), and \(V\) abstractions modeled for diffusion models?
- In the context of diffusion models, particularly those that are used for generating images conditioned on text (like DALL-E 2 or Imagen), cross-attention plays a crucial role in integrating information from different modalities, typically text and images. Here’s how cross-attention is used and how the Query (\(Q\)), Key (\(K\)), and Value (\(V\)) components are modeled within such systems:
Role of Cross-Attention in Diffusion Models
- Text-to-Image Synthesis: In diffusion models designed for tasks like text-to-image generation, cross-attention mechanisms enable the model to effectively align and utilize textual information to guide the image generation process. This is critical for producing images that accurately reflect the content described by the input text.
- Conditional Generation: Cross-attention allows the diffusion model to focus on specific aspects of the text throughout the various steps of the diffusion process. This dynamic focusing is key to iteratively refining the generated image to better match the textual description.
Modeling \(Q\), \(K\), and \(V\) in Diffusion Models
- Source of \(Q\), \(K\), and \(V\): In a typical setup for a text-to-image diffusion model (e.g., latent diffusion), the evolving image representations (as the image is gradually denoised through the reverse diffusion process) are used to produce the queries (\(Q\)), while the text input is encoded into a series of embeddings that are used to generate the keys (\(K\)) and values (\(V\)).
Detailed Steps
- Text Encoding:
- The text description is processed by a text encoder (often a Transformer-based model), which converts the input text into a series of embeddings. These embeddings serve as the keys (\(K\)) and values (\(V\)) in the cross-attention mechanism. They represent the textual content the model can draw on while forming the image.
- Image Representation:
- At each step of the reverse diffusion process, the partially denoised image (latent) is encoded into spatial tokens that produce the queries (\(Q\)). Each query asks which parts of the text description are relevant to that region of the image.
- The values carry the textual content that is injected into each image region, weighted by how well the image queries align with the text keys as determined by the attention scores.
- Attention Calculation:
- Cross-attention calculates how much each part of the image (queries) should be influenced by each part of the text (keys and values). This is done by computing attention scores based on the similarity between queries and keys. These scores dictate how strongly each textual value contributes to updating the corresponding image region.
- Iterative Refinement:
- During the reverse diffusion process, this cross-attention guided adjustment happens iteratively. With each step, the model refines the image further, enhancing areas of the image that need more detail or correction as per the text description.
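- Below is a minimal sketch of such a cross-attention layer as it is typically wired in latent diffusion U-Nets; the dimensions, module names, and token counts are illustrative assumptions rather than the exact implementation of any particular model.

```python
# Minimal sketch of text-conditioning cross-attention (dimensions are illustrative).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, img_dim=320, txt_dim=768, inner_dim=320):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(img_dim, inner_dim, bias=False)  # queries from image latents
        self.to_k = nn.Linear(txt_dim, inner_dim, bias=False)  # keys from text embeddings
        self.to_v = nn.Linear(txt_dim, inner_dim, bias=False)  # values from text embeddings
        self.to_out = nn.Linear(inner_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        q = self.to_q(img_tokens)                      # (B, N_img, d)
        k = self.to_k(txt_tokens)                      # (B, N_txt, d)
        v = self.to_v(txt_tokens)                      # (B, N_txt, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N_img, N_txt) similarity scores
        attn = attn.softmax(dim=-1)
        return self.to_out(attn @ v)                   # text-informed update of image tokens

# Illustrative shapes: a 64x64 latent flattened to 4096 spatial tokens, 77 text tokens.
layer = CrossAttention()
out = layer(torch.randn(2, 4096, 320), torch.randn(2, 77, 768))
```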
Conclusion
- In diffusion models, cross-attention is a powerful tool for bridging the gap between textual descriptions and visual content, ensuring that the generated images are not only high-quality but also contextually accurate. The interaction between \(Q\), \(K\), and \(V\) within the cross-attention layers effectively enables the model to “attend” to relevant textual features while translating these cues into visual modifications, thereby tailoring the image generation process to the specifics of the input text.
How is randomness in the outputs induced in a diffusion model?
- Diffusion models inherently introduce randomness in their outputs as part of the generative process, which is a key feature allowing these models to produce diverse and high-quality samples. Here’s how randomness is systematically incorporated into the operation of diffusion models:
The Basic Framework of Diffusion Models
- Diffusion models operate on a principle of gradually adding noise to the data over a series of steps (forward process) and then learning to reverse this process to generate data from noise (reverse process). This structure is inherently probabilistic and relies heavily on randomness at multiple stages:
- Forward Process (Noise Addition): In the forward process, data is progressively corrupted by adding Gaussian noise in a sequence of steps until it becomes indistinguishable from Gaussian noise. The noise levels typically increase according to a predetermined schedule, which is crucial for the model to learn the characteristics of the data at various levels of corruption.
- Reverse Process (Noise Removal/Denoising): The reverse process is where the model generates new data by starting with pure noise and progressively denoising it. This process is guided by the learned parameters but is fundamentally random due to the stochastic nature of the process and the initial noise state.
Randomness in Sampling
- The core mechanism through which randomness influences the outputs of diffusion models is the sampling process during the reverse diffusion:
- Stochastic Sampling: At each step of the reverse process, the model predicts the mean and variance of the conditional distribution of the denoised data given the current noisy data. A sample is then drawn from this conditional distribution, typically assumed to be Gaussian. This sampling introduces randomness because the exact point sampled from the distribution can vary, leading to different outcomes each time the process is run.
- Parameterization of Noise Levels: The variance of the noise added at each step can be a critical parameter that controls the amount of randomness. By adjusting this variance, one can control the diversity of the generated samples. Higher variance typically leads to more randomness and hence more diverse outputs.
Conditional Generation
- In conditional diffusion models, such as those conditioned on text for image generation, randomness is also introduced in how the conditioning information influences the generation:
- Conditioning Mechanism: Although the text or other conditioning data guides the generation process, the interpretation of this data by the model can introduce variations. For instance, the text “a cat sitting on a mat” could lead to images of different cats, different mats, or different settings, depending on the randomness in the sampling steps of the model.
- Influence of Latent Variables: Some diffusion models integrate latent variables that capture aspects of the data not specified by the conditioning input. These latent variables add another layer of randomness, allowing for variations in features that are not explicitly controlled by the input conditions.
Temperature Scaling
- Temperature scaling is a technique used in many generative models to control the randomness of the outputs:
- Temperature Factor: By adjusting a temperature parameter in the noise distribution (especially in the variance), the model can be made to produce more or less random (diverse) outputs. Lower temperatures lead to less noise and often more coherent, deterministic, and conservative outputs, while higher temperatures increase randomness and diversity.
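- As a rough sketch of what such a temperature factor could look like, the snippet below scales the stochastic part of a reverse step; the stand-in mean and variance tensors and the specific temperature value are illustrative assumptions mirroring the quantities a denoising model would predict at a given step.

```python
# Illustrative temperature-scaled reverse step: temperature < 1 damps the injected noise
# (more conservative samples), temperature > 1 increases diversity. The mean and variance
# below are stand-ins for the quantities predicted by the denoising model at one step.
import torch

model_mean = torch.zeros(1, 3, 32, 32)                     # stand-in for the predicted posterior mean
posterior_variance_t = torch.full_like(model_mean, 0.01)   # stand-in for the step variance
temperature = 0.8                                          # illustrative temperature factor

x_prev = model_mean + temperature * torch.sqrt(posterior_variance_t) * torch.randn_like(model_mean)
```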
Conclusion
- Randomness in diffusion models is fundamental to their design and functionality. It allows these models to generate diverse and creative outputs from a probabilistic foundation. The control of this randomness through model design and sampling parameters is key to harnessing diffusion models for practical applications, ensuring a balance between diversity, creativity, and fidelity to any conditioning inputs.
How does the noise schedule work in diffusion models? What are some standard noise schedules?
- Diffusion models are a class of generative models that learn to generate data by iteratively denoising a sample, starting from pure noise. A crucial component of these models is the noise schedule, which determines how noise is added during the forward diffusion process and how it is removed during the reverse denoising process.
Noise Schedule in Diffusion Models
- The noise schedule in diffusion models defines the variance of the noise added at each step during the forward process. This schedule affects the quality of the generated samples and the efficiency of the learning process. The noise schedule is often described by a series of variance values \(\beta_t\) or their cumulative products \(\alpha_t\) and \(\bar{\alpha}_t\), where \(t\) denotes the time step.
Forward Diffusion Process
- In the forward process, noise is added to the data at each step \(t\) according to a predefined schedule:
\[q(x_t \mid x_{t-1}) = N(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)\]
- Here, \(\beta_t\) represents the variance of the noise added at step \(t\). The relationship between the cumulative products and variances is given by:
\[\alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\]
- The above expressions allow us to express the noisy sample at any step \(t\) as:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\]
- where \(\epsilon \sim N(0, I)\) is standard Gaussian noise, and \(x_0\) is the original data point.
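- The closed form above is what makes training efficient: any \(x_t\) can be sampled directly from \(x_0\) without simulating every intermediate step. Here is a minimal sketch; the linear `betas` schedule and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of closed-form forward noising using the cumulative products
# (the linear beta schedule is an illustrative assumption).
import torch

timesteps = 1000
betas = torch.linspace(1e-4, 0.02, timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def q_sample(x0, t, noise=None):
    noise = torch.randn_like(x0) if noise is None else noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

x0 = torch.randn(4, 3, 32, 32)                          # stand-in for normalized images
x_t = q_sample(x0, torch.tensor([10, 100, 500, 999]))   # different corruption levels per example
```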
Reverse Denoising Process
- The reverse process involves learning to denoise the samples iteratively, starting from \(x_T\), which is almost pure noise, to \(x_0\). The model is trained to approximate the reverse conditional distributions \(p_\theta(x_{t-1} \mid x_t)\), typically parameterized as Gaussian distributions whose means and variances depend on the current step \(t\) and the model’s parameters \(\theta\).
Standard Noise Schedules
- Several noise schedules have been proposed and used in practice, each with different properties and trade-offs. Some standard noise schedules include (see the sketch after this list for a concrete comparison):
- Linear Schedule: The variances \(\beta_t\) are increased linearly from \(\beta_1\) to \(\beta_T\): \(\beta_t = \beta_{\text{start}} + \frac{t}{T} (\beta_{\text{end}} - \beta_{\text{start}})\). This schedule is simple and often used as a baseline.
- Cosine Schedule: The variances are defined using a cosine function, which provides a smooth transition and is empirically found to perform well: \(\bar{\alpha}_t = \cos\left(\frac{t / T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2\), where \(s\) is a small constant to avoid zero at \(t=0\).
- Quadratic Schedule: The variances \(\beta_t\) follow a quadratic function: \(\beta_t = \beta_{\text{start}} + (\beta_{\text{end}} - \beta_{\text{start}}) \cdot (t/T)^2\)
- Exponential Schedule: The variances increase exponentially: \(\beta_t = \beta_{\text{start}} \cdot \left(\frac{\beta_{\text{end}}}{\beta_{\text{start}}}\right)^{t/T}\)
- Constant Schedule: The variances remain constant throughout the process: \(\beta_t = \text{constant}\)
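- As a concrete comparison, the sketch below builds linear and cosine \(\beta_t\) schedules following the formulas above; the endpoint values \(\beta_{\text{start}} = 10^{-4}\), \(\beta_{\text{end}} = 0.02\), and the offset \(s = 0.008\) are common choices from the literature, used here as illustrative assumptions.

```python
# Sketch of two standard noise schedules; endpoint values and the offset s are
# common choices from the literature, used here as illustrative assumptions.
import torch

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    # beta_t increases linearly from beta_start to beta_end.
    return torch.linspace(beta_start, beta_end, timesteps)

def cosine_beta_schedule(timesteps, s=0.008):
    # Define the cumulative products via a squared cosine, then recover beta_t from
    # the ratio of consecutive cumulative products.
    t = torch.linspace(0, timesteps, timesteps + 1)
    alphas_cumprod = torch.cos(((t / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)

betas_linear = linear_beta_schedule(1000)
betas_cosine = cosine_beta_schedule(1000)
```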
Choosing a Noise Schedule
- The choice of noise schedule affects the stability and performance of the diffusion model. It is often a hyperparameter that needs to be tuned for the specific application. Here are some considerations:
- Linearity and Simplicity: Linear schedules are straightforward and often serve as a good starting point.
- Smoothness: Smoother schedules like the cosine schedule can result in more stable training and better sample quality.
- Model Capacity: More complex schedules might be beneficial if the model has high capacity and can learn intricate denoising processes.
- Empirical Performance: Often, the best schedule is determined through experimentation and empirical evaluation on the target dataset.
- In summary, the noise schedule is a critical component of diffusion models, dictating how noise is introduced and removed through the forward and reverse processes. Various schedules, such as linear, cosine, quadratic, and exponential, provide different ways to balance the trade-offs between model complexity, stability, and sample quality.
Recent Papers
High-Resolution Image Synthesis with Latent Diffusion Models
- The following paper summary has been contributed by Zhibo Zhang.
- Diffusion models are known to be computationally expensive given that they require many steps of diffusion and denoising diffusion operations in possibly high-dimensional input feature spaces.
- This paper by Rombach et al. from Ludwig Maximilian University of Munich & IWR, Heidelberg University and Runway ML in CVPR 2022 introduces diffusion models that operate on the latent space, aiming at generating high-resolution images with lower computation demands compared to those that operate directly on the pixel space.
- In particular, the authors adopted an autoencoder that compresses the input images into a lower dimensional latent space. The autoencoder relies on either KL regularization or VQ regularization to constrain the variance of the latent space.
- As shown in the illustration figure below by Rombach et al., in the latent space, the latent representation of the input image goes through a total of \(T\) diffusion operations to get the noisy representation. A U-Net is then applied on top of the noisy representation for \(T\) iterations to produce the denoised version of the representation. In addition, the authors introduced a cross attention mechanism to condition the denoising process on other types of inputs such as text and semantic maps.

- In the final stage, the denoised representation will be mapped back to the pixel space using the decoder to get the synthesized image.
- Empirically, the best-performing latent diffusion model (with a carefully chosen downsampling factor) achieved competitive FID scores in image generation compared with other state-of-the-art generative models, such as variants of generative adversarial nets, on several datasets including CelebA-HQ.
- Code
Diffusion Model Alignment Using Direct Preference Optimization
- This paper by Wallace et al. from Salesforce AI and Stanford University proposes a novel method for aligning diffusion models to human preferences.
- The paper introduces Diffusion-DPO, a method adapted from Direct Preference Optimization (DPO), for aligning text-to-image diffusion models with human preferences. This approach is a significant shift from typical language model training, emphasizing direct optimization on human comparison data.
- Unlike typical methods that fine-tune pre-trained models using curated images and captions, Diffusion-DPO directly optimizes a policy that best satisfies human preferences under a classification objective. It re-formulates DPO to account for a diffusion model notion of likelihood using the evidence lower bound, deriving a differentiable objective.
- The authors utilized the Pick-a-Pic dataset, comprising 851K crowdsourced pairwise preferences, to fine-tune the base model of the Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. The fine-tuned model showed significant improvements over both the base SDXL-1.0 and its larger variant in terms of visual appeal and prompt alignment, as evaluated by human preferences.
- The paper also explores a variant of the method that uses AI feedback, showing comparable performance to training on human preferences. This opens up possibilities for scaling diffusion model alignment methods.
- The figure below from the paper illustrates: (Top) DPO-SDXL significantly outperforms SDXL in human evaluation. (L) PartiPrompts and (R) HPSv2 benchmark results across three evaluation questions, majority vote of 5 labelers. (Bottom) Qualitative comparisons between SDXL and DPO-SDXL. DPO-SDXL demonstrates superior prompt following and realism. DPO-SDXL outputs are better aligned with human aesthetic preferences, favoring high contrast, vivid colors, fine detail, and focused composition. They also capture fine-grained textual details more faithfully.

- Experiments demonstrate the effectiveness of Diffusion-DPO in various scenarios, including image-to-image editing and learning from AI feedback. The method significantly outperforms existing models in human evaluations for general preference, visual appeal, and prompt alignment.
- The paper’s findings indicate that Diffusion-DPO can effectively increase measured human appeal across an open vocabulary with stable training, without increased inference time, and improves generic text-image alignment.
- The authors note ethical considerations and risks associated with text-to-image generation, emphasizing the importance of diverse and representative sets of labelers and the potential biases inherent in the pre-trained models and labeling process.
- In summary, the paper presents a groundbreaking approach to align diffusion models with human preferences, demonstrating notable improvements in visual appeal and prompt alignment. It highlights the potential of direct preference optimization in the realm of text-to-image diffusion models and opens avenues for further research and application in this field.
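- As a rough illustration of the objective described above, the sketch below computes a Diffusion-DPO-style loss from the noise-prediction errors of the fine-tuned and reference models on a preferred ("winner") and dispreferred ("loser") image noised at the same timestep. The per-timestep weighting is folded into \(\beta\) and the default value is only indicative; this is a simplification of the paper's objective, not the authors' code.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta=5000.0):
    """Each eps_* is a noise prediction on the noised winner/loser image at the same
    timestep, from the model being fine-tuned (theta) or the frozen reference (ref)."""
    # per-sample MSE between predicted and true noise
    mse = lambda pred, target: (pred - target).pow(2).mean(dim=tuple(range(1, pred.dim())))
    model_gap = mse(eps_theta_w, noise_w) - mse(eps_theta_l, noise_l)
    ref_gap = mse(eps_ref_w, noise_w) - mse(eps_ref_l, noise_l)
    # reward the fine-tuned model for denoising the winner better, relative to the reference
    return -F.logsigmoid(-beta * (model_gap - ref_gap)).mean()
```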
Scalable Diffusion Models with Transformers
- This paper by Peebles and Xie from UC Berkeley and New York University introduces a new class of diffusion models that leverage the Transformer architecture for generating images. This innovative approach replaces the traditional convolutional U-Net backbone in latent diffusion models (LDMs) with a transformer operating on latent patches.
- Traditional diffusion models in image-level generative tasks predominantly use a convolutional U-Net architecture. However, the dominance of transformers in various domains like natural language processing and vision prompts this exploration of their use as a backbone for diffusion models.
- The paper proposes Diffusion Transformers (DiTs), which adhere closely to the standard Vision Transformer (ViT) model but with some vital tweaks. DiTs are designed to be faithful to standard transformer architecture, particularly the Vision Transformer (ViT) model, and are trained as latent diffusion models of images.
- Transformer Blocks and Design Space:
- DiTs process input tokens transformed from spatial representations of images (“patchify” process) through a sequence of transformer blocks.
- Four types of transformer blocks are explored: in-context conditioning, cross-attention block, adaptive layer norm (adaLN) block, and adaLN-Zero block. Each block processes additional conditional information like noise timesteps or class labels.
- The adaLN-Zero block, which initializes each DiT block as an identity function and modulates the activations immediately prior to any residual connections within the block, performs best, achieving lower Frechet Inception Distance (FID) values than the other block types (a minimal sketch is given at the end of this summary).
- The figure below from the paper shows the Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.

- Model Scaling and Performance:
- DiTs are scalable in terms of forward pass complexity, measured in GFLOPs. Different configurations (DiT-S, DiT-B, DiT-L, DiT-XL) cover a range of model sizes and computational complexities.
- Increasing model size and decreasing patch size significantly improves performance. FID scores improve as the transformer becomes deeper and wider, indicating that scaling model size (GFLOPs) is key to improved performance.
- The largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 256 \(\times\) 256 and 512 \(\times\) 512 benchmarks, achieving a state-of-the-art FID of 2.27 on the former.
- Implementation and Results: The models are trained using the AdamW optimizer. The DiT-XL/2 model, trained for 7 million steps, demonstrates high compute efficiency compared to both latent and pixel space U-Net models.
- Visual Quality: The paper highlights notable improvements in the visual quality of generated images with scaling in both model size and the number of tokens processed.
- Overall, the paper showcases the potential of transformer-based architectures in diffusion models, emphasizing scalability and compute efficiency, which contributes significantly to the field of generative models for images.
- Project page
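- The adaLN-Zero block described above can be sketched as follows; layer names and the use of nn.MultiheadAttention are illustrative simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """DiT-style block with adaLN-Zero conditioning (simplified). The conditioning
    vector c (timestep + class embedding) regresses per-block scale/shift/gate values;
    the gates are zero-initialized so every block starts as the identity function."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)  # "Zero": modulation starts at 0 -> identity block
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, c):                      # x: (B, N, dim) tokens, c: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)
```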
DeepFloyd IF
- DeepFloyd, a part of Stability AI, has introduced DeepFloyd IF, a cutting-edge text-to-image cascaded pixel diffusion model known for its high photorealism and language understanding capabilities. This model is an open-source project and represents a significant advancement in text-to-image synthesis technology.
- DeepFloyd IF is built with multiple neural modules (independent neural networks that tackle specific tasks), joining forces within a single architecture to produce a synergistic effect.
- DeepFloyd IF generates high-resolution images in a cascading manner: the action kicks off with a base model that produces low-resolution samples, which are then boosted by a series of upscale models to create stunning high-resolution images, as shown in the figure (source) below.

- DeepFloyd IF’s base and super-resolution models adopt diffusion models, making use of Markov chain steps to introduce random noise into the data, before reversing the process to generate new data samples from the noise.
- DeepFloyd IF operates within the pixel space, as opposed to latent diffusion (e.g. Stable Diffusion) that depends on latent image representations.
- The unique structure of DeepFloyd IF consists of a frozen text encoder and three cascaded pixel diffusion modules. The process begins with a base model generating a 64 \(\times\) 64 pixel image from a text prompt. This is followed by two super-resolution models, escalating the resolution to 256 \(\times\) 256 pixels and then to 1024 \(\times\) 1024 pixels. All stages utilize a frozen text encoder based on the T5 transformer architecture, which extracts text embeddings. These embeddings are then fed into a U-Net architecture enhanced with cross-attention and attention pooling (a usage sketch of the cascade is given at the end of this summary).
- The figure below from the paper shows the model architecture of DeepFloyd IF.

- The efficiency and effectiveness of DeepFloyd IF are evident in its performance, where it achieved a zero-shot FID score of 6.66 on the COCO dataset. This score is a testament to its state-of-the-art capabilities, outperforming other models in the domain. The success of DeepFloyd IF underscores the potential of larger UNet architectures in the initial stages of cascaded diffusion models and opens new avenues for future advancements in text-to-image synthesis.
- Code; Project page.
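- A sketch of running the three-stage cascade with Hugging Face diffusers follows. The model IDs and call pattern mirror the publicly documented usage and should be treated as assumptions; the third stage here is a generic x4 upscaler.

```python
import torch
from diffusers import DiffusionPipeline

# stage 1: 64x64 base pixel-diffusion model (frozen T5 text encoder inside)
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0",
                                            variant="fp16", torch_dtype=torch.float16)
# stage 2: 64 -> 256 super-resolution, reusing the stage-1 text embeddings
stage_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0", text_encoder=None,
                                            variant="fp16", torch_dtype=torch.float16)
# stage 3: 256 -> 1024 upscaling
stage_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler",
                                            torch_dtype=torch.float16)
for p in (stage_1, stage_2, stage_3):
    p.enable_model_cpu_offload()

prompt = "a photo of a red panda reading a book under a tree"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds, output_type="pt").images
image = stage_3(prompt=prompt, image=image).images[0]  # final high-resolution PIL image
```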
PIXART-\(\alpha\): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
- The paper “PIXART-\(\alpha\): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis” by Chen et al. introduces PIXART-\(\alpha\), a transformer-based latent diffusion model for text-to-image (T2I) synthesis. This model competes with leading image generators such as SDXL, Imagen, and DALL-E 2 in quality, while significantly reducing training costs and CO2 emissions. Notably, it is also OPEN-RAIL licensed.
- Key Innovations:
- Efficient Architecture: PIXART-\(\alpha\) employs a Diffusion Transformer (DiT) with cross-attention modules, focusing on efficiency. This includes a streamlined class-condition branch and reparameterization for efficient training.
- Training Strategy Decomposition: The training is divided into three stages: learning pixel distributions, text-image alignment, and aesthetic enhancement.
- High-Informative Data: Utilizes an auto-labeling pipeline with LLaVA to create a dense, precise text-image dataset, improving the speed of text-image alignment learning.
- Technical Implementation:
- Text Encoding: Uses the T5-XXL model for advanced text encoding, enabling better handling of complex prompts.
- Pre-training and Stages: Incorporates pre-training on ImageNet, learning stages for pixel distribution, alignment, and aesthetics.
- Hardware Requirements: Initially requires 23GB GPU VRAM, but with diffusers, it can run under 7GB.
- The figure below from the paper shows the model architecture of PIXART-\(\alpha\). A cross-attention module is integrated into each block to inject textual conditions. To optimize efficiency, all blocks share the same adaLN-single parameters for time conditions.

- Performance and Efficiency:
- Quality and Control: Delivers high-quality image synthesis with superior semantic control.
- Resource Efficiency: Achieves near state-of-the-art quality with only 2% of the training cost of other models, reducing CO2 emissions by 90%.
- Optimization Techniques: Implements shared normalization parameters (adaLN-single) and uses AdamW optimizer to enhance efficiency.
- Applications and Extensions: Showcases versatility through methods like DreamBooth and ControlNet, further expanding its practical applications.
- PIXART-\(\alpha\) represents a major advancement in T2I generation, offering a high-quality, efficient, and environmentally friendly solution. Its unique architecture and training strategy make it an innovative addition to the field of photorealistic T2I synthesis.
- Code; Hugging Face; Project page
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
- In this technical report, Xue et al. from the University of Hong Kong and SenseTime Research introduce RAPHAEL, a novel text-to-image diffusion model that generates highly artistic images closely aligned with textual prompts.
- RAPHAEL uniquely combines tens of mixture-of-experts (MoEs) layers, including space-MoE and time-MoE layers, allowing billions of diffusion paths. Each path intuitively functions as a “painter” for depicting specific textual concepts onto designated image regions at certain diffusion timesteps. This mechanism substantially enhances the precision in aligning text and image content.
- The authors report that RAPHAEL outperforms recent models like Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2 in terms of image quality and aesthetic appeal. This is evidenced by superior performance in diverse styles (e.g., Japanese comics, realism, cyberpunk) and a state-of-the-art zero-shot FID score of 6.61 on the COCO dataset.
- An edge-supervised learning module is introduced to further refine image quality, focusing on maintaining intricate boundary details in various styles. RAPHAEL is implemented using a U-Net architecture with 16 transformer blocks, each containing a self-attention layer, a cross-attention layer, space-MoE, and time-MoE layers. The model, with three billion parameters, was trained on 1,000 A100 GPUs for two months.
- The figure below from the paper shows the framework of RAPHAEL: (a) each block contains four primary components: a self-attention layer, a cross-attention layer, a space-MoE layer, and a time-MoE layer. The space-MoE is responsible for depicting different text concepts in specific image regions, while the time-MoE handles different diffusion timesteps. Each block uses edge-supervised cross-attention learning to further improve image quality. (b) shows details of the space-MoE. For example, given the prompt “a furry bear under sky”, each text token and its corresponding image region (given by a binary mask) are directed through distinct space experts, i.e., each expert learns particular visual features for a region. By stacking several space-MoEs, thousands of text concepts can be depicted.

- The authors conducted extensive experiments, including a user study using the ViLG-300 benchmark, demonstrating RAPHAEL’s robustness and superiority in generating images that closely conform to the textual prompts. The study also showcases RAPHAEL’s flexibility in generating images of diverse styles and high resolutions up to 4096 \(\times\) 6144 when combined with a tailor-made SR-GAN model.
- RAPHAEL’s potential applications extend to various domains, with implications for both academic research and industry. The model’s limitations include the potential misuse for creating misleading or false information, a challenge common to powerful text-to-image generators.
- Project page
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
- This paper by Feng et al. from Baidu Inc. and Wuhan University of Science and Technology in CVPR 2023 focuses on enhancing text-to-image generation using diffusion models.
- They introduce ERNIE-ViLG 2.0, a large-scale Chinese text-to-image generation model, employing a diffusion-based approach with a 24B parameter scale. The model aims to significantly upgrade image quality and text relevancy.
- The model incorporates fine-grained textual and visual knowledge to improve semantic control and resolve object-attribute mismatching in image generation. This is achieved by using a text parser and an object detector to identify key elements in the text-image pair and aligning them in the learning process.
- Introduction of the Mixture-of-Denoising-Experts (MoDE) mechanism, which uses multiple specialized expert networks for different stages of the denoising process, allowing more efficient handling of the varying denoising requirements at different steps (a simplified routing sketch is given at the end of this summary).
- The figure below from the paper shows the architecture of ERNIE-ViLG 2.0, which incorporates fine-grained textual and visual knowledge of key elements in the scene and utilizes different denoising experts at different denoising stages.

- ERNIE-ViLG 2.0 demonstrates state-of-the-art performance on MS-COCO with a zero-shot FID-30k score of 6.75. It also outperforms models like DALL-E 2 and Stable Diffusion in human evaluations using a bilingual prompt set, ViLG-300, for a fair comparison between English and Chinese text-to-image models.
- The model’s implementation involves a transformer-based text encoder with 1.3B parameters, 10 denoising U-Net experts with 2.2B parameters each, and training on 320 Tesla A100 GPUs for 18 days. The dataset comprises 170M image-text pairs, including English datasets translated into Chinese.
- Ablation studies and qualitative showcases confirm the effectiveness of the proposed knowledge enhancement strategies and the MoDE mechanism. The model shows improved handling of complex prompts, better sharpness, and texture in generated images.
- Future work includes enriching external image-text alignment knowledge and expanding the usage of multiple experts to advance generation capabilities. The paper also discusses potential risks and limitations related to data bias and model misuse in text-to-image generation.
- Project page
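- The MoDE idea can be illustrated with a simplified routing sketch: the denoising trajectory is split into contiguous blocks of timesteps, and each block is handled by its own denoising network. The class below is an illustrative stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MixtureOfDenoisingExperts(nn.Module):
    """Simplified MoDE routing: expert i handles a contiguous block of timesteps.
    `make_expert` should return a denoiser with signature f(x_t, t, cond) -> eps_hat."""
    def __init__(self, make_expert, num_experts, num_train_timesteps=1000):
        super().__init__()
        self.experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.block = num_train_timesteps // num_experts

    def forward(self, x_t, t, cond):
        # route each sample in the batch to the expert that owns its timestep block
        idx = torch.clamp(t // self.block, max=len(self.experts) - 1)
        out = torch.empty_like(x_t)
        for i in idx.unique().tolist():
            mask = idx == i
            out[mask] = self.experts[i](x_t[mask], t[mask], cond[mask])
        return out
```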
Imagen Video: High Definition Video Generation with Diffusion Models
- This paper by Ho et al. from Google Research, Brain Team, introduces Imagen Video, a text-conditional video generation system leveraging a cascade of video diffusion models. Imagen Video generates high-definition videos from text prompts using a base video generation model and a sequence of interleaved spatial and temporal super-resolution models.
- The core contributions and methodology of this work include the following technical details:
- Architecture and Components: Imagen Video utilizes a frozen T5 text encoder to process the text prompts, followed by a base video diffusion model and multiple spatial and temporal super-resolution (SSR and TSR) models. Specifically, the system comprises seven sub-models: one base video generation model, three SSR models, and three TSR models. This cascade structure allows the system to generate 1280 \(\times\) 768 resolution videos at 24 frames per second, with a total of 128 frames (approximately 5.3 seconds).
- Diffusion Models: The diffusion models in Imagen Video are based on continuous-time formulations, with a forward process defined as a Gaussian process. The training objective is to denoise the latent variables through a noise-prediction loss. The v-parameterization is employed (predicting a combination of the signal and the noise rather than the noise directly), which ensures numerical stability and avoids color-shifting artifacts.
- Text Conditioning and Cascading: Text conditioning is achieved by injecting contextual embeddings from the T5-XXL text encoder into all models, ensuring alignment between the generated video and the text prompt. The cascading approach involves generating a low-resolution video first, which is then progressively enhanced through spatial and temporal super-resolution models. This method allows for high-resolution outputs without overly complex individual models.
- The following figure from the paper shows the cascaded sampling pipeline starting from a text prompt input to generating a 5.3-second, 1280 \(\times\) 768 video at 24fps. “SSR” and “TSR” denote spatial and temporal super-resolution respectively, and videos are labeled as frames \(\times\) width \(\times\) height. In practice, the text embeddings are injected into all models, not just the base model.

- Implementation Details:
- v-parameterization: Used for numerical stability and to avoid artifacts in high-resolution video generation (a worked sketch is given at the end of this summary).
- Conditioning Augmentation: Gaussian noise augmentation is applied to the conditioning inputs during training to reduce domain gaps and facilitate parallel training of different models in the cascade.
- Joint Training on Images and Videos: The models are trained on a mix of video-text pairs and image-text pairs, treating individual images as single-frame videos. This approach allows the model to leverage larger and more diverse image-text datasets.
- Classifier-Free Guidance: This method enhances sample fidelity and ensures that the generated video closely follows the text prompt by adjusting the denoising prediction.
- Progressive Distillation: This technique is used to speed up the sampling process. It involves distilling a trained DDIM sampler into a model requiring fewer steps, thus significantly reducing computation time while maintaining sample quality.
- Experiments and Findings:
- The model shows high fidelity in video generation and can produce diverse content, including 3D object understanding and various artistic styles.
- Scaling the parameter count of the video U-Net leads to improved performance, indicating that video modeling benefits significantly from larger models.
- The v-parameterization outperforms ε-parameterization, especially at higher resolutions, due to faster convergence and reduced color inconsistencies.
- Distillation reduces sampling time by 18x, making the model more efficient without sacrificing perceptual quality.
- Conclusion: Imagen Video extends text-to-image diffusion techniques to video generation, achieving high-quality, temporally consistent videos. The integration of various advanced methodologies from image generation, such as v-parameterization, conditioning augmentation, and classifier-free guidance, demonstrates their effectiveness in the video domain. The work also highlights the potential for further improvements in video generation capabilities through continued research and development.
- Project page
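- For reference, the v-parameterization mentioned above can be written out as a small self-contained sketch, following \(v = \alpha_t \epsilon - \sigma_t x_0\) with \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) and \(\alpha_t^2 + \sigma_t^2 = 1\); shapes and values below are illustrative.

```python
import torch

def to_v(x0, eps, alpha_t, sigma_t):
    """v-parameterization target: v = alpha_t * eps - sigma_t * x0."""
    return alpha_t * eps - sigma_t * x0

def from_v(x_t, v_pred, alpha_t, sigma_t):
    """Recover the predicted clean sample and noise from a v prediction."""
    x0_pred = alpha_t * x_t - sigma_t * v_pred
    eps_pred = sigma_t * x_t + alpha_t * v_pred
    return x0_pred, eps_pred

# quick self-consistency check on random data
x0, eps = torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)
alpha_t, sigma_t = torch.tensor(0.8), torch.tensor(0.6)      # alpha^2 + sigma^2 = 1
x_t = alpha_t * x0 + sigma_t * eps
x0_hat, eps_hat = from_v(x_t, to_v(x0, eps, alpha_t, sigma_t), alpha_t, sigma_t)
assert torch.allclose(x0_hat, x0, atol=1e-5) and torch.allclose(eps_hat, eps, atol=1e-5)
```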
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- This paper by Dehghani et al. from Google DeepMind introduces NaViT (Native Resolution ViT), a vision transformer designed to process images of arbitrary resolutions and aspect ratios without resizing them to a fixed resolution, which is common but suboptimal.
- NaViT leverages sequence packing during training, a technique inspired by natural language processing where multiple examples are packed into a single sequence, allowing efficient training on variable length inputs. This is termed Patch n’ Pack.
- Architectural Changes: NaViT builds on the Vision Transformer (ViT) but introduces masked self-attention and masked pooling to prevent different examples from attending to each other. It also uses factorized and fractional positional embeddings to handle arbitrary resolutions and aspect ratios. These embeddings are decomposed into separate embeddings for x and y coordinates and summed together, allowing for easy extrapolation to unseen resolutions (a positional-embedding sketch is given at the end of this summary).
- Training Enhancements: NaViT employs continuous token dropping, varying the token dropping rate per image, and resolution sampling, allowing mixed-resolution training by sampling from a distribution of image sizes while preserving aspect ratios. This enhances throughput and exposes the model to high-resolution images during training, yielding substantial performance improvements over equivalent ViTs.
- Efficiency: NaViT demonstrates significant computational efficiency, processing five times more images during training than ViT within the same compute budget. The \(O(n^2)\) cost of attention, a concern when packing multiple images into longer sequences, diminishes with model scale, making the attention cost a smaller proportion of the overall computation.
- The following figure from the paper shows how packing enables variable-resolution images with preserved aspect ratio, reducing training time, improving performance and increasing flexibility. It highlights the aspects of the data preprocessing and modelling that need to be modified to support Patch n’ Pack. The position-wise operations in the network, such as MLPs, residual connections, and layer normalisations, do not need to be altered.

- Implementation: The authors implemented NaViT using a greedy packing approach to fix the final sequence lengths containing multiple examples. They addressed padding issues and example-level loss computation by modifying pooling heads to account for packing and using chunked contrastive loss to manage memory and time constraints.
- Performance: NaViT consistently outperforms ViT across various tasks, including image and video classification, object detection, and semantic segmentation. It shows improved results on robustness and fairness benchmarks, achieving better performance with lower computational costs and providing flexibility in handling different resolutions during inference.
- Evaluation: NaViT’s training and adaptation efficiency were evaluated through empirical studies on datasets like ImageNet, LVIS, WebLI, and ADE20k. The model demonstrated superior performance in terms of accuracy and computational efficiency, highlighting the benefits of preserving aspect ratios and using mixed-resolution training.
- NaViT represents a significant departure from the traditional convolutional neural network (CNN)-designed pipelines, offering a promising direction for Vision Transformers by enabling flexible and efficient processing of images at their native resolutions.
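- A simplified sketch of factorized, fractional positional embeddings follows; binning the normalized coordinates into a fixed number of learned embeddings is an illustrative simplification of the paper's scheme, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class FactorizedFractionalPosEmbed(nn.Module):
    """Each patch's (row, col) coordinate is normalized by its own image's patch-grid size,
    so arbitrary resolutions and aspect ratios map into [0, 1); separate learned x- and
    y-embeddings over fractional bins are then summed."""
    def __init__(self, dim, num_bins=64):
        super().__init__()
        self.num_bins = num_bins
        self.emb_x = nn.Embedding(num_bins, dim)
        self.emb_y = nn.Embedding(num_bins, dim)

    def forward(self, rows, cols, h_patches, w_patches):
        # rows/cols: integer patch coordinates per token; grid sizes may differ per image
        fy = (rows.float() / h_patches * self.num_bins).long().clamp(max=self.num_bins - 1)
        fx = (cols.float() / w_patches * self.num_bins).long().clamp(max=self.num_bins - 1)
        return self.emb_y(fy) + self.emb_x(fx)
```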
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- This paper by Podell et al. from Stability AI Applied Research details significant advancements in the field of text-to-image synthesis using latent diffusion models (LDMs).
- The paper introduces SDXL, a latent diffusion model that significantly improves upon previous versions of Stable Diffusion for text-to-image synthesis.
- SDXL incorporates a UNet architecture three times larger than its predecessors, primarily due to an increased number of attention blocks and a larger cross-attention context. This is achieved by using a second text encoder, significantly enhancing the model’s capabilities.
- Novel conditioning schemes are introduced, such as conditioning on the original image resolution and on cropping parameters. This conditioning is achieved through Fourier feature encoding and significantly improves the model’s performance and flexibility (a small sketch is given at the end of this summary).
- SDXL is trained on multiple aspect ratios, a notable departure from standard square image outputs. This training approach allows the model to better handle images with varied aspect ratios, reflecting real-world data more accurately.
- An improved autoencoder is used, enhancing the fidelity of generated images, particularly in high-frequency details.
- The paper also discusses a refinement model used as a post-hoc image-to-image technique to further improve the visual quality of samples generated by SDXL. SDXL demonstrates superior performance compared to earlier versions of Stable Diffusion and rivals state-of-the-art black-box image generators. The model’s performance was validated through user studies and quantitative metrics.
- The figure below from the paper illustrates: (Left) Comparing user preferences between SDXL and Stable Diffusion 1.5 & 2.1. While SDXL already clearly outperforms Stable Diffusion 1.5 & 2.1, adding the additional refinement stage boosts performance. (Right) Visualization of the two-stage pipeline: they generate initial latents of size 128 \(\times\) 128 using SDXL. Afterwards, they utilize a specialized high-resolution refinement model and apply SDEdit on the latents generated in the first step, using the same prompt. SDXL and the refinement model use the same autoencoder.

- The authors emphasize the open nature of SDXL, highlighting its potential to foster transparency in large model training and evaluation, which is crucial for responsible and ethical deployment of such technologies.
- The paper represents a significant step forward in generative modeling for high-resolution image synthesis, showcasing the potential of latent diffusion models in creating detailed and realistic images from textual descriptions.
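- The size- and crop-conditioning can be sketched as follows: each conditioning scalar (original height/width, crop offsets, target size) is embedded with a sinusoidal Fourier-feature encoding, and the embeddings are concatenated into a vector that is added to the timestep embedding. Dimensions and constants below are assumptions for illustration.

```python
import math
import torch

def fourier_features(values, dim=256, max_period=10000):
    """Sinusoidal (Fourier-feature) embedding of a batch of scalar conditioning values."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = values.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# one sample: (orig_h, orig_w, crop_top, crop_left, target_h, target_w)
micro = torch.tensor([[1024.0, 1024.0, 0.0, 0.0, 1024.0, 1024.0]])
micro_cond = torch.cat([fourier_features(micro[:, i]) for i in range(micro.shape[1])], dim=-1)
print(micro_cond.shape)  # torch.Size([1, 1536]); added to the timestep embedding
```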
Dreamix: Video Diffusion Models are General Video Editors
- Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied for image editing, very few works have done so for video editing.
- This paper by Molad et al. from Google Research and The Hebrew University of Jerusalem presents the first diffusion-based method able to perform text-based motion and appearance editing of general videos. Their approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt.
- The following figure from the paper shows the video editing use-case with Dreamix: frames from a video conditioned on the text prompt “A bear dancing and jumping to upbeat music, moving his whole body”. Dreamix transforms the eating monkey (top row) into a dancing bear, affecting appearance and motion (bottom row). It maintains fidelity to color, posture, object size and camera pose, resulting in a temporally consistent video.

- Since obtaining high fidelity to the original video requires retaining some of its high-resolution information, they add a preliminary stage of finetuning the model on the original video, significantly boosting fidelity.
- They propose to improve motion editability via a new, mixed objective that jointly finetunes with full temporal attention and with temporal attention masking.
- They further introduce a new framework for image animation. They first transform the image into a coarse video by simple image processing operations such as replication and perspective geometric projections, and then use their general video editor to animate it.
- As a further application, Dreamix can be used for subject-driven video generation. Extensive qualitative and numerical experiments showcase the remarkable editing ability of Dreamix and establish its superior performance compared to baseline methods.
- The following figure from the paper illustrates the inference process. Dreamix supports multiple applications by application-dependent pre-processing (left), converting the input content into a uniform video format. For image-to-video, the input image is duplicated and transformed using perspective transformations, synthesizing a coarse video with some camera motion. For subject-driven video generation, the input is omitted; finetuning alone takes care of the fidelity. This coarse video is then edited using their general “Dreamix Video Editor” (right): they first corrupt the video by downsampling followed by adding noise, and then apply the finetuned text-guided VDM, which upscales the video to the final spatio-temporal resolution.

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
- This paper by Blattmann et al. from Stability AI introduces Stable Video Diffusion (SVD), a latent video diffusion model designed for high-resolution text-to-video and image-to-video generation. They address the challenge of lacking a unified strategy for curating video data and propose a methodical curation process for training successful video LDMs, which includes three stages:
- Stage I: Text-to-image (or simply, image pretraining), i.e., a 2D text-to-image diffusion model.
- Stage II: video pretraining, which trains on large amounts of videos.
- Stage III: video finetuning, which refines the model on a small subset of high-quality videos at higher resolution.
- In the initial stage, leveraging insights from large-scale image model training, the authors curated an extensive pretraining dataset named LVD, consisting of approximately 580 million annotated video clip pairs. This dataset underwent rigorous processing, including cut detection and annotation via image captioning and optical flow analysis, to filter out low-quality content. Specifically, to avoid samples that would be expected to degrade the final video model, such as clips with little motion, excessive text presence, or generally low aesthetic value, they additionally annotate the dataset with dense optical flow calculated at 2 FPS and filter out static scenes by removing any videos whose average optical flow magnitude falls below a threshold (a rough sketch of this filter is given at the end of this summary).
- The following figure from the paper shows that the initial dataset contains many static scenes and cuts which hurts training of generative video models. Left: Average number of clips per video before and after our processing, revealing that our pipeline detects lots of additional cuts. Right: The distribution of average optical flow score for one of these subsets before processing, which contains many static clips.

- The paper outlines the importance of each training stage and demonstrates that systematic data curation significantly boosts model performance. Notably, they emphasize the necessity of pretraining on a well-curated dataset for generating high-quality videos, showing that models pretrained in this manner outperform others when finetuned on smaller, high-quality datasets.
- Leveraging the curated dataset, the authors trained a base model that provides a comprehensive motion representation. This base model was further finetuned for several applications, including text-to-video and image-to-video generation, demonstrating state-of-the-art performance. The model also supports controlled camera motion through LoRA modules and has been shown to serve as a robust multi-view 3D prior, capable of generating multiple consistent views of an object in a feedforward manner.
- The SVD model stands out for its ability to efficiently generate high-fidelity videos from both text and images, offering a substantial advancement over existing methods in terms of visual quality and consistency. The authors released the code and model weights, contributing a valuable resource to the research community for further exploration and development in video generation technology.
- Blog; Code; Hugging Face
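- The static-scene filter described above can be approximated with a short script: sample frames at roughly 2 FPS, compute dense optical flow between consecutive sampled frames, and discard clips whose average flow magnitude falls below a threshold. The OpenCV-based sketch below uses assumed parameters and is not the authors' pipeline.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path, sample_fps=2):
    """Average dense optical-flow magnitude over frames sampled at ~sample_fps."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(native_fps // sample_fps), 1)
    mags, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                mags.append(np.linalg.norm(flow, axis=-1).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(mags)) if mags else 0.0

# keep_clip = mean_flow_magnitude("clip.mp4") > FLOW_THRESHOLD  # threshold tuned per dataset
```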
Fine-tuning Diffusion Models
- Per Using LoRA for Efficient Stable Diffusion Fine-Tuning, Low-Rank Adaptation (LoRA), a technique originally introduced by Microsoft researchers for efficient fine-tuning of large language models, can also be applied to diffusion models (a minimal sketch is given at the end of this list).
- LoRA involves freezing pre-trained model weights and adding trainable layers to reduce the number of parameters and GPU memory requirements. It has been applied to Stable Diffusion fine-tuning, particularly in cross-attention layers.
- The technique enables quicker and less computationally intensive training, resulting in much smaller trained weights. The article also covers the use of LoRA in diffusers for Dreambooth and full fine-tuning methods, highlighting the reduced training time and lower computational requirements.
- Additionally, it introduces methods like Textual Inversion and Pivotal Tuning, which are complementary to LoRA. The page includes code snippets for using LoRA in Stable Diffusion fine-tuning and Dreambooth training.
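- The core of LoRA can be captured in a few lines of PyTorch: freeze the pre-trained projection and learn a low-rank update on top of it. The wrapper below is a minimal sketch of the idea (in Stable Diffusion it would typically wrap the cross-attention q/k/v/out projections), not the diffusers implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen and
    A, B low-rank trainable matrices; B is zero-initialized so training starts
    from the original model's behavior."""
    def __init__(self, base: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pre-trained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=1.0 / r)
        nn.init.zeros_(self.lora_B.weight)
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# e.g., wrap one cross-attention projection; only the small A/B matrices are trained
proj = LoRALinear(nn.Linear(320, 320))
print(sum(p.numel() for p in proj.parameters() if p.requires_grad))  # 2 * 4 * 320 = 2560
```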
Diffusion Model Alignment
Diffusion Model Alignment Using Direct Preference Optimization
- This paper by Wallace et al. from Salesforce AI and Stanford University introduces Diffusion-DPO, which adapts Direct Preference Optimization (DPO) to align text-to-image diffusion models with human preferences using the Pick-a-Pic pairwise preference dataset; see the detailed summary above under Recent Papers.
Further Reading
The Illustrated Stable Diffusion
- Jay Alammar’s (of The Illustrated Transformer fame) article explaining Stable Diffusion.
Understanding Diffusion Models: A Unified Perspective
- This tutorial paper by Calvin Luo from Google Brain goes from the basics of ELBO, VAE, and hierarchical VAE to diffusion models.
The Annotated Diffusion Model
- This blog post by Hugging Face takes a deeper look at Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models, or simply autoencoders), with which researchers have achieved remarkable generative results. It goes over the original DDPM paper (Ho et al., 2020), implementing it step-by-step in PyTorch, based on Phil Wang’s implementation, which itself is based on the original TensorFlow implementation.
Lilian Weng: What are Diffusion Models?
- This tutorial paper by Lilian Weng from OpenAI covers the math behind diffusion models in detail.
Stable Diffusion - What, Why, How?
- This YouTube video by Edan Meyer explains how Stable Diffusion works at a high level, briefly talks about how it is different from other Diffusion-based models, compares it to DALL-E 2, and digs into the code.
How does Stable Diffusion work? – Latent Diffusion Models Explained
- This YouTube video by Letitia covers diffusion models, injecting text to generate images (conditional generation), and stable diffusion as a latent diffusion model.
Diffusion Explainer
- Diffusion Explainer, an interactive web application, is designed to visually demonstrate the process of transforming a text prompt into high-resolution images within seconds.
- The app offers the following features for exploration:
- Text representation generation: Observe how your text prompt is tokenized and converted into numerical vectors, which guide the creation of images.
- Image representation refinement: Witness the transformation of random noise into a coherent image through successive steps.
- Image upscaling: Discover how the final image representation is enhanced into high-resolution output based on your input.
- With its interactive controls, you can modify prompts, adjust seeds, and tweak guidance scales, allowing you to see how each factor influences the final image. This tool is ideal for those seeking a deeper understanding of diffusion models and text-to-image generation technologies.
Jupyter notebook on the theoretical and implementation aspects of Score-based Generative Models (SGMs)
- Great technical explanation and implementation (along with a JAX implementation) of score-based generative models (SGMs), also called diffusion models.
References
- Diffusion models Vs GANs: Which one to choose for Image Synthesis
- Diffusion models
- Introduction to Diffusion Models for Machine Learning
- What are Diffusion Models?
- Denoising Diffusion Probabilistic Models by Ho et al.
- Lil’Log: Diffusion Models
- Assembly AI: Diffusion Models for Machine Learning Introduction
- Google ML Crashcourse on GANs
- Wikipedia: Markov Chain
- AI Summer: Latent Variable Models
- Diffusion Models Beat GANs on Image Synthesis
- Jay Alammar’s Stable Diffusion Twitter thread
- Hugging Face Stable Diffusion
- Towards Data Science: DALL-E 2 Explained
- Hugging Face’s Training with Diffusers notebook
- Hugging Face Diffusers Intro notebook
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledDiffusionModels,
title = {Diffusion Models},
author = {Chadha, Aman and Jain, Vinija},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}
