Generative Models

  • Generative models are a type of Unsupervised learning.

  • Supervised vs Unsupervised Learning:

    | | Supervised Learning | Unsupervised Learning |
    |---|---|---|
    | Data structure | Data: (x, y), where x is data and y is label | Data: x. Just data, no labels! |
    | Data price | Training data is expensive in a lot of cases. | Training data is cheap! |
    | Goal | Learn a function to map x -> y | Learn some underlying hidden structure of the data |
    | Examples | Classification, regression, object detection, semantic segmentation, image captioning | Clustering, dimensionality reduction, feature learning, density estimation |
  • Autoencoders are a Feature learning technique.


    • It contains an encoder and a decoder. The encoder downsamples the image while the decoder upsamples the features.
    • The loss used is L2 loss.
  • Density estimation is where we want to learn/estimate the underlying distribution of the data!

  • Compared with supervised learning, unsupervised learning still has a lot of open research problems!

Generative Models

  • Given training data, generate new samples from the same distribution.
  • Addresses density estimation, a core problem in unsupervised learning.
  • We have different ways to do this:
    • Explicit density estimation: explicitly define and solve for the model distribution p_model(x).
    • Implicit density estimation: learn a model that can sample from p_model(x) without explicitly defining it.
  • Why Generative Models?
    • Realistic samples for artwork, super-resolution, colorization, etc
    • Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)
    • Training generative models can also enable inference of latent representations that can be useful as general features
  • Taxonomy of Generative Models: ![](assets/generative-models/52.png)
  • In this lecture we will discuss PixelRNN/CNN, Variational Autoencoders, and GANs, as they are the popular models in research now.
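
To make "explicit density estimation" and "generate new samples from the same distribution" concrete, here is a minimal sketch on 1-D toy data. It assumes the data really is Gaussian, which is far simpler than any image distribution; the function names `fit_gaussian` and `log_density` and the toy parameters are illustrative, not from the lecture.

```python
import math
import random
import statistics

def fit_gaussian(data):
    """Maximum-likelihood fit of a 1-D Gaussian: sample mean and population std."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma

def log_density(x, mu, sigma):
    """log p_model(x) for the fitted Gaussian (the explicit density)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

rng = random.Random(0)

# Toy "training data": unlabeled samples x from an unknown distribution.
train = [rng.gauss(3.0, 0.5) for _ in range(5000)]

# Density estimation: recover the parameters of p_model from the data alone.
mu, sigma = fit_gaussian(train)

# Generation: draw new samples from the learned distribution.
new_samples = [rng.gauss(mu, sigma) for _ in range(5)]
```

Because the density is explicit, we can both score a point with `log_density` and sample from the model. Implicit models such as GANs only support the sampling half.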

PixelRNN and PixelCNN

  • In a fully visible belief network we use the chain rule to decompose the likelihood of an image x into a product of 1-D distributions:
    • p(x) = prod(p(x[i] | x[1], x[2], ..., x[i-1])) for i = 1..n
    • Here p(x) is the likelihood of image x, and p(x[i] | x[1], ..., x[i-1]) is the probability of the i'th pixel value given all previous pixels.
  • To train the model we maximize the likelihood of the training data. The distribution over pixel values is very complex, so we express it with a neural network.
  • We also need to define an ordering of the pixels, so that "previous pixels" is well defined.
  • PixelRNN
    • Introduced by [van den Oord et al. 2016]
    • Dependency on previous pixels modeled using an RNN (LSTM)
    • Generate image pixels starting from corner
    • Drawback: sequential generation is slow, because you have to generate the image pixel by pixel!
  • PixelCNN
    • Also introduced by [van den Oord et al. 2016]
    • Still generate image pixels starting from corner.
    • Dependency on previous pixels now modeled using a CNN over context region
    • Training is faster than PixelRNN (can parallelize convolutions since context region values known from training images)
    • Generation must still proceed sequentially, so it is still slow.
  • There are some tricks to improve PixelRNN & PixelCNN.
  • PixelRNN and PixelCNN can generate good samples and are still an active area of research.
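
The chain-rule factorization and the pixel-by-pixel generation described in this section can be sketched on a toy example. This is only an illustration: `conditional_prob_on` is a hypothetical hand-written rule standing in for the RNN or CNN, the "image" is a short list of binary pixels, and the ordering is simply left to right.

```python
import math
import random

def conditional_prob_on(previous):
    """Hypothetical stand-in for the network: P(x[i] = 1 | previous pixels).

    A real PixelRNN/PixelCNN would compute this from all previous pixels;
    here it only looks at the most recent one.
    """
    if not previous:
        return 0.5
    return 0.9 if previous[-1] == 1 else 0.1

def log_likelihood(pixels):
    """log p(x) = sum over i of log p(x[i] | x[1], ..., x[i-1])."""
    total = 0.0
    for i, value in enumerate(pixels):
        p_on = conditional_prob_on(pixels[:i])
        total += math.log(p_on if value == 1 else 1.0 - p_on)
    return total

def sample(num_pixels, rng):
    """Generate one pixel at a time; each draw conditions on the pixels so far."""
    pixels = []
    for _ in range(num_pixels):
        p_on = conditional_prob_on(pixels)
        pixels.append(1 if rng.random() < p_on else 0)
    return pixels

rng = random.Random(0)
image = sample(8, rng)
score = log_likelihood(image)

# Sanity check: the probabilities of all 3-pixel images sum to 1.
total_prob = sum(
    math.exp(log_likelihood([a, b, c]))
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
```

Note the asymmetry the section points out: in `log_likelihood` every conditional depends only on known training pixels, so the terms could be computed in parallel, while `sample` cannot start pixel i before pixel i-1 exists.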


Autoencoders

  • An unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
  • Consists of an encoder and a decoder.
  • The encoder:
    • Converts the input x to the features z. z should be smaller than x so that it keeps only the important information from the input. We can call this dimensionality reduction.
    • The encoder can be made with:
      • Linear or non-linear layers (in the early days)
      • Deep fully connected NN (later)
      • ReLU CNN (what we currently use on images)
  • The decoder:
    • We want the decoder to map the features z back to an output that is similar to x, or ideally the same as x.
    • The decoder can be built with the same techniques as the encoder; currently it is usually a ReLU CNN.
  • The encoder uses conv layers while the decoder uses deconv (upsampling) layers, so the spatial size decreases and then increases again.
  • The loss function is the L2 loss between the input and its reconstruction:
    • L[i] = |x[i] - x'[i]|^2, where x' is the decoder's reconstruction of the input x
  • After training we throw away the decoder. The encoder now gives us the features we need.
  • We can use this encoder to build a supervised model.
    • The value of this is that the encoder has learned a good feature representation of the input.
    • A lot of the time we only have a small amount of labeled data for a problem. One way to tackle this is to train an autoencoder that learns how to extract features from images, and then train on your small dataset on top of that encoder.
  • The question is: can we generate data (images) from this autoencoder?
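
The encoder/decoder training loop from this section can be sketched with a deliberately tiny linear autoencoder. This is a sketch under strong assumptions: the inputs are 2-D points that lie on a line, the code z is a single number, both networks are single linear layers, and gradients are written out by hand instead of using a deep learning framework. All names are illustrative.

```python
def encode(x, w_enc):
    """Encoder: 2-D input -> 1-D feature z (dimensionality reduction)."""
    return w_enc[0] * x[0] + w_enc[1] * x[1]

def decode(z, w_dec):
    """Decoder: 1-D feature z -> 2-D reconstruction."""
    return [w_dec[0] * z, w_dec[1] * z]

def l2_loss(x, x_rec):
    """L2 reconstruction loss between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

def mean_loss(data, w_enc, w_dec):
    return sum(l2_loss(x, decode(encode(x, w_enc), w_dec)) for x in data) / len(data)

# Unlabeled toy data: points (t, 2t), so one number is enough to describe each.
data = [[t / 10, 2 * t / 10] for t in range(-10, 11)]

w_enc, w_dec = [0.1, 0.1], [0.1, 0.1]
initial_loss = mean_loss(data, w_enc, w_dec)

lr = 0.01
for epoch in range(300):
    for x in data:
        z = encode(x, w_enc)
        x_rec = decode(z, w_dec)
        err = [x_rec[0] - x[0], x_rec[1] - x[1]]
        # Hand-derived gradients of the L2 loss.
        grad_z = 2 * (err[0] * w_dec[0] + err[1] * w_dec[1])
        grad_dec = [2 * err[0] * z, 2 * err[1] * z]
        grad_enc = [grad_z * x[0], grad_z * x[1]]
        w_dec = [w - lr * g for w, g in zip(w_dec, grad_dec)]
        w_enc = [w - lr * g for w, g in zip(w_enc, grad_enc)]

final_loss = mean_loss(data, w_enc, w_dec)

# After training, drop the decoder and keep the encoder as a feature extractor.
features = [encode(x, w_enc) for x in data]
```

The `features` list is what a downstream supervised model would consume. On images the same idea applies with conv layers in the encoder and upsampling layers in the decoder.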

Variational Autoencoders (VAE)

  • A probabilistic spin on autoencoders that lets us sample from the model to generate data!
  • We treat z, the feature vector produced by the encoder, as a latent variable from which the image x is generated.
  • We then choose the prior p(z) to be simple, e.g. Gaussian.
    • Reasonable for hidden attributes, e.g. pose or how much smile.
  • The conditional p(x|z) is complex (it generates an image), so we represent it with a neural network.
  • But the data likelihood is intractable, because we can't compute the integral of p(z)p(x|z)dz over all z in the following equation: ![](assets/generative-models/25.png)
  • Working through the derivation that gets around this integral, we arrive at the following result: ![](assets/generative-models/26.png)
  • Variational Autoencoders are an approach to generative models, but their samples are blurrier and lower quality compared to the state of the art (GANs).
  • Active areas of research:
    • More flexible approximations, e.g. richer approximate posterior instead of diagonal Gaussian
    • Incorporating structure in latent variables
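
Two pieces of the usual VAE training setup can be written down exactly when the prior is a standard Gaussian and the approximate posterior is a diagonal Gaussian: the KL term between them, and how to sample z. The sketch below assumes that setup. The reparameterized sampling (z = mu + sigma * eps) is a standard technique that is not spelled out in these notes, and the function names plus the `reconstruction_error` argument are illustrative placeholders rather than the lecture's code.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def negative_lower_bound(reconstruction_error, mu, log_var):
    """Loss to minimize: reconstruction term plus KL term (assumed form of the bound)."""
    return reconstruction_error + kl_to_standard_normal(mu, log_var)

rng = random.Random(0)

# Training time: the encoder would output mu and log_var for an input x.
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# Generation time: sample z from the prior p(z) and pass it to the decoder.
z_prior = [rng.gauss(0.0, 1.0) for _ in range(2)]
```

The KL term pulls the encoder's output toward the prior, which is what makes sampling `z_prior` and decoding it meaningful at generation time.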

Generative Adversarial Networks (GANs)

  • GANs don’t work with any explicit density function!

  • Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.

  • Yann LeCun, who oversees AI research at Facebook, has called GANs:

    • The coolest idea in deep learning in the last 20 years

  • Problem: Want to sample from complex, high-dimensional training distribution. No direct way to do this as we have discussed!

  • Solution: Sample from a simple distribution, e.g. random noise. Learn transformation to training distribution.

  • So we draw random noise from a simple distribution and feed it to a neural network, which we call the generator network, that should learn to transform the noise into a sample from the distribution we want.

  • Training GANs: Two-player game:

    • Generator network: try to fool the discriminator by generating real-looking images.
    • Discriminator network: try to distinguish between real and fake images.
  • If we are able to train the Discriminator well then we can train the generator to generate the right images.

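  • The GAN objective, formulated as a minimax game, is:

    • min_G max_D E_{x ~ p_data}[log D(x)] + E_{z ~ p(z)}[log(1 - D(G(z)))]
    • D(x) is the discriminator's estimate of the probability that x is real, and G(z) is the image generated from noise z.
  • Images produced by the generator are labeled 0 (fake) and real images are labeled 1.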

  • To train the network we will do:

    • Gradient ascent on the discriminator.
    • Gradient ascent on the generator, but with a different objective: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
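
The two training objectives can be sketched as plain functions of the discriminator's outputs. This is an illustration only: `d_real` and `d_fake` are hypothetical lists of probabilities D(x) and D(G(z)), no networks are trained, and the two gradient helpers just differentiate each generator objective with respect to p = D(G(z)).

```python
import math

def discriminator_objective(d_real, d_fake):
    """Maximized by the discriminator: mean log D(x) + mean log(1 - D(G(z)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_minimax_objective(d_fake):
    """Original minimax generator term, minimized: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def generator_alternative_objective(d_fake):
    """Alternative generator term, maximized: mean log D(G(z))."""
    return sum(math.log(p) for p in d_fake) / len(d_fake)

def grad_minimax(p):
    """d/dp of log(1 - p)."""
    return -1.0 / (1.0 - p)

def grad_alternative(p):
    """d/dp of log(p)."""
    return 1.0 / p

# Early in training the discriminator easily rejects fakes, so D(G(z)) is small.
p_early = 0.01
weak_signal = abs(grad_minimax(p_early))        # close to 1
strong_signal = abs(grad_alternative(p_early))  # 100
```

When the generator is still bad, the alternative objective gives a much larger learning signal than the minimax term, which is the usual motivation for switching the generator's loss.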
  • You can read the full algorithm with the equations here:


  • Aside: Jointly training two networks is challenging and can be unstable. Choosing objectives with better loss landscapes helps training, and is an active area of research.

  • Convolutional Architectures:

    • The generator is an upsampling network with fractionally-strided convolutions. The discriminator is a convolutional network.
    • Guidelines for stable deep Conv GANs:
      • Replace any pooling layers with strided convs (discriminator) and fractional-strided convs (generator).
      • Use batch norm in both networks.
      • Remove fully connected hidden layers for deeper architectures.
      • Use ReLU activation in the generator for all layers except the output, which uses Tanh.
      • Use leaky ReLU in the discriminator for all layers.
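
To see why strided convs downsample and fractionally-strided (transposed) convs upsample, it helps to work through the output-size arithmetic. The sketch below uses the standard size formulas for square inputs with no dilation or output padding. The kernel size 4, stride 2, padding 1 and the 4x4 starting size are assumed example values, not settings taken from these notes.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a strided convolution (discriminator side)."""
    return (size + 2 * pad - kernel) // stride + 1

def transposed_conv_out(size, kernel, stride, pad):
    """Spatial output size of a fractionally-strided convolution (generator side)."""
    return (size - 1) * stride - 2 * pad + kernel

# Generator: start from a small feature map and double the size at each layer.
generator_sizes = [4]
for _ in range(4):
    generator_sizes.append(transposed_conv_out(generator_sizes[-1], kernel=4, stride=2, pad=1))

# Discriminator: start from the image and halve the size at each layer.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(conv_out(discriminator_sizes[-1], kernel=4, stride=2, pad=1))
```

With these example values the generator goes 4, 8, 16, 32, 64 and the discriminator mirrors it back down, with no pooling layers involved.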
  • 2017 is the year of the GANs! The field has exploded and there are some really good results.

  • Applying GANs to all kinds of applications is also an active area of research.

  • The GAN zoo can be found here:

  • Tips and tricks for using GANs:

  • NIPS 2016 Tutorial GANs:


If you found our work useful, please cite it as:

  title   = {Generative Models},
  author  = {Chadha, Aman},
  journal = {Distilled Notes for Stanford CS231n: Convolutional Neural Networks for Visual Recognition},
  year    = {2020},
  note    = {\url{}}