CS230 • Generative Adversarial Networks
 Motivation
 The generator vs. discriminator game
 Training GANs
 Tips to train GANs
 Examples of GAN applications and nice results
 Generating super-resolution (SR) images
 Image-to-Image translation via CycleGANs
 References

Neural networks are widely used for prediction. What if they could also be used to generate new images, text, or even audio clips?

Imagine training a robotic arm to localize objects on a table (in order to grasp them). Collecting real data for this task is expensive: it requires positioning objects on a table, taking pictures, and labeling them with bounding boxes. Alternatively, taking screenshots in simulations allows you to virtually generate millions of labelled images. The downside is that a network trained on simulated data might not generalize to real data. Having a network that generates realistic homologues of simulated images would be a game changer. This is one example of an application of Generative Adversarial Networks (GANs).
 This topic will give you a thorough grounding in GANs and how to apply them to cutting-edge tasks.
Motivation

Are networks capable of generating images of cats they have never seen? Intuitively, they should be. If a cat vs. non-cat classifier generalizes to unseen data, it means that it understands the salient features of the data (i.e. what a cat is and isn’t) instead of overfitting the training data. Similarly, a generative model should be able to generate pictures of cats it has never seen because its complexity (~ number of parameters) doesn’t allow it to memorize the training set.

For instance, the following cats, cars and faces were generated by Karras et al. [1] using GANs. They do not exist in reality!
The generator vs. discriminator game

Although there exist various generative algorithms, this article will focus on the study of GANs.

A GAN [2] involves two neural networks. The first network is called the “generator” ($G$) and its goal is to generate realistic samples. The second network is a binary classifier called the “discriminator” ($D$), and its goal is to differentiate fake samples (label $0$) from real samples (label $1$).

These two networks play a game. $D$ alternatively receives real samples from a database and fake samples generated by $G$, and has to learn to differentiate them. At the same time, $G$ learns to fool $D$. The game ends when $G$ generates samples that are realistic enough to fool $D$. When training ends successfully, you can use $G$ to generate realistic samples. Here’s an illustration of the GAN game.
 It is common to choose a random code $z$ as input to $G$, such that $x = G(z)$ is a generated image. Later, you will learn alternative designs for $z$ allowing you to choose the features of $x$.
Training GANs
 To train a GAN, you need to optimize two cost functions simultaneously.

Discriminator cost $J^{(D)}$: $D$ is a binary classifier aiming to map inputs $x=G(z)$ to $y = 0$ and inputs $x=x_{\text{real}}$ to $y = 1$. Thus, the logistic loss (binary cross-entropy loss) is appropriate:
\[J^{(D)} = -\frac{1}{m_{\text{real}}}\sum_{i=1}^{m_{\text{real}}} y_{\text{real}}^{(i)}\log (D(x^{(i)})) - \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} (1-y_{\text{gen}}^{(i)})\log (1-D(G(z^{(i)})))\] where $m_{\text{real}}$ (resp. $m_{\text{gen}}$) is the number of real (resp. generated) examples in a batch, $y_{\text{gen}} = 0$, and $y_{\text{real}} = 1$.
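As a minimal sketch, this cost can be computed from the discriminator's predicted probabilities alone (the probability values below are hypothetical, standing in for the outputs of a real network):

```python
import numpy as np

def discriminator_cost(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy cost J^(D): real samples have target 1, fakes target 0."""
    real_term = -np.mean(np.log(d_real + eps))        # -1/m_real * sum log D(x)
    fake_term = -np.mean(np.log(1.0 - d_fake + eps))  # -1/m_gen * sum log(1 - D(G(z)))
    return real_term + fake_term

# A confident, correct discriminator incurs a low cost:
good = discriminator_cost(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
# A confused discriminator (outputs 0.5 everywhere) incurs cost 2*log(2) ≈ 1.386:
confused = discriminator_cost(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Note that the cost is minimized (toward $0$) when $D$ assigns probability $1$ to real samples and $0$ to generated ones.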
 Generator cost $J^{(G)}$: Since success is measured by the ability of $G$ to fool $D$, $J^{(G)}$ should be the opposite of $J^{(D)}$:
\[J^{(G)} = \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (1-D(G(z^{(i)})))\]
Note: the first term of $J^{(D)}$ does not appear in $J^{(G)}$ because it is independent of $z$ and thus contributes no gradient to $G$ during optimization.

You can run an optimization algorithm such as Adam [3] simultaneously using two minibatches of real and fake samples. You can think of it as a two-step process:
 Forward propagate a minibatch of real samples and compute $J^{(D)}$. Then, backpropagate to compute $\frac{\partial J^{(D)}}{\partial W_D}$, where $W_D$ denotes the parameters of $D$.
 Forward propagate a minibatch of samples freshly generated by $G$, and compute $J^{(D)}$ and $J^{(G)}$. Then backpropagate to compute $\frac{\partial J^{(D)}}{\partial W_D}$ and $\frac{\partial J^{(G)}}{\partial W_G}$, where $W_G$ denotes the parameters of $G$.
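The two-step procedure can be sketched on a toy 1-D problem. This is a minimal sketch, not a realistic implementation: it assumes a hypothetical shift-only generator $G(z) = z + w_g$, a logistic-regression discriminator, the non-saturating generator cost $-\log D(G(z))$ (a common variant discussed below), and finite-difference gradients in place of backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
w_g, w_d, b_d = 0.0, 0.0, 0.0          # G(z) = z + w_g; D(x) = sigmoid(w_d*x + b_d)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
eps = 1e-12

def J_D(w_d, b_d, w_g, x_real, z):     # discriminator cost on real + fake batches
    d_real = sigmoid(w_d * x_real + b_d)
    d_fake = sigmoid(w_d * (z + w_g) + b_d)
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))

def J_G(w_d, b_d, w_g, z):             # non-saturating generator cost
    return -np.mean(np.log(sigmoid(w_d * (z + w_g) + b_d) + eps))

def grad(f, i, args, h=1e-5):          # finite-difference gradient in argument i
    bumped = list(args)
    bumped[i] += h
    return (f(*bumped) - f(*args)) / h

lr = 0.2
for _ in range(500):
    x_real = rng.normal(4.0, 1.0, size=64)   # real data: N(4, 1)
    z = rng.normal(0.0, 1.0, size=64)        # latent codes: N(0, 1)
    # Step 1: update D on a real and a fake minibatch.
    w_d -= lr * grad(J_D, 0, (w_d, b_d, w_g, x_real, z))
    b_d -= lr * grad(J_D, 1, (w_d, b_d, w_g, x_real, z))
    # Step 2: update G to fool the current D (reusing the same z batch).
    w_g -= lr * grad(J_G, 2, (w_d, b_d, w_g, z))
```

After training, $G(z) = z + w_g$ produces samples whose mean is close to the real data mean of $4$: the generator's distribution has moved toward the data distribution.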

If training is successful, the distribution of fake samples coming from $G$ should match the true distribution of data.

In mathematical terms, convergence is guaranteed under idealized assumptions (e.g. unlimited model capacity and updates performed directly in function space): Goodfellow et al. [2] show that the equilibrium of the game is reached when the generator’s distribution equals the data distribution, at which point $D$ outputs $\frac{1}{2}$ everywhere.

This [repository](https://github.com/eriklindernoren/Keras-GAN/blob/master/gan/gan.py) is a nice code example of how to train a GAN.
Tips to train GANs
 In practice, training GANs is hard and requires subtle tricks.
Trick 1: Using a non-saturating cost

GAN training is an iterative game between $D$ and $G$. If $G(z)$ isn’t realistic, $D$ doesn’t need to improve. Conversely, if $D$ is easy to fool, $G$ doesn’t need to improve the realism of the generated samples. Consequently, $D$ and $G$ need to grow together in quality.

Early in the training, $G$ usually generates random noise and $D$ easily differentiates $x=G(z)$ from real samples. This imbalance of power hinders training. To understand why, let’s plot $J^{(G)}$ against $D(G(z))$:

On the graph above, consider the x-axis to be $D(G(z))$ “in expectation” over a given batch of examples.

The gradient $\frac{\partial J^{(G)}}{\partial D(G(z))}$, represented by the slope of the plotted function, is small early in the training (i.e. when $D(G(z))$ is close to $0$). As a consequence, the backpropagated gradient $\frac{\partial J^{(G)}}{\partial W_G}$ is also small. Fortunately, the following simple mathematical trick solves the problem:
 As you can see on the graph below, minimizing $-\frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (D(G(z^{(i)})))$ instead of $\frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (1-D(G(z^{(i)})))$ ensures a stronger gradient signal early in the training. Because a successful training ends when $D(G(z)) \approx 0.5$ (i.e. $D$ is unable to differentiate a generated sample from a real sample), the low slope for $D(G(z)) \approx 1$ doesn’t slow training.
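The slopes of the two costs with respect to $D(G(z))$ can be checked numerically. The sketch below differentiates both losses by hand and evaluates them at a point where $D$ confidently rejects the fakes:

```python
def saturating_slope(p):
    """d/dp of log(1 - p), the original generator cost as a function of p = D(G(z))."""
    return -1.0 / (1.0 - p)

def non_saturating_slope(p):
    """d/dp of -log(p), the non-saturating generator cost."""
    return -1.0 / p

p = 0.01  # early in training: D easily spots fakes, so D(G(z)) is close to 0
weak_signal = abs(saturating_slope(p))        # ≈ 1.01: barely any gradient for G
strong_signal = abs(non_saturating_slope(p))  # ≈ 100: a much stronger gradient for G
```

Near $D(G(z)) \approx 1$ the relationship reverses, which is harmless because training ends around $D(G(z)) \approx 0.5$.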

This approach is known as the non-saturating trick and ensures that $G$ and $D$ receive a stronger gradient when they are “weaker” than their counterpart.

You can find a more detailed explanation of the non-saturating cost trick in Goodfellow’s NIPS tutorial (2016).
Trick 2: Keeping $D$ up to date
 According to the non-saturating graph above, $G$ doesn’t receive a strong gradient when it easily fools $D$ (i.e. when, in expectation, $D(G(z)) \approx 1$). Thus, it is strategic to ensure $D$ is “stronger” when updating $G$. This is achievable by updating $D$ $k$ times for every update of $G$, with $k>1$.
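The schedule can be sketched as follows ($k$ is a tuning choice, and the counters stand in for real gradient steps on $D$ and $G$):

```python
k = 3                        # discriminator updates per generator update
d_updates = g_updates = 0

for step in range(300):
    d_updates += 1           # D is updated at every iteration
    if step % k == k - 1:
        g_updates += 1       # G is updated only once every k iterations
# After the loop, D has received k times as many updates as G.
```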
 There exist many other tricks for successfully training GANs; this repository contains helpful supplements to our class.
Examples of GAN applications and nice results
 GANs have been applied to a myriad of tasks, including compressing and reconstructing images for efficient memory storage, generating super-resolution images [4] [5], preserving privacy for clinical data sharing [6], generating images from text descriptions [7], converting map images into corresponding satellite images, street-scene translation [8], mass customization of medical products such as dental crowns [9], cracking enciphered language data [10], and many more. Let’s delve into some of them.
Operations on latent codes
 The latent space of random noise inputs, from which $G$ maps to the sample space, usually carries semantic meaning. For instance, Radford et al. [11] show that feeding $G$ the latent code $z = z_{\text{a man with glasses}} - z_{\text{a man without glasses}} + z_{\text{a woman without glasses}}$ leads to a generated image $G(z)$ representing “a woman wearing glasses.”
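This vector arithmetic can be illustrated on hypothetical latent codes. The sketch assumes attributes behave as roughly additive directions in latent space (as reported by Radford et al.); the vectors and attribute directions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 100  # a typical latent-code dimensionality

# Hypothetical additive attribute directions in latent space.
z_base = rng.normal(size=dim)          # stands in for "a woman without glasses"
gender_dir = rng.normal(size=dim)      # direction toward "man"
glasses_dir = rng.normal(size=dim)     # direction toward "wearing glasses"

z_man_with_glasses = z_base + gender_dir + glasses_dir
z_man_without_glasses = z_base + gender_dir
z_woman_without_glasses = z_base

# man with glasses - man without glasses + woman without glasses:
z_result = z_man_with_glasses - z_man_without_glasses + z_woman_without_glasses
# Under the additive model, z_result = z_base + glasses_dir,
# i.e. the code for "a woman wearing glasses".
```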
Generating superresolution (SR) images

There is promising research in GANs on recovering high-resolution (HR) images from low-resolution (LR) images. Specifically, recovering the finer texture details is challenging when super-resolving at large upscaling factors.

Practical applications of SR include identifying fast-moving vehicles, reading number plates [12], biomedical applications such as accurate measurement and visualization of structures in living tissues, and biometric recognition, to name a few.

The following picture compares different SR algorithms’ outcomes. The ground truth is the leftmost picture.
 Here’s a picture generated by a CS230 award-winning student project [13] in Fall 2018. From left to right: 32x32 LR input, SRPGGAN output (256x256), and HR ground truth (256x256).
Image-to-Image translation via CycleGANs
 Translating images between different domains is an important application of GANs. For example, CycleGANs [14] translate horses into zebras, apples into oranges, summer scenes into winter scenes, and vice versa.

They consist of two generator-discriminator pairs: $(G_1,D_1)$ and $(G_2,D_2)$.

The goal of the $(G_1, D_1)$ game is to turn domain-1 samples into domain-2 samples. In contrast, the goal of the $(G_2, D_2)$ game is to turn domain-2 samples into domain-1 samples.

A necessary cycle constraint imposes that the composition of $G_1$ and $G_2$ be the identity function, ensuring that the unchanging features (the non-horse/zebra elements) are preserved during translation.

The training of this four-player game is summarized in five cost functions. Below, $h^{(i)}$/$H^{(i)}$ denote domain-1 samples (e.g. horses) and $z^{(i)}$/$Z^{(i)}$ denote domain-2 samples (e.g. zebras):

$D_1$’s cost: $J^{(D_1)} = -\frac{1}{m_{\text{real}}}\sum_{i=1}^{m_{\text{real}}} \log (D_1(z^{(i)})) - \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (1-D_1(G_1(H^{(i)})))$

$G_1$’s cost: $J^{(G_1)} = -\frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (D_1(G_1(H^{(i)})))$

$D_2$’s cost: $J^{(D_2)} = -\frac{1}{m_{\text{real}}}\sum_{i=1}^{m_{\text{real}}} \log (D_2(h^{(i)})) - \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (1-D_2(G_2(Z^{(i)})))$

$G_2$’s cost: $J^{(G_2)} = -\frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \log (D_2(G_2(Z^{(i)})))$

Cycle-consistency cost: $J^{\text{cycle}} = \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \Vert G_2(G_1(H^{(i)})) - H^{(i)} \Vert_1 + \frac{1}{m_{\text{gen}}}\sum_{i=1}^{m_{\text{gen}}} \Vert G_1(G_2(Z^{(i)})) - Z^{(i)} \Vert_1$
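The cycle-consistency term can be sketched numerically. This is a toy illustration, not real CycleGAN generators: the two "generators" below are hypothetical scalar maps chosen to be exact inverses, so their cycle cost vanishes:

```python
import numpy as np

def cycle_cost(G1, G2, H, Z):
    """L1 cycle-consistency cost: H -> G1 -> G2 should return H, and Z -> G2 -> G1 should return Z."""
    forward = np.mean(np.sum(np.abs(G2(G1(H)) - H), axis=1))
    backward = np.mean(np.sum(np.abs(G1(G2(Z)) - Z), axis=1))
    return forward + backward

# Toy "generators" that are exact inverses of each other:
G1 = lambda x: 2.0 * x + 1.0
G2 = lambda y: (y - 1.0) / 2.0

H = np.random.default_rng(0).normal(size=(4, 3))  # toy domain-1 batch ("horses")
Z = np.random.default_rng(1).normal(size=(4, 3))  # toy domain-2 batch ("zebras")

perfect = cycle_cost(G1, G2, H, Z)                 # ≈ 0: translations are invertible
broken = cycle_cost(G1, lambda y: y, H, Z)         # > 0: G2 no longer undoes G1
```

During training this term is added to the four adversarial costs, penalizing generator pairs whose translations destroy information.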
References
[1] Tero Karras, Samuli Laine, Timo Aila: A Style-Based Generator Architecture for Generative Adversarial Networks (2019)
[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio: Generative Adversarial Networks (2014)
[3] Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization (2014)
[4] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (2016)
[5] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang: ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks (2018)
[6] Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, Casey S. Greene: Privacy-preserving generative deep neural networks support clinical data sharing (2017)
[7] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee: Generative Adversarial Text to Image Synthesis (2016)
[8] Ming-Yu Liu, Thomas Breuel, Jan Kautz: Unsupervised Image-to-Image Translation Networks (2017)
[9] Jyh-Jing Hwang, Sergei Azernikov, Alexei A. Efros, Stella X. Yu: Learning Beyond Human Expertise with Generative Models for Dental Restorations (2018)
[10] Aidan N. Gomez, Sicong Huang, Ivan Zhang, Bryan M. Li, Muhammad Osama, Lukasz Kaiser: Unsupervised Cipher Cracking Using Discrete GANs (2018)
[11] Alec Radford, Luke Metz, Soumith Chintala: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (2015)
[12] Yuan Jie, Du Sidan, Zhu Xiang: Fast Super-resolution for License Plate Image Reconstruction (2008)
[13] Yujie Shu: Human Portrait Super Resolution Using GANs (2018)
[14] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (2017)