With generative modeling, we aim to learn how to generate new samples from the same distribution as the given training data. Specifically, there are two major objectives:
- Learn $p_{\text{model}}(x)$ that approximates the true data distribution $p_{\text{data}}(x)$
- Sample new $x$ from $p_{\text{model}}(x)$
The former can be framed as learning how likely it is that a given sample was drawn from the true data distribution; the latter means the model should be able to produce new samples that are similar to, but not exactly the same as, the training samples. One way to judge whether the model has learned the correct underlying representation of the training data distribution is the quality of the new samples produced by the trained model.
These objectives can be formulated as density estimation problems. There are two different approaches:
- Explicit density estimation: explicitly define and solve for $p_{\text{model}}(x)$
- Implicit density estimation: learn a model that can sample from $p_{\text{model}}(x)$ without explicitly defining it
The explicit approach can be challenging because it is generally difficult to write down a likelihood function for images, which live in a high-dimensional space. The implicit approach may be preferable in situations where the only goal is to generate new samples. In this case, instead of finding a specific expression for the density function, we can simply train the model to sample directly from the data distribution without going through the process of explicit modeling.
Generative models are widely used in various computer vision tasks. For instance, they are used in super-resolution applications in which the model fills in the details of the low resolution inputs and generates higher resolution images. They are also used for colorization in which greyscale images get converted to color images.
PixelRNN and PixelCNN [van den Oord et al., 2016] are examples of fully visible belief networks (FVBNs), in which the data likelihood $p(x) = p(x_1, x_2, \cdots, x_n)$ is factorized by the chain rule into a product of one-dimensional conditional distributions:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \cdots, x_{i-1})$$

where each distribution in the product gives the probability of pixel $x_i$ given all the pixels that precede it in a fixed ordering.
To train the model, we could try to maximize the defined likelihood of the training data.
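Concretely, writing $p_{\theta}$ for the model distribution with parameters $\theta$ and $x^{(1)}, \cdots, x^{(N)}$ for the $N$ training examples (where $x_j^{(i)}$ denotes the $j$-th pixel of the $i$-th example), this objective is

$$\theta^{\ast} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}\big(x^{(i)}\big) = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{n} \log p_{\theta}\big(x_j^{(i)} \mid x_1^{(i)}, \cdots, x_{j-1}^{(i)}\big)$$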
The problem with the above naive approach is that the conditional distributions can be extremely complex. To mitigate the difficulty, PixelRNN instead expresses the conditional distribution of each pixel as a sequence modeling problem. It uses RNNs (more specifically, LSTMs) to model the conditional distributions and hence the joint likelihood function defined above.
Another thing that the PixelRNN model does slightly differently is that it defines the pixel ordering diagonally. This allows some degree of parallelization, which makes training and generation somewhat faster. With this ordering, the generation process starts from the top-left corner and then makes its way down and to the right until the entire image is produced.
Because the generation process, even with this diagonal ordering, is still largely sequential, it is expensive to train such a model. To achieve further parallelization, instead of taking all previous pixels into consideration, we can model dependencies only on pixels in a context region. This gives rise to the PixelCNN model, in which the context region is defined by a masked convolution. The receptive field of a masked convolution is an incomplete square around a central pixel (darker blue squares). This ensures that each pixel only depends on already generated pixels in the region. The paper shows that with enough masked convolutional layers, the effective receptive field matches that of a model that directly conditions on all previous pixels (all blue squares), like PixelRNN.
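As a rough sketch of the idea (not the exact architecture from the paper), the causal masking can be implemented by zeroing out kernel weights at and after the center position before each convolution. The PyTorch layer below handles only the spatial mask and ignores the paper's channel-ordering details, so treat it as an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is masked so each output pixel only sees
    pixels above it, and to its left in the same row.

    mask_type 'A' also hides the center pixel (used for the first layer);
    mask_type 'B' allows the center pixel (used for subsequent layers).
    """
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        # Zero out everything strictly below the center row, and everything to
        # the right of the center column (including the center for type 'A').
        mask[kH // 2 + 1:, :] = 0
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        self.register_buffer('mask', mask[None, None])  # shape (1, 1, kH, kW)

    def forward(self, x):
        # Apply the mask to the weights so the receptive field stays causal.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Example: a type-A masked convolution over a single-channel 28x28 image.
x = torch.randn(1, 1, 28, 28)
conv = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
print(conv(x).shape)  # torch.Size([1, 16, 28, 28])
```

Stacking several such layers (type 'A' first, then type 'B') grows the causal receptive field while keeping all positions computable in parallel during training.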
Because the context region values are known from the training images, PixelCNN is faster to train thanks to the parallelism of convolutions. However, generation is still slow because the process is inherently sequential: each new pixel can only be produced after all of the pixels that precede it in the ordering have been generated.
From the generated samples on CIFAR-10 (left) and ImageNet (right), we see these models are able to capture the distribution of the training data to some extent, yet the generated samples do not look like natural images. Later models like flow-based deep generative models are able to strike a better balance between training and generation efficiency, and generate better quality images.
In summary, PixelRNN and PixelCNN models explicitly compute likelihood, and thus are relatively easy to optimize. The major drawback of these models is the sequential generation process which is time consuming. There have been follow-up efforts on improving PixelCNN performance, ranging from architecture changes to training tricks.
We introduce a new latent variable $z$ and define the data likelihood by marginalizing over it:

$$p_{\theta}(x) = \int p_{\theta}(z) \cdot p_{\theta}(x \mid z) ~dz$$
In other words, all pixels of an image are assumed to be conditionally independent of each other given the latent variable $z$.
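Written out, this conditional independence assumption means the conditional distribution over an image with $n$ pixels factorizes as

$$p_{\theta}(x \mid z) = \prod_{j=1}^{n} p_{\theta}(x_j \mid z)$$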
At a high level, the goal of an autoencoder is to learn a lower-dimensional feature representation from unlabeled training data. The "encoder" component of an autoencoder compresses the input data into a lower-dimensional feature vector $z$; the "decoder" component then reconstructs the input from $z$, and the two are trained jointly with a reconstruction loss.
The idea of the dimensionality reduction step is that we want every dimension of the feature vector $z$ to capture a meaningful factor of variation in the data rather than redundant detail.
Now the autoencoder gives a way to effectively represent the underlying structure of the data distribution, which is one of the objectives of generative modeling. However, since we do not know the distribution over the latent space that $z$ lives in, we cannot sample new latent codes, and therefore cannot generate new images with a plain autoencoder.
To be able to sample from the latent space, we take a probabilistic approach to autoencoder models. Assume the training data $\big\{x^{(i)}\big\}_{i=1}^{N}$ is generated from the distribution of an unobserved latent representation $z$. So $x$ follows the conditional distribution given $z$; that is, $x^{(i)} \sim p_{\theta^{\ast}}(x \mid z^{(i)})$. And $z$ itself is sampled from a prior distribution $p_{\theta^{\ast}}(z)$.
With variational autoencoders [Kingma and Welling, 2014], we would like to estimate the true parameters $\theta^{\ast}$ of this generative model from the training data. A natural training objective is to maximize the likelihood of the training data:
$$p_{\theta}(x) = \int p_{\theta}(z) \cdot p_{\theta}(x \mid z) ~dz$$
We note that to train the model, we need to compute this integral, which involves evaluating $p_{\theta}(x \mid z)$ for every possible value of $z$; this is intractable. If we instead work with the posterior density
$$p_{\theta}(z \mid x) = \dfrac{p_{\theta}(x \mid z) \cdot p_{\theta}(z)}{p_{\theta}(x)}$$
we see it is still intractable to compute, because the intractable marginal likelihood $p_{\theta}(x)$ appears in the denominator.
To make training tractable, we instead learn another distribution $q_{\phi}(z \mid x)$, parameterized by $\phi$, that approximates the true posterior $p_{\theta}(z \mid x)$. With this approximate posterior, we can derive a tractable lower bound on the data likelihood that we are able to optimize.
To derive the tractable lower bound, we start from the log likelihood of an observed example:
$$
\begin{aligned}
\log p_{\theta}(x^{(i)}) &= \mathbb{E}_{z \sim q_{\phi}(z \mid x^{(i)})} \Big[\log p_{\theta}(x^{(i)})\Big] \quad \cdots \small\mathsf{(1)} \\
&= \mathbb{E}_{z} \bigg[\log \frac{p_{\theta}(x^{(i)} \mid z) \cdot p_{\theta}(z)}{p_{\theta}(z \mid x^{(i)})}\bigg] \quad \cdots \small\mathsf{(2)} \\
&= \mathbb{E}_{z} \bigg[\log \bigg(\frac{p_{\theta}(x^{(i)} \mid z) \cdot p_{\theta}(z)}{p_{\theta}(z \mid x^{(i)})} \cdot \frac{q_{\phi}(z \mid x^{(i)})}{q_{\phi}(z \mid x^{(i)})}\bigg)\bigg] \quad \cdots \small\mathsf{(3)} \\
&= \mathbb{E}_{z} \Big[\log p_{\theta}(x^{(i)} \mid z) \Big] - \mathbb{E}_{z} \bigg[\log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z)}\bigg] + \mathbb{E}_{z} \bigg[\log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z \mid x^{(i)})}\bigg] \quad \cdots \small\mathsf{(4)} \\
&= \mathbb{E}_{z} \Big[\log p_{\theta}(x^{(i)} \mid z) \Big] - D_{\mathrm{KL}} \Big(q_{\phi}(z \mid x^{(i)}) \parallel p_{\theta}(z)\Big) + D_{\mathrm{KL}} \Big(q_{\phi}(z \mid x^{(i)}) \parallel p_{\theta}(z \mid x^{(i)})\Big) \quad \cdots \small\mathsf{(5)}
\end{aligned}
$$
- Step $\mathrm{(1)}$: the true data likelihood $p_{\theta}(x^{(i)})$ does not depend on $z$, so taking the expectation with respect to the estimated posterior $q_{\phi}(z \mid x^{(i)})$ leaves it unchanged; moreover, since $q_{\phi}(z \mid x^{(i)})$ is represented by a neural network, we are able to sample from the distribution $q_{\phi}$.
- Step $\mathrm{(2)}$: by Bayes' rule:

$$
\begin{aligned}
& p_{\theta}(z \mid x) = \dfrac{p_{\theta}(x \mid z) \cdot p_{\theta}(z)}{p_{\theta}(x)} \\
\Longrightarrow \quad & p_{\theta}(x) = \dfrac{p_{\theta}(x \mid z) \cdot p_{\theta}(z)}{p_{\theta}(z \mid x)}
\end{aligned}
$$
- Step $\mathrm{(3)}$: multiplying the expression by $1 = \dfrac{q_{\phi}(z \mid x^{(i)})}{q_{\phi}(z \mid x^{(i)})}$
- Step $\mathrm{(4)}$: by logarithm properties as well as linearity of expectation:

$$
\begin{aligned}
&~ \mathbb{E}_{z} \bigg[\log \bigg(\frac{p_{\theta}(x^{(i)} \mid z) \cdot p_{\theta}(z)}{p_{\theta}(z \mid x^{(i)})} \cdot \frac{q_{\phi}(z \mid x^{(i)})}{q_{\phi}(z \mid x^{(i)})}\bigg)\bigg] \\
=&~ \mathbb{E}_{z} \bigg[\log \bigg(p_{\theta}(x^{(i)} \mid z) \cdot \frac{p_{\theta}(z)}{q_{\phi}(z \mid x^{(i)})} \cdot \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z \mid x^{(i)})}\bigg)\bigg] \\
=&~ \mathbb{E}_{z} \bigg[\log p_{\theta}(x^{(i)} \mid z) + \log \frac{p_{\theta}(z)}{q_{\phi}(z \mid x^{(i)})} + \log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z \mid x^{(i)})}\bigg] \\
=&~ \mathbb{E}_{z} \bigg[\log p_{\theta}(x^{(i)} \mid z) - \log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z)} + \log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z \mid x^{(i)})}\bigg] \\
=&~ \mathbb{E}_{z} \bigg[\log p_{\theta}(x^{(i)} \mid z)\bigg] - \mathbb{E}_{z}\bigg[\log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z)}\bigg] + \mathbb{E}_{z}\bigg[\log \frac{q_{\phi}(z \mid x^{(i)})}{p_{\theta}(z \mid x^{(i)})}\bigg]
\end{aligned}
$$
- Step $\mathrm{(5)}$: by the definition of the Kullback–Leibler divergence. The KL divergence gives a measure of the "distance" between two distributions.
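For reference, for two distributions $q$ and $p$ over the latent variable $z$, the KL divergence is defined as

$$D_{\mathrm{KL}}\big(q(z) \parallel p(z)\big) = \mathbb{E}_{z \sim q}\bigg[\log \frac{q(z)}{p(z)}\bigg] = \int q(z) \log \frac{q(z)}{p(z)}~dz$$

and it is always non-negative, equaling zero only when the two distributions coincide.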
We see the first term $\mathbb{E}_{z} \Big[\log p_{\theta}(x^{(i)} \mid z) \Big]$ involves reconstructing the input data from latent variables sampled from $q_{\phi}(z \mid x^{(i)})$; we can estimate this expectation by sampling, and the sampling step can be made differentiable through the reparameterization trick described below.
The second term is the KL divergence between the approximate posterior and the prior (a Gaussian distribution). Assuming the approximate posterior takes on a Gaussian form with a diagonal covariance matrix, the KL divergence has an analytical closed-form solution.
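Concretely, with the common choices $p_{\theta}(z) = \mathcal{N}(0, I)$ and $q_{\phi}(z \mid x^{(i)}) = \mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big)$ over a $d$-dimensional latent space, the KL term reduces to

$$D_{\mathrm{KL}} \Big(q_{\phi}(z \mid x^{(i)}) \parallel p_{\theta}(z)\Big) = \frac{1}{2} \sum_{j=1}^{d} \Big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\Big)$$

which can be computed and differentiated directly from the encoder outputs.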
The third term is the KL divergence between the approximate posterior and the true posterior. Even though it is intractable to compute, by the non-negativity of KL divergence we know this term is non-negative.
Therefore we obtain a tractable lower bound on the log likelihood of the data:
$$\log p_{\theta}(x^{(i)}) = \underbrace{\mathbb{E}_{z} \Big[\log p_{\theta}(x^{(i)} \mid z) \Big] - D_{\mathrm{KL}} \Big(q_{\phi}(z \mid x^{(i)}) \parallel p_{\theta}(z)\Big)}_{\mathcal{L}(x^{(i)}; \theta, \phi)} + \underbrace{D_{\mathrm{KL}} \Big(q_{\phi}(z \mid x^{(i)}) \parallel p_{\theta}(z \mid x^{(i)})\Big)}_{\geqslant 0}$$
We note that the first two terms constitute $\mathcal{L}(x^{(i)}; \theta, \phi)$, a tractable lower bound on the log likelihood (often called the evidence lower bound, or ELBO); since the last KL term is non-negative, $\log p_{\theta}(x^{(i)}) \geqslant \mathcal{L}(x^{(i)}; \theta, \phi)$. Training the variational autoencoder then amounts to maximizing this lower bound over the training set: $\theta^{\ast}, \phi^{\ast} = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(x^{(i)}; \theta, \phi)$.
The lower bound can also be interpreted in terms of the two network components: the KL term encourages the encoder $q_{\phi}(z \mid x)$ to produce an approximate posterior close to the prior, while the expectation term encourages the decoder $p_{\theta}(x \mid z)$ to reconstruct the input data well.
For a given input, we first use the encoder network to generate the mean $\mu_{z \mid x}$ and the (diagonal) covariance $\Sigma_{z \mid x}$ of the approximate posterior $q_{\phi}(z \mid x)$, and then sample $z$ from this Gaussian.
Next we compute the gradient of the expectation term $\mathbb{E}_{z} \Big[\log p_{\theta}(x^{(i)} \mid z) \Big]$. Since the sampling step itself is not differentiable, we use the reparameterization trick: we write $z = \mu_{z \mid x} + \Sigma_{z \mid x}^{1/2} \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that the randomness is isolated in $\epsilon$ and gradients can flow through $\mu_{z \mid x}$ and $\Sigma_{z \mid x}$ back to the encoder.
Lastly, we use the decoder network to produce the pixel-wise conditional distribution $p_{\theta}(x \mid z)$, from which we compute the reconstruction likelihood of the input.
For every minibatch of input data, we compute the forward pass and then perform the back-propagation.
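Below is a minimal sketch of one such training step in PyTorch, assuming small fully connected encoder and decoder networks, a diagonal Gaussian posterior, and a Bernoulli decoder (so the reconstruction term becomes a binary cross-entropy). The architecture, dimensions, and data are placeholder assumptions rather than the setup from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Toy fully connected VAE with a diagonal Gaussian posterior."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim))               # logits of p(x|z)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow back through mu and sigma.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def elbo_loss(x, logits, mu, logvar):
    # Reconstruction term E_z[log p(x|z)] (negated), Bernoulli decoder.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # negative lower bound, to be minimized

# One optimization step on a random minibatch (placeholder data).
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)  # stand-in for a batch of flattened images
logits, mu, logvar = model(x)
loss = elbo_loss(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```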
To generate new images after training, we take a sample $z$ from the prior $p_{\theta}(z)$ and pass it through the decoder network to obtain $p_{\theta}(x \mid z)$, from which we sample a new image.
Since we assumed a diagonal prior for $z$, its dimensions are independent of each other and tend to encode different factors of variation in the data.
After training the model on the MNIST dataset, we discover that varying the sampled latent variable produces smooth transitions between digits, for example from 6 to 9 through 7.
Similarly, we also find that different dimensions of the latent variable $z$ encode different, interpretable factors of variation.
From the above generated samples on CIFAR-10 (left) and labeled face images (right), we see that newly generated images are similar to the original ones. However, these generated images are still blurry, and generating high quality images remains an active area of research.
We would like to train a model to directly generate high quality samples without modeling any explicit density function. This is the approach taken by generative adversarial networks (GANs).
To overcome this challenge, we recognize that although we cannot sample directly from the complex, high-dimensional data distribution, we can sample a latent variable $z$ from a simple prior and learn a generator network that transforms $z$ into an image; the general objective is that all the images generated from this latent space should be indistinguishable from real training images. To provide a learning signal for this objective, we introduce a second network, the discriminator, which is trained to tell real training images apart from generated (fake) ones.
We can then use the output of the discriminator network to compute gradients and back-propagate them into the generator network to gradually improve the image generation process. Over time, the learning signal from the discriminator informs the generator on how to produce more "realistic" samples. Similarly, as images from the generator become closer and closer to the real training data, the discriminator adapts its decision boundary to fit the training data distribution better. The generator thus effectively learns to model the data distribution without explicitly defining it.
In summary:
- discriminator network: try to distinguish between real and fake images
- generator network: try to fool the discriminator by generating real-looking images
Training a GAN can be formulated as the minimax optimization of a two-player adversarial game. Assume that the discriminator outputs a likelihood in $(0, 1)$ that a given sample is real. The joint objective is then:
$$\min_{\theta_g} \max_{\theta_d} \Big\{\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log \underbrace{D_{\theta_d}(x)}_{\mathsf{(1)}}\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - \underbrace{D_{\theta_d}(G_{\theta_g}(z))}_{\mathsf{(2)}}\big)\big]\Big\}$$
- $\mathsf{(1)}$: $D_{\theta_d}(x)$ is the discriminator output (score) for real data $x$
- $\mathsf{(2)}$: $D_{\theta_d}(G_{\theta_g}(z))$ is the discriminator output (score) for generated fake data $G_{\theta_g}(z)$
The inner maximization is the discriminator objective. The discriminator aims to find a maximizer $\theta_d$ such that $D_{\theta_d}(x)$ is close to 1 for real data and $D_{\theta_d}(G_{\theta_g}(z))$ is close to 0 for generated data.
The outer minimization is the generator objective. The generator aims to find a minimizer $\theta_g$ such that $D_{\theta_d}(G_{\theta_g}(z))$ is close to 1; that is, it wants the discriminator to mistake its samples for real data.
Naively, we could alternate between maximization and minimization by performing gradient ascent on discriminator:
$$\max_{\theta_d} \Big\{\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log \big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]\Big\}$$
and gradient descent on generator:
$$\min_{\theta_g} \Big\{\mathbb{E}_{z \sim p(z)}\big[\log \big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]\Big\}$$
However, we note that when a sample is judged likely fake, so that $D_{\theta_d}(G_{\theta_g}(z))$ is close to 0, the term $\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)$ is nearly flat and provides only a weak gradient to the generator; the gradient is strong only when the sample already looks realistic. In other words, the generator learns slowly exactly when its samples are worst.
To remedy this problem, we now maximize likelihood of the discriminator being wrong, as opposed to minimizing the likelihood of it being correct:
$$\max_{\theta_g} \Big\{\mathbb{E}_{z \sim p(z)}\big[\log D_{\theta_d}\big(G_{\theta_g}(z)\big)\big]\Big\}$$
The goal of fooling the discriminator remains unchanged, yet this formulation gives a higher gradient signal to the generator for samples the discriminator deems unrealistic, which improves training in practice.
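The difference between the two generator objectives can be seen numerically. The sketch below uses made-up discriminator scores for confidently rejected fake samples and compares the gradient magnitude of each loss with respect to those scores:

```python
import torch

# Stand-in for discriminator outputs D(G(z)) on a batch of generated samples;
# values near 0 mean the discriminator is confident the samples are fake.
d_fake = torch.tensor([0.01, 0.02, 0.05], requires_grad=True)

# Original (saturating) generator loss: minimize log(1 - D(G(z))).
loss_sat = torch.log(1.0 - d_fake).mean()
# Non-saturating generator loss: minimize -log D(G(z)).
loss_nonsat = -torch.log(d_fake).mean()

# Compare gradient magnitudes with respect to the discriminator scores: the
# non-saturating loss gives much larger gradients for confidently-fake samples.
(g_sat,) = torch.autograd.grad(loss_sat, d_fake, retain_graph=True)
(g_nonsat,) = torch.autograd.grad(loss_nonsat, d_fake)
print(g_sat)     # small magnitudes (flat region of log(1 - D))
print(g_nonsat)  # large magnitudes (steep region of -log D)
```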
The full training procedure alternates these two updates:

for number of training iterations do:

for $k$ steps do:
- Sample minibatch of $m$ noise samples $\{z^{(1)}, \cdots, z^{(m)}\}$ from the noise prior $p(z)$
- Sample minibatch of $m$ examples $\{x^{(1)}, \cdots, x^{(m)}\}$ from the data generating distribution $p_{\mathrm{data}}(x)$
- Update the discriminator by ascending its stochastic gradient:

$$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Big[\log D_{\theta_d}(x^{(i)}) + \log \big(1 - D_{\theta_d}(G_{\theta_g}(z^{(i)}))\big)\Big]$$

end for
- Sample minibatch of $m$ noise samples $\{z^{(1)}, \cdots, z^{(m)}\}$ from the noise prior $p(z)$
- Update the generator by ascending its stochastic gradient of the improved objective:

$$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log D_{\theta_d}\big(G_{\theta_g}(z^{(i)})\big)$$

end for
Here $k$ is a hyperparameter controlling how many discriminator updates are performed per generator update; the original GAN paper used $k = 1$ to keep training inexpensive.
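The procedure can be sketched compactly in PyTorch as follows; the fully connected generator and discriminator, the hyperparameters, and the placeholder data are all illustrative assumptions, and the generator step uses the improved (non-saturating) objective:

```python
import torch
import torch.nn as nn

z_dim, x_dim, m, k = 64, 784, 128, 1  # latent size, data size, batch size, D steps

G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Tanh())       # generator
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())        # discriminator
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8  # keeps the logs numerically stable

def sample_real(m):
    # Placeholder for minibatches from p_data(x); substitute a real data loader.
    return torch.rand(m, x_dim) * 2 - 1

for it in range(1000):
    # k discriminator steps: ascend the minimax objective
    # (implemented as descent on its negation).
    for _ in range(k):
        x_real = sample_real(m)
        x_fake = G(torch.randn(m, z_dim)).detach()  # do not update G here
        d_loss = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1 - D(x_fake) + eps).mean())
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()
    # One generator step on the non-saturating objective:
    # maximize log D(G(z)), i.e. minimize -log D(G(z)).
    g_loss = -torch.log(D(G(torch.randn(m, z_dim))) + eps).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```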
After training, we use the generator network to generate images. Specifically, we first draw a sample $z$ from the prior $p(z)$ and then pass it through the generator network to produce a new image $G_{\theta_g}(z)$.
From the generated samples, we see that GANs can produce high quality samples, and the nearest-neighbor comparison indicates the model does not simply memorize exact images from the training data. Training sets from left to right: MNIST, Toronto Face Dataset (TFD), and CIFAR-10. The highlighted columns show the nearest training example of the neighboring generated sample.
There have been numerous follow-up studies on improving sample quality, training stability, and other aspects of GANs. The ICLR 2016 paper [Radford et al., 2015] proposed deep convolutional architectures and a set of design guidelines (deep convolutional generative adversarial networks, or DCGANs) to achieve better image quality and training stability:
- Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
- Use batchnorm in both the generator and the discriminator.
- Remove fully connected hidden layers for deeper architectures.
- Use ReLU activation in generator for all layers except for the output, which uses Tanh.
- Use LeakyReLU activation in the discriminator for all layers.
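As an illustration of these guidelines, below is a minimal sketch of a DCGAN-style generator; the channel widths and the 64x64 output resolution are assumptions chosen for simplicity rather than the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Maps a latent vector z to a 3x64x64 image using only fractionally
    strided (transposed) convolutions, batch norm, ReLU, and a Tanh output."""
    def __init__(self, z_dim=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # z (treated as a 1x1 spatial map) -> 4x4
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(inplace=True),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(inplace=True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(inplace=True),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(inplace=True),
            # 32x32 -> 64x64, Tanh output in [-1, 1]
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh())

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

z = torch.randn(16, 100)
print(DCGANGenerator()(z).shape)  # torch.Size([16, 3, 64, 64])
```

The discriminator mirrors this structure with strided convolutions, LeakyReLU activations, and no fully connected hidden layers.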
Generated samples from DCGANs trained on the LSUN bedrooms dataset show promising improvements, as the model can produce high resolution and high quality images without memorizing (overfitting) training examples.
Similar to VAEs, we are also able to find structure in the latent space and meaningfully interpolate between random points in the latent space. This means we observe smooth semantic changes in the generated images along any direction of the manifold, which suggests the model has learned relevant representations (as opposed to memorizing training examples).
The above figure shows smooth transitions between a series of random points in the latent space, interpolating along each row.
Additionally, we can also perform arithmetic on the latent vectors $z$. For instance, the DCGAN paper shows that taking the mean latent vector for images of a smiling woman, subtracting the mean vector for a neutral woman, and adding the mean vector for a neutral man yields a vector that decodes to images of a smiling man.
Arithmetic is performed on the mean vectors, and the resulting vector is fed into the generator to produce the center sample on the right hand side. The remaining samples around the center are produced by adding small uniform noise to the resulting vector.
We note that the same arithmetic performed pixel-wise in image space does not behave similarly; it only yields noisy overlaps of the inputs due to misalignment. Therefore the latent representations learned by the model, together with such vector arithmetic, have the potential to compactly model the conditional generative process of complex image distributions.
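A sketch of how such latent arithmetic might be carried out in code is shown below; the generator is an untrained stand-in, the latent exemplars are random placeholders (in practice they would be latent vectors whose decoded images show the desired concept), and the noise scale is arbitrary:

```python
import torch
import torch.nn as nn

# Untrained placeholder standing in for a trained generator network.
G = nn.Sequential(nn.Linear(100, 3 * 64 * 64), nn.Tanh())

# Latent vectors previously found to produce images of each concept
# (three exemplars per concept, following the mean-vector idea above).
z_smiling_woman = torch.randn(3, 100)
z_neutral_woman = torch.randn(3, 100)
z_neutral_man = torch.randn(3, 100)

# Arithmetic on the mean latent vectors.
z_center = (z_smiling_woman.mean(0) - z_neutral_woman.mean(0)
            + z_neutral_man.mean(0))

# Decode the center vector plus small uniform perturbations around it
# (noise scale chosen arbitrarily for illustration).
z_batch = z_center + 0.25 * (torch.rand(9, 100) * 2 - 1)
images = G(z_batch).view(9, 3, 64, 64)
print(images.shape)  # torch.Size([9, 3, 64, 64])
```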
Beyond DCGAN, follow-up research has explored many directions, for example:
- new loss function (LSGAN): Mao et al., Least Squares Generative Adversarial Networks, 2016
- new training methods:
- Wasserstein GAN: Arjovsky et al., Wasserstein GAN, 2017
- Improved Wasserstein GAN: Gulrajani et al., Improved Training of Wasserstein GANs, 2017
- Progressive GAN: Karras et al., Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017
- source-to-target domain transfer (CycleGAN): Zhu et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017
- text-to-image synthesis: Reed et al., Generative Adversarial Text to Image Synthesis, 2016
- image-to-image translation (Pix2pix): Isola et al., Image-to-Image Translation with Conditional Adversarial Networks, 2016
- high-resolution and high-quality generations (BigGAN): Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018
- scene graphs to GANs: Johnson et al., Image Generation from Scene Graphs, 2018
- benchmark for generative models: Zhou, Gordon, Krishna et al., HYPE: Human eYe Perceptual Evaluations, 2019
- many more: "the GAN zoo"