A Brief Survey of Generative Models
A high-level summary of various generative models, including Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and their notable extensions and generalizations, such as f-GAN, Adversarial Variational Bayes (AVB), Wasserstein GAN, Wasserstein Auto-Encoder (WAE), and Cramer GAN.
Overview
This page is a high-level summary of various generative models with only brief explanations. The models covered are as follows:
Variational Autoencoders (VAE)
- Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
Adversarial Variational Bayes (AVB)
Extension of VAE to non-Gaussian encoders
- Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. "Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks." arXiv preprint arXiv:1701.04722 (2017).
Generative Adversarial Networks (GAN)
- Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
Generalized divergence minimization GAN (f-GAN)
- Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. "f-GAN: Training generative neural samplers using variational divergence minimization." Advances in Neural Information Processing Systems. 2016.
Wasserstein GAN (WGAN)
- Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
Adversarial Autoencoders (AAE)
- Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).
Wasserstein Auto-Encoder (WAE)
- Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
Cramer GAN
- Bellemare, Marc G., et al. "The Cramer Distance as a Solution to Biased Wasserstein Gradients." arXiv preprint arXiv:1705.10743 (2017).
VAE
Model setup:
- Recognition model (encoder): $q_\phi(z|x)$
- Assumed fixed prior: $p(z) = \mathcal{N}(z; 0, I)$
- Generation model (decoder): $p_\theta(x|z)$
- Implied (but intractable) posterior: $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$
Key equations:
Optimization objective (the evidence lower bound, ELBO):
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Gradient-friendly Monte Carlo estimator (using the reparameterization $z^{(l)} = g_\phi(\epsilon^{(l)}, x)$, $\epsilon^{(l)} \sim p(\epsilon)$ introduced below):
$$\tilde{\mathcal{L}}(\theta, \phi; x) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\left(x \,|\, z^{(l)}\right) - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Difficulties in calculating $\mathcal{L}$
- Due to the generality of $q_\phi(z|x)$ and $p_\theta(x|z)$ (typically parameterized by neural networks), the expectation in $\mathcal{L}$ does not have an analytical form, so we need to resort to Monte Carlo estimation.
- Furthermore, directly sampling $z$ from $q_\phi(z|x)$ makes it difficult to take derivatives with respect to the parameters $\phi$ that parameterize the distribution.
Solution: Reparameterization Trick
Find a smooth and invertible transformation $g_\phi(\epsilon, x)$ such that $z = g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$, where the noise distribution $p(\epsilon)$ does not depend on $\phi$. Then
$$\mathbb{E}_{q_\phi(z|x)}\left[f(z)\right] = \mathbb{E}_{p(\epsilon)}\left[f\left(g_\phi(\epsilon, x)\right)\right] \approx \frac{1}{L}\sum_{l=1}^{L} f\left(g_\phi(\epsilon^{(l)}, x)\right),$$
and gradients with respect to $\phi$ can flow through the deterministic map $g_\phi$.
For the Normal distribution used here ($q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x) I)$), the transformation is simply $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
For a total of $N$ data points, the full-data bound is estimated from a minibatch of size $M$ as $\mathcal{L}(\theta, \phi; X) \approx \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})$.
For sufficiently large batch size $M$ (around 100 in the paper), the number of samples $L$ per data point can be set to 1.
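To make the estimator concrete, here is a minimal PyTorch sketch of the reparameterized single-sample ($L = 1$) ELBO with a Gaussian encoder and Bernoulli decoder; the layer sizes and the Bernoulli likelihood are illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Minimal VAE: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)       # mu_phi(x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log sigma_phi(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients reach (mu, log_var) through a deterministic map.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        logits = self.dec(z)
        # Negative ELBO = reconstruction term (x assumed in [0, 1])
        #               + analytical KL(q_phi(z|x) || N(0, I)).
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return (recon + kl) / x.shape[0]
```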
Using non-Gaussian encoders
Todo: discuss AVB paper
Gumbel trick for discrete latent variables
Ref for this section:
- Gumbel max trick: https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
- Balog, Matej, et al. "Lost Relatives of the Gumbel Trick." arXiv preprint arXiv:1706.04161 (2017).
- Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with gumbel-softmax." arXiv preprint arXiv:1611.01144 (2016).
Gumbel distribution: the standard Gumbel has CDF $F(g) = \exp\left(-\exp(-g)\right)$, and a sample can be drawn as $g = -\log(-\log(u))$ with $u \sim \text{Uniform}(0, 1)$. The Gumbel-max trick: $\arg\max_i \left(\log \pi_i + g_i\right)$ with i.i.d. $g_i \sim \text{Gumbel}(0, 1)$ is an exact sample from $\text{Categorical}(\pi)$; replacing the argmax with a temperature-controlled softmax gives the differentiable Gumbel-softmax relaxation.
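A minimal NumPy sketch of both tricks (the temperature value and the three-category example are arbitrary choices for illustration):

```python
import numpy as np

def sample_gumbel(shape, eps=1e-20):
    """Standard Gumbel samples via inverse CDF: g = -log(-log(u)), u ~ Uniform(0,1)."""
    u = np.random.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_max_sample(log_probs):
    """Gumbel-max trick: argmax_i (log pi_i + g_i) is an exact Categorical(pi) sample."""
    return np.argmax(log_probs + sample_gumbel(log_probs.shape))

def gumbel_softmax(log_probs, temperature=0.5):
    """Gumbel-softmax relaxation: differentiable, approaches one-hot as temperature -> 0."""
    y = (log_probs + sample_gumbel(log_probs.shape)) / temperature
    y = y - y.max()                      # for numerical stability
    return np.exp(y) / np.exp(y).sum()   # softmax

# Sanity check: empirical frequencies of Gumbel-max samples should match pi.
pi = np.array([0.1, 0.2, 0.7])
samples = [gumbel_max_sample(np.log(pi)) for _ in range(10000)]
print(np.bincount(samples) / len(samples))  # approximately [0.1, 0.2, 0.7]
```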
f-GAN and GAN
Prelude on f-divergence and its variational lower bound
The f-divergence family is defined as
$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,$$
where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is convex and lower-semicontinuous with $f(1) = 0$.
Every convex, lower-semicontinuous function $f$ has a convex conjugate function (Fenchel conjugate)
$$f^*(t) = \sup_{u \in \mathrm{dom}_f} \left\{ ut - f(u) \right\}.$$
The function $f^*$ is itself convex and lower-semicontinuous, and $f^{**} = f$, so that $f(u) = \sup_{t \in \mathrm{dom}_{f^*}} \left\{ tu - f^*(t) \right\}$.
With this we can establish a lower bound for estimating the f-divergence in general:
$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t \frac{p(x)}{q(x)} - f^*(t) \right\} dx \;\ge\; \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[f^*\left(T(x)\right)\right] \right),$$
where $\mathcal{T}$ is an arbitrary class of functions $T: \mathcal{X} \to \mathbb{R}$; the inequality comes from swapping the supremum and the integral and from restricting the supremum to the class $\mathcal{T}$.
The bound is tight for $T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right)$.
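As a quick worked example, the conjugate of the forward-KL generator $f(u) = u \log u$ follows directly from the definition:
$$f^*(t) = \sup_{u > 0}\left\{ut - u\log u\right\}; \qquad \frac{\partial}{\partial u}\left(ut - u\log u\right) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{t-1} \;\Rightarrow\; f^*(t) = e^{t-1},$$
and the tight critic is $T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right) = 1 + \log\frac{p(x)}{q(x)}$, matching the tables further below.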
Generative adversarial training
Suppose our goal is to come up with a distribution $Q_\theta$, defined implicitly by a neural sampler $x = G_\theta(z)$ with $z \sim p(z)$, that approximates the true data distribution $P$.
To be specific:
- Evaluating $p_\theta(x|z)$ at any $(x, z)$ is easy, but integrating it over $z$ to obtain the model density $q_\theta(x)$ is hard due to the lack of an easy functional form.
- For the data distribution $P$, we do not know how to evaluate its density at any $x$.
- Sampling from both $Q_\theta$ and $P$ is easy: drawing from the data set approximates sampling from $P$, and the model takes random vectors $z$ as input, which are easy to produce.
Since we can sample from both distributions easily, the variational lower bound above can be estimated from samples alone, which leads to the saddle-point objective
$$\min_\theta \max_\omega F(\theta, \omega) = \mathbb{E}_{x \sim P}\left[T_\omega(x)\right] - \mathbb{E}_{x \sim Q_\theta}\left[f^*\left(T_\omega(x)\right)\right].$$
To ensure that the output of the critic lies in $\mathrm{dom}_{f^*}$, parameterize $T_\omega(x) = g_f\left(V_\omega(x)\right)$, where $V_\omega: \mathcal{X} \to \mathbb{R}$ is an unconstrained neural network and $g_f: \mathbb{R} \to \mathrm{dom}_{f^*}$ is a divergence-specific output activation function.
GAN
For the original GAN, the implied divergence is similar to Jensen-Shannon (equal to $2\, D_{JS} - \log 4$), with generator function
$$f(u) = u \log u - (u + 1) \log(u + 1),$$
with output activation $g_f(v) = -\log\left(1 + e^{-v}\right)$ and conjugate $f^*(t) = -\log\left(1 - e^{t}\right)$. Writing $D_\omega(x) = \frac{1}{1 + e^{-V_\omega(x)}}$ for the sigmoid discriminator, so that $T_\omega(x) = \log D_\omega(x)$ and $-f^*\left(T_\omega(x)\right) = \log\left(1 - D_\omega(x)\right)$, this corresponds to the familiar minimax objective
$$\min_\theta \max_\omega \; \mathbb{E}_{x \sim P}\left[\log D_\omega(x)\right] + \mathbb{E}_{x \sim Q_\theta}\left[\log\left(1 - D_\omega(x)\right)\right].$$
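A minimal PyTorch sketch of the resulting alternating optimization, using the non-saturating "log trick" for the generator update (see the practical notes below); the architectures, sizes, and learning rates are placeholders rather than values from any of the papers.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784  # illustrative sizes
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))        # generator, defines Q_theta
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))    # discriminator logits V_omega
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    batch = x_real.shape[0]
    # Discriminator step: maximize E_P[log D(x)] + E_Q[log(1 - D(G(z)))].
    z = torch.randn(batch, z_dim)
    x_fake = G(z).detach()
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + bce(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step, non-saturating "log trick": maximize E_Q[log D(G(z))]
    # instead of minimizing E_Q[log(1 - D(G(z)))], which gives stronger gradients early on.
    z = torch.randn(batch, z_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```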
Practical considerations in adversarial training
Todo: log trick, DCGAN heuristics
Example divergences and their related functions

Name | Generator $f(u)$ | $D_f(P \,\|\, Q)$ | Optimal critic $T^*(x)$
---|---|---|---
Forward KL | $u \log u$ | $\int p(x) \log \frac{p(x)}{q(x)}\, dx$ | $1 + \log \frac{p(x)}{q(x)}$
Reverse KL | $-\log u$ | $\int q(x) \log \frac{q(x)}{p(x)}\, dx$ | $-\frac{q(x)}{p(x)}$
Jensen-Shannon | $u \log u - (u+1) \log \frac{u+1}{2}$ | $\int p(x) \log \frac{2 p(x)}{p(x)+q(x)} + q(x) \log \frac{2 q(x)}{p(x)+q(x)}\, dx$ | $\log \frac{2 p(x)}{p(x)+q(x)}$
GAN | $u \log u - (u+1) \log(u+1)$ | $\int p(x) \log \frac{p(x)}{p(x)+q(x)} + q(x) \log \frac{q(x)}{p(x)+q(x)}\, dx$ | $\log \frac{p(x)}{p(x)+q(x)}$

(With these generators, the Jensen-Shannon and GAN rows are scaled/shifted versions of the conventional $D_{JS}$: $D_f = 2\, D_{JS}$ and $D_f = 2\, D_{JS} - \log 4$ respectively.)

Name | Conjugate $f^*(t)$ | $\mathrm{dom}_{f^*}$ | Output activation $g_f(v)$ | $f'(1)$
---|---|---|---|---
Forward KL | $e^{t-1}$ | $\mathbb{R}$ | $v$ | $1$
Reverse KL | $-1 - \log(-t)$ | $t < 0$ | $-e^{-v}$ | $-1$
Jensen-Shannon | $-\log\left(2 - e^{t}\right)$ | $t < \log 2$ | $\log 2 - \log\left(1 + e^{-v}\right)$ | $0$
GAN | $-\log\left(1 - e^{t}\right)$ | $t < 0$ | $-\log\left(1 + e^{-v}\right)$ | $-\log 2$
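The pairs in the second table translate directly into code. Below is a small PyTorch sketch (the function and variable names are my own, not from the f-GAN paper) that builds the saddle objective for a few of the divergences above; `v_real` and `v_fake` are the raw critic outputs $V_\omega(x)$ on real and generated samples.

```python
import math
import torch
import torch.nn.functional as F

# (output activation g_f, conjugate f*) pairs from the table above.
# The critic outputs raw values v = V_w(x); the final critic value is T(x) = g_f(v).
F_DIVERGENCES = {
    "forward_kl":     (lambda v: v,                              lambda t: torch.exp(t - 1)),
    "reverse_kl":     (lambda v: -torch.exp(-v),                 lambda t: -1 - torch.log(-t)),
    "jensen_shannon": (lambda v: math.log(2.0) - F.softplus(-v), lambda t: -torch.log(2.0 - torch.exp(t))),
    "gan":            (lambda v: -F.softplus(-v),                lambda t: -torch.log(1.0 - torch.exp(t))),
}

def fgan_losses(v_real, v_fake, name="jensen_shannon"):
    """Losses for the saddle objective F(theta, w) = E_P[g_f(V(x))] - E_Q[f*(g_f(V(x)))].
    The critic maximizes F (so it minimizes -F); the generator minimizes F, and only
    the second term depends on the generator parameters."""
    g_f, f_star = F_DIVERGENCES[name]
    critic_value = g_f(v_real).mean() - f_star(g_f(v_fake)).mean()
    critic_loss = -critic_value
    generator_loss = -f_star(g_f(v_fake)).mean()
    return critic_loss, generator_loss
```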
WGAN and WAE
Optimal transport (OT)
Kantorovich formulated the optimization target in optimal transport problems as follows:
$$W_c(P, Q) = \inf_{\Gamma \in \mathcal{P}(P, Q)} \mathbb{E}_{(x, y) \sim \Gamma}\left[c(x, y)\right],$$
where $c(x, y)$ is a cost function and $\mathcal{P}(P, Q)$ is the set of all joint distributions (couplings) of $(x, y)$ whose marginals are $P$ and $Q$ respectively.
Wasserstein distance
When the cost is a power of a metric, $c(x, y) = d(x, y)^p$ with $p \ge 1$, the quantity $W_p(P, Q) = \left(\inf_{\Gamma} \mathbb{E}_{\Gamma}\left[d(x, y)^p\right]\right)^{1/p}$ is called the $p$-Wasserstein distance.
The optimization problem is highly intractable in general, due to the constraint on the marginals of the coupling. However, when $p = 1$, the Kantorovich-Rubinstein duality gives
$$W_1(P, Q) = \sup_{\|T\|_L \le 1} \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[T(x)\right],$$
where the supremum is over all 1-Lipschitz functions $T: \mathcal{X} \to \mathbb{R}$.
The family of divergences of this form, $\sup_{T \in \mathcal{F}} \mathbb{E}_{P}\left[T(x)\right] - \mathbb{E}_{Q}\left[T(x)\right]$ for some function class $\mathcal{F}$, is known as the integral probability metrics (IPM).
Wasserstein GAN (WGAN)
Following the dual form of $W_1$, WGAN solves
$$\min_\theta \max_{w \,:\, \|f_w\|_L \le K} \mathbb{E}_{x \sim P}\left[f_w(x)\right] - \mathbb{E}_{z \sim p(z)}\left[f_w\left(G_\theta(z)\right)\right],$$
where the critic $f_w$ is constrained to be $K$-Lipschitz, so the inner maximum estimates $K \cdot W_1$ up to the constant $K$.
Practical considerations for WGAN
Todo: weight clipping to enforce a K-Lipschitz constraint on the critic; soft gradient penalty (WGAN-GP)
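As a sketch of the "soft gradient penalty" option mentioned above (a WGAN-GP style regularizer instead of the original paper's weight clipping; `lambda_gp = 10` and the 2-D input shape are assumptions for illustration):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    """Soft penalty in the spirit of WGAN-GP: instead of clipping weights, push the
    critic's gradient norm toward 1 at points interpolated between real and fake samples."""
    batch = x_real.shape[0]
    alpha = torch.rand(batch, 1, device=x_real.device)          # assumes inputs of shape (batch, features)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# One critic step then looks like (the critic maximizes E_P[f_w(x)] - E_Q[f_w(G(z))]):
#   loss = critic(x_fake).mean() - critic(x_real).mean() + gradient_penalty(critic, x_real, x_fake)
```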
Wasserstein Auto-encoder (WAE)
Rather than working with the dual form of the Wasserstein distance, which only holds for $W_1$, WAE works with the primal (Kantorovich) form. For a deterministic decoder $x = G(z)$ with latent prior $P_Z$, Tolstikhin et al. show that the coupling can be parameterized through an encoder $Q(z|x)$:
$$W_c(P_X, P_G) = \inf_{Q(z|x) \,:\, Q_Z = P_Z} \mathbb{E}_{x \sim P_X}\, \mathbb{E}_{z \sim Q(z|x)}\left[c\left(x, G(z)\right)\right],$$
where $Q_Z(z) = \mathbb{E}_{x \sim P_X}\left[Q(z|x)\right]$ is the aggregated posterior.
The constraint put on the aggregated posterior, $Q_Z = P_Z$, is relaxed into a penalty, giving the WAE objective
$$D_{WAE}(P_X, P_G) = \inf_{Q(z|x)} \mathbb{E}_{x \sim P_X}\, \mathbb{E}_{z \sim Q(z|x)}\left[c\left(x, G(z)\right)\right] + \lambda \, \mathcal{D}_Z(Q_Z, P_Z).$$
If the divergence between $Q_Z$ and $P_Z$ is driven to zero, the constraint is recovered and the objective matches the true OT cost.
Note: if the decoder is probabilistic instead of deterministic, we would only have $W_c(P_X, P_G) \le \inf_{Q_Z = P_Z} \mathbb{E}_{P_X}\, \mathbb{E}_{Q(z|x)}\left[c\left(x, G(z)\right)\right]$, so we are minimizing an upper bound of the true OT cost.
Thought: the original paper used the JS divergence for $\mathcal{D}_Z$ (estimated with a discriminator in latent space); how about using the Wasserstein distance for $\mathcal{D}_Z$ instead?
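For concreteness, here is a minimal PyTorch sketch of the penalized objective with an MMD penalty for $\mathcal{D}_Z$ (the paper's WAE-MMD variant favours an inverse multiquadratic kernel; the RBF kernel, deterministic encoder, squared-error cost, and $\lambda = 10$ used here are simplifying assumptions):

```python
import torch

def rbf_mmd2(z_q, z_p, sigma2=1.0):
    """Biased MMD^2 estimate with an RBF kernel between encoded codes z_q ~ Q_Z
    and prior samples z_p ~ P_Z."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2.0 * sigma2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2.0 * k(z_q, z_p).mean()

def wae_mmd_loss(x, encoder, decoder, lam=10.0):
    """Reconstruction cost c(x, G(z)) plus lambda * MMD^2(Q_Z, P_Z) with a Gaussian prior."""
    z_q = encoder(x)                     # deterministic encoder for simplicity
    z_p = torch.randn_like(z_q)          # samples from the prior P_Z = N(0, I)
    recon = (x - decoder(z_q)).pow(2).sum(dim=1).mean()   # squared-error cost c
    return recon + lam * rbf_mmd2(z_q, z_p)
```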
Todo: discuss connections to AAE paper