A Brief Survey of Generative Models
A high-level summary of various generative models, including Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and their notable extensions and generalizations, such as f-GAN, Adversarial Variational Bayes (AVB), Wasserstein GAN, Wasserstein Auto-Encoder (WAE), and Cramer GAN.
Overview
This page is a high-level summary of various generative models with only brief explanations. The models covered are as follows:
Variational Autoencoders (VAE)
- Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
Adversarial Variational Bayes (AVB)
Extension of VAE to non-Gaussian encoders
- Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. "Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks." arXiv preprint arXiv:1701.04722 (2017).
Generative Adversarial Networks (GAN)
- Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
Generalized divergence minimization GAN (f-GAN)
- Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. "f-GAN: Training generative neural samplers using variational divergence minimization." Advances in Neural Information Processing Systems. 2016.
Wasserstein GAN (WGAN)
- Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
Adversarial Autoencoders (AAE)
- Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).
Wasserstein Auto-Encoder (WAE)
- Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).
Cramer GAN
- Bellemare, Marc G., et al. "The Cramer Distance as a Solution to Biased Wasserstein Gradients." arXiv preprint arXiv:1705.10743 (2017).
VAE
Model setup:
- Recognition model (encoder): $q_\phi(z|x)$
- Assumed fixed prior: $p(z) = \mathcal{N}(z; 0, I)$
- Generation model (decoder): $p_\theta(x|z)$
- Implied (but intractable) posterior: $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$
Key equations:
Optimization objective (the evidence lower bound, ELBO):
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Gradient-friendly Monte Carlo estimator (using the reparameterization $z^{(l)} = g_\phi(\epsilon^{(l)}, x)$, $\epsilon^{(l)} \sim p(\epsilon)$ introduced below):
$$\tilde{\mathcal{L}}(\theta, \phi; x) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\left(x \,|\, z^{(l)}\right) - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
Difficulties in calculating $\mathcal{L}$
- Due to the generality of $q_\phi(z|x)$ and $p_\theta(x|z)$ (typically parameterized by neural networks), the expectation in $\mathcal{L}$ does not have an analytical form, so we need to resort to Monte Carlo estimation.
- Furthermore, directly sampling $z$ from $q_\phi(z|x)$ makes it difficult to take derivatives with respect to the parameters $\phi$ that parameterize the distribution.
Solution: Reparameterization Trick
Find a smooth and invertible transformation $g_\phi(\epsilon, x)$ such that $z = g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$, where the noise distribution $p(\epsilon)$ does not depend on $\phi$. Then
$$\mathbb{E}_{q_\phi(z|x)}\left[f(z)\right] = \mathbb{E}_{p(\epsilon)}\left[f\left(g_\phi(\epsilon, x)\right)\right] \approx \frac{1}{L}\sum_{l=1}^{L} f\left(g_\phi(\epsilon^{(l)}, x)\right),$$
and gradients with respect to $\phi$ can flow through the deterministic map $g_\phi$.
For the Normal distribution used here ($q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi^2(x) I)$), the transformation is simply $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
For a total of $N$ data points, the full-data bound is estimated from a minibatch of size $M$ as $\mathcal{L}(\theta, \phi; X) \approx \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})$.
For sufficiently large batch size $M$ (around 100 in the paper), the number of samples $L$ per data point can be set to 1.
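To make the estimator concrete, here is a minimal PyTorch sketch of the reparameterized single-sample ($L = 1$) ELBO with a Gaussian encoder and Bernoulli decoder; the layer sizes and the Bernoulli likelihood are illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVAE(nn.Module):
    """Minimal VAE: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)       # mu_phi(x)
        self.log_var = nn.Linear(h_dim, z_dim)  # log sigma_phi(x)^2
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients reach (mu, log_var) through a deterministic map.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        logits = self.dec(z)
        # Negative ELBO = reconstruction term (x assumed in [0, 1])
        #               + analytical KL(q_phi(z|x) || N(0, I)).
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return (recon + kl) / x.shape[0]
```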
Using non-Gaussian encoders
Todo: discuss AVB paper
Gumbel trick for discrete latent variables
Ref for this section:
- Gumbel max trick: https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
- Balog, Matej, et al. "Lost Relatives of the Gumbel Trick." arXiv preprint arXiv:1706.04161 (2017).
- Jang, Eric, Shixiang Gu, and Ben Poole. "Categorical reparameterization with gumbel-softmax." arXiv preprint arXiv:1611.01144 (2016).
Gumbel distribution: the standard Gumbel has CDF $F(g) = \exp\left(-\exp(-g)\right)$, and a sample can be drawn as $g = -\log(-\log(u))$ with $u \sim \text{Uniform}(0, 1)$. The Gumbel-max trick: $\arg\max_i \left(\log \pi_i + g_i\right)$ with i.i.d. $g_i \sim \text{Gumbel}(0, 1)$ is an exact sample from $\text{Categorical}(\pi)$; replacing the argmax with a temperature-controlled softmax gives the differentiable Gumbel-softmax relaxation.
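A minimal NumPy sketch of both tricks (the temperature value and the three-category example are arbitrary choices for illustration):

```python
import numpy as np

def sample_gumbel(shape, eps=1e-20):
    """Standard Gumbel samples via inverse CDF: g = -log(-log(u)), u ~ Uniform(0,1)."""
    u = np.random.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_max_sample(log_probs):
    """Gumbel-max trick: argmax_i (log pi_i + g_i) is an exact Categorical(pi) sample."""
    return np.argmax(log_probs + sample_gumbel(log_probs.shape))

def gumbel_softmax(log_probs, temperature=0.5):
    """Gumbel-softmax relaxation: differentiable, approaches one-hot as temperature -> 0."""
    y = (log_probs + sample_gumbel(log_probs.shape)) / temperature
    y = y - y.max()                      # for numerical stability
    return np.exp(y) / np.exp(y).sum()   # softmax

# Sanity check: empirical frequencies of Gumbel-max samples should match pi.
pi = np.array([0.1, 0.2, 0.7])
samples = [gumbel_max_sample(np.log(pi)) for _ in range(10000)]
print(np.bincount(samples) / len(samples))  # approximately [0.1, 0.2, 0.7]
```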
f-GAN and GAN
Prelude on f-divergence and its variational lower bound
The f-divergence family is defined as
$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,$$
where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is convex and lower-semicontinuous with $f(1) = 0$.
Every convex, lower-semicontinuous function $f$ has a convex conjugate function (Fenchel conjugate)
$$f^*(t) = \sup_{u \in \mathrm{dom}_f} \left\{ ut - f(u) \right\}.$$
The function $f^*$ is itself convex and lower-semicontinuous, and $f^{**} = f$, so that $f(u) = \sup_{t \in \mathrm{dom}_{f^*}} \left\{ tu - f^*(t) \right\}$.
With this we can establish a lower bound for estimating the f-divergence in general:
$$D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x) \sup_{t \in \mathrm{dom}_{f^*}} \left\{ t \frac{p(x)}{q(x)} - f^*(t) \right\} dx \;\ge\; \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[f^*\left(T(x)\right)\right] \right),$$
where $\mathcal{T}$ is an arbitrary class of functions $T: \mathcal{X} \to \mathbb{R}$; the inequality comes from swapping the supremum and the integral and from restricting the supremum to the class $\mathcal{T}$.
The bound is tight for $T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right)$.
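As a quick worked example, the conjugate of the forward-KL generator $f(u) = u \log u$ follows directly from the definition:
$$f^*(t) = \sup_{u > 0}\left\{ut - u\log u\right\}; \qquad \frac{\partial}{\partial u}\left(ut - u\log u\right) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{t-1} \;\Rightarrow\; f^*(t) = e^{t-1},$$
and the tight critic is $T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right) = 1 + \log\frac{p(x)}{q(x)}$, matching the tables further below.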
Generative adversarial training
Suppose our goal is to come up with a distribution $Q_\theta$, defined implicitly by a neural sampler $x = G_\theta(z)$ with $z \sim p(z)$, that approximates the true data distribution $P$.
To be specific:
- Evaluating $p_\theta(x|z)$ at any $(x, z)$ is easy, but integrating it over $z$ to obtain the model density $q_\theta(x)$ is hard due to the lack of an easy functional form.
- For the data distribution $P$, we do not know how to evaluate its density at any $x$.
- Sampling from both $Q_\theta$ and $P$ is easy: drawing from the data set approximates sampling from $P$, and the model takes random vectors $z$ as input, which are easy to produce.
Since we can sample from both distributions easily, the variational lower bound above can be estimated from samples alone, which leads to the saddle-point objective
$$\min_\theta \max_\omega F(\theta, \omega) = \mathbb{E}_{x \sim P}\left[T_\omega(x)\right] - \mathbb{E}_{x \sim Q_\theta}\left[f^*\left(T_\omega(x)\right)\right].$$
To ensure that the output of the critic lies in $\mathrm{dom}_{f^*}$, parameterize $T_\omega(x) = g_f\left(V_\omega(x)\right)$, where $V_\omega: \mathcal{X} \to \mathbb{R}$ is an unconstrained neural network and $g_f: \mathbb{R} \to \mathrm{dom}_{f^*}$ is a divergence-specific output activation function.
GAN
For the original GAN, the implied divergence is similar to Jensen-Shannon (equal to $2\, D_{JS} - \log 4$), with generator function
$$f(u) = u \log u - (u + 1) \log(u + 1),$$
with output activation $g_f(v) = -\log\left(1 + e^{-v}\right)$ and conjugate $f^*(t) = -\log\left(1 - e^{t}\right)$. Writing $D_\omega(x) = \frac{1}{1 + e^{-V_\omega(x)}}$ for the sigmoid discriminator, so that $T_\omega(x) = \log D_\omega(x)$ and $-f^*\left(T_\omega(x)\right) = \log\left(1 - D_\omega(x)\right)$, this corresponds to the familiar minimax objective
$$\min_\theta \max_\omega \; \mathbb{E}_{x \sim P}\left[\log D_\omega(x)\right] + \mathbb{E}_{x \sim Q_\theta}\left[\log\left(1 - D_\omega(x)\right)\right].$$
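A minimal PyTorch sketch of the resulting alternating optimization, using the non-saturating "log trick" for the generator update (see the practical notes below); the architectures, sizes, and learning rates are placeholders rather than values from any of the papers.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784  # illustrative sizes
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))        # generator, defines Q_theta
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))    # discriminator logits V_omega
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    batch = x_real.shape[0]
    # Discriminator step: maximize E_P[log D(x)] + E_Q[log(1 - D(G(z)))].
    z = torch.randn(batch, z_dim)
    x_fake = G(z).detach()
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + bce(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step, non-saturating "log trick": maximize E_Q[log D(G(z))]
    # instead of minimizing E_Q[log(1 - D(G(z)))], which gives stronger gradients early on.
    z = torch.randn(batch, z_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```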
Practical considerations in adversarial training
Todo: log trick, DCGAN heuristics
Example divergences and their related functions

Name | Generator $f(u)$ | $D_f(P \,\|\, Q)$ | Optimal critic $T^*(x)$
---|---|---|---
Forward KL | $u \log u$ | $\int p(x) \log \frac{p(x)}{q(x)}\, dx$ | $1 + \log \frac{p(x)}{q(x)}$
Reverse KL | $-\log u$ | $\int q(x) \log \frac{q(x)}{p(x)}\, dx$ | $-\frac{q(x)}{p(x)}$
Jensen-Shannon | $u \log u - (u+1) \log \frac{u+1}{2}$ | $\int p(x) \log \frac{2 p(x)}{p(x)+q(x)} + q(x) \log \frac{2 q(x)}{p(x)+q(x)}\, dx$ | $\log \frac{2 p(x)}{p(x)+q(x)}$
GAN | $u \log u - (u+1) \log(u+1)$ | $\int p(x) \log \frac{p(x)}{p(x)+q(x)} + q(x) \log \frac{q(x)}{p(x)+q(x)}\, dx$ | $\log \frac{p(x)}{p(x)+q(x)}$

(With these generators, the Jensen-Shannon and GAN rows are scaled/shifted versions of the conventional $D_{JS}$: $D_f = 2\, D_{JS}$ and $D_f = 2\, D_{JS} - \log 4$ respectively.)

Name | Conjugate $f^*(t)$ | $\mathrm{dom}_{f^*}$ | Output activation $g_f(v)$ | $f'(1)$
---|---|---|---|---
Forward KL | $e^{t-1}$ | $\mathbb{R}$ | $v$ | $1$
Reverse KL | $-1 - \log(-t)$ | $t < 0$ | $-e^{-v}$ | $-1$
Jensen-Shannon | $-\log\left(2 - e^{t}\right)$ | $t < \log 2$ | $\log 2 - \log\left(1 + e^{-v}\right)$ | $0$
GAN | $-\log\left(1 - e^{t}\right)$ | $t < 0$ | $-\log\left(1 + e^{-v}\right)$ | $-\log 2$
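The pairs in the second table translate directly into code. Below is a small PyTorch sketch (the function and variable names are my own, not from the f-GAN paper) that builds the saddle objective for a few of the divergences above; `v_real` and `v_fake` are the raw critic outputs $V_\omega(x)$ on real and generated samples.

```python
import math
import torch
import torch.nn.functional as F

# (output activation g_f, conjugate f*) pairs from the table above.
# The critic outputs raw values v = V_w(x); the final critic value is T(x) = g_f(v).
F_DIVERGENCES = {
    "forward_kl":     (lambda v: v,                              lambda t: torch.exp(t - 1)),
    "reverse_kl":     (lambda v: -torch.exp(-v),                 lambda t: -1 - torch.log(-t)),
    "jensen_shannon": (lambda v: math.log(2.0) - F.softplus(-v), lambda t: -torch.log(2.0 - torch.exp(t))),
    "gan":            (lambda v: -F.softplus(-v),                lambda t: -torch.log(1.0 - torch.exp(t))),
}

def fgan_losses(v_real, v_fake, name="jensen_shannon"):
    """Losses for the saddle objective F(theta, w) = E_P[g_f(V(x))] - E_Q[f*(g_f(V(x)))].
    The critic maximizes F (so it minimizes -F); the generator minimizes F, and only
    the second term depends on the generator parameters."""
    g_f, f_star = F_DIVERGENCES[name]
    critic_value = g_f(v_real).mean() - f_star(g_f(v_fake)).mean()
    critic_loss = -critic_value
    generator_loss = -f_star(g_f(v_fake)).mean()
    return critic_loss, generator_loss
```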
WGAN and WAE
Optimal transport (OT)
Kantorovich formulated the optimization target in optimal transport problems as follows:
$$W_c(P, Q) = \inf_{\Gamma \in \mathcal{P}(P, Q)} \mathbb{E}_{(x, y) \sim \Gamma}\left[c(x, y)\right],$$
where $c(x, y)$ is a cost function and $\mathcal{P}(P, Q)$ is the set of all joint distributions (couplings) of $(x, y)$ whose marginals are $P$ and $Q$ respectively.
Wasserstein distance
When the cost is a power of a metric, $c(x, y) = d(x, y)^p$ with $p \ge 1$, the quantity $W_p(P, Q) = \left(\inf_{\Gamma} \mathbb{E}_{\Gamma}\left[d(x, y)^p\right]\right)^{1/p}$ is called the $p$-Wasserstein distance.
The optimization problem is highly intractable in general, due to the constraint on the marginals of the coupling. However, when $p = 1$, the Kantorovich-Rubinstein duality gives
$$W_1(P, Q) = \sup_{\|T\|_L \le 1} \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[T(x)\right],$$
where the supremum is over all 1-Lipschitz functions $T: \mathcal{X} \to \mathbb{R}$.
The family of divergences of this form, $\sup_{T \in \mathcal{F}} \mathbb{E}_{P}\left[T(x)\right] - \mathbb{E}_{Q}\left[T(x)\right]$ for some function class $\mathcal{F}$, is known as the integral probability metrics (IPM).
Wasserstein GAN (WGAN)
Following the dual form of $W_1$, WGAN solves
$$\min_\theta \max_{w \,:\, \|f_w\|_L \le K} \mathbb{E}_{x \sim P}\left[f_w(x)\right] - \mathbb{E}_{z \sim p(z)}\left[f_w\left(G_\theta(z)\right)\right],$$
where the critic $f_w$ is constrained to be $K$-Lipschitz, so the inner maximum estimates $K \cdot W_1$ up to the constant $K$.
Practical considerations for WGAN
Todo: weight clipping to enforce a K-Lipschitz constraint on the critic; soft gradient penalty (WGAN-GP)
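As a sketch of the "soft gradient penalty" option mentioned above (a WGAN-GP style regularizer instead of the original paper's weight clipping; `lambda_gp = 10` and the 2-D input shape are assumptions for illustration):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    """Soft penalty in the spirit of WGAN-GP: instead of clipping weights, push the
    critic's gradient norm toward 1 at points interpolated between real and fake samples."""
    batch = x_real.shape[0]
    alpha = torch.rand(batch, 1, device=x_real.device)          # assumes inputs of shape (batch, features)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# One critic step then looks like (the critic maximizes E_P[f_w(x)] - E_Q[f_w(G(z))]):
#   loss = critic(x_fake).mean() - critic(x_real).mean() + gradient_penalty(critic, x_real, x_fake)
```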
Wasserstein Auto-encoder (WAE)
Rather than working with the dual form of the Wasserstein distance, which only holds for $W_1$, WAE works with the primal (Kantorovich) form. For a deterministic decoder $x = G(z)$ with latent prior $P_Z$, Tolstikhin et al. show that the coupling can be parameterized through an encoder $Q(z|x)$:
$$W_c(P_X, P_G) = \inf_{Q(z|x) \,:\, Q_Z = P_Z} \mathbb{E}_{x \sim P_X}\, \mathbb{E}_{z \sim Q(z|x)}\left[c\left(x, G(z)\right)\right],$$
where $Q_Z(z) = \mathbb{E}_{x \sim P_X}\left[Q(z|x)\right]$ is the aggregated posterior.
The constraint put on the aggregated posterior, $Q_Z = P_Z$, is relaxed into a penalty, giving the WAE objective
$$D_{WAE}(P_X, P_G) = \inf_{Q(z|x)} \mathbb{E}_{x \sim P_X}\, \mathbb{E}_{z \sim Q(z|x)}\left[c\left(x, G(z)\right)\right] + \lambda \, \mathcal{D}_Z(Q_Z, P_Z).$$
If the divergence between $Q_Z$ and $P_Z$ is driven to zero, the constraint is recovered and the objective matches the true OT cost.
Note: if the decoder is probabilistic instead of deterministic, we would only have $W_c(P_X, P_G) \le \inf_{Q_Z = P_Z} \mathbb{E}_{P_X}\, \mathbb{E}_{Q(z|x)}\left[c\left(x, G(z)\right)\right]$, so we are minimizing an upper bound of the true OT cost.
Thought: the original paper used the JS divergence for $\mathcal{D}_Z$ (estimated with a discriminator in latent space); how about using the Wasserstein distance for $\mathcal{D}_Z$ instead?
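For concreteness, here is a minimal PyTorch sketch of the penalized objective with an MMD penalty for $\mathcal{D}_Z$ (the paper's WAE-MMD variant favours an inverse multiquadratic kernel; the RBF kernel, deterministic encoder, squared-error cost, and $\lambda = 10$ used here are simplifying assumptions):

```python
import torch

def rbf_mmd2(z_q, z_p, sigma2=1.0):
    """Biased MMD^2 estimate with an RBF kernel between encoded codes z_q ~ Q_Z
    and prior samples z_p ~ P_Z."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2.0 * sigma2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2.0 * k(z_q, z_p).mean()

def wae_mmd_loss(x, encoder, decoder, lam=10.0):
    """Reconstruction cost c(x, G(z)) plus lambda * MMD^2(Q_Z, P_Z) with a Gaussian prior."""
    z_q = encoder(x)                     # deterministic encoder for simplicity
    z_p = torch.randn_like(z_q)          # samples from the prior P_Z = N(0, I)
    recon = (x - decoder(z_q)).pow(2).sum(dim=1).mean()   # squared-error cost c
    return recon + lam * rbf_mmd2(z_q, z_p)
```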
Todo: discuss connections to AAE paper