This post documents a detailed proof of the conditions to achieve optimality for the original GAN.

First, let’s recap the GAN formulation. We have a minimax game

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].
$$

The goal is to maximize $V(D, G)$ over $D$ while $G$ is fixed, and at the same time minimize $V(D, G)$ over $G$ while $D$ is fixed.

In this formulation, both $D$ and $G$ are neural networks (e.g. Multilayer Perceptrons or CNNs) parameterized by $\theta_d$ and $\theta_g$, respectively.
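As a quick illustration of this value function (this is my own sketch, not code from the post; `D`, `G`, `x_real`, and `z_noise` are hypothetical callables and arrays), a Monte Carlo estimate of $V(D, G)$ from finite batches could look like this:

```python
import numpy as np

def value_fn(D, G, x_real, z_noise):
    """Monte Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    D: callable mapping samples to probabilities in (0, 1)
    G: callable mapping noise vectors to generated samples
    x_real: batch of samples drawn from p_data
    z_noise: batch of noise vectors drawn from p_z
    """
    eps = 1e-12  # numerical guard against log(0)
    term_real = np.mean(np.log(D(x_real) + eps))            # E_{x ~ p_data}[log D(x)]
    term_fake = np.mean(np.log(1.0 - D(G(z_noise)) + eps))  # E_{z ~ p_z}[log(1 - D(G(z)))]
    return term_real + term_fake
```

The Discriminator's training step would ascend this quantity in $\theta_d$, while the Generator's step would descend it in $\theta_g$.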

We want to find

  • The condition that $D$ must satisfy to maximize $V(D, G)$, while keeping $G$ fixed (Proposition 1)
  • The condition that $G$ must satisfy to minimize $V(D, G)$, while keeping $D$ fixed (Proposition 2)

A note on notation and assumptions:

  • Both $x$ and $z$ are random vectors (i.e. vectors where each component is a random variable), with $x \sim p_{\text{data}}(x)$ and $z \sim p_z(z)$.

Proposition 1: When $G$ is fixed, the optimal Discriminator is

$$
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.
$$

Proof:

Note that $z$ is a random vector with $z \sim p_z(z)$, so $G(z)$ is also a random vector that follows some distribution, denoted $p_g$. We can therefore rewrite the value function as an integral over $x$ alone:

$$
V(D, G) = \int_x p_{\text{data}}(x) \log D(x)\, dx + \int_z p_z(z) \log\big(1 - D(G(z))\big)\, dz
        = \int_x \Big[ p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big]\, dx.
$$

Because $z$ is integrated out, the integrand only depends on $x$, and $V(D, G)$ is maximized if the expression being integrated,

$$
p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big),
$$

is maximized pointwise for every $x$.

Let $a = p_{\text{data}}(x)$, $b = p_g(x)$, $y = D(x)$, and $f(y) = a \log y + b \log(1 - y)$. Then

$$
f'(y) = \frac{a}{y} - \frac{b}{1 - y} = 0 \iff y = \frac{a}{a + b}.
$$

It follows that $y = \frac{a}{a + b}$ is a necessary condition for achieving the optimum.

We also have $f''(y) = -\frac{a}{y^2} - \frac{b}{(1 - y)^2} < 0$, because $a \geq 0$, $b \geq 0$ (not both zero), and $y \in (0, 1)$. So $f(y)$ achieves its maximum at $y = \frac{a}{a + b}$, i.e. $D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. End of Proof.
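As a quick sanity check of Proposition 1 (a sketch of my own; the values of $a$ and $b$ are arbitrary, not from the post), the script below evaluates $f(y) = a \log y + b \log(1 - y)$ on a grid and confirms the maximizer sits at $a / (a + b)$:

```python
import numpy as np

# Arbitrary non-negative "densities" at a fixed point x (illustrative values only)
a, b = 0.7, 0.3   # a = p_data(x), b = p_g(x)

y = np.linspace(1e-6, 1 - 1e-6, 100_000)   # candidate discriminator outputs D(x)
f = a * np.log(y) + b * np.log(1 - y)      # the pointwise objective from the proof

y_best = y[np.argmax(f)]
print(f"grid maximizer  = {y_best:.4f}")      # ~0.7000
print(f"a / (a + b)     = {a / (a + b):.4f}")  # 0.7000
```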

Proposition 2: When $D$ is optimal and fixed, we have $p_{g^*} = p_{\text{data}}$ almost everywhere, where $p_{g^*}$ is the probability distribution of the samples generated from the optimal Generator $G^*$.

Proof:

When $D$ is optimal and fixed, we have

$$
\begin{aligned}
\max_D V(D, G) &= \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D^*_G(x)\big] + \mathbb{E}_{x \sim p_g}\big[\log\big(1 - D^*_G(x)\big)\big] \\
&= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right].
\end{aligned}
$$

We denote this quantity $C(G) = \max_D V(D, G)$ to indicate that it is now a function of $G$.

Next we will show the relationship between $C(G)$ and the Jensen-Shannon Divergence (between $p_{\text{data}}$ and $p_g$):

$$
\begin{aligned}
C(G) &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right] \\
&= -\log 4 + \mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + \mathrm{KL}\!\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) \\
&= -\log 4 + 2 \cdot \mathrm{JSD}\big(p_{\text{data}} \,\|\, p_g\big).
\end{aligned}
$$

It follows that $C(G) \geq -\log 4$, since $\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) \geq 0$, and $C(G)$ is minimized only when $\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) = 0$, i.e. the two probability distributions are equal almost everywhere: $p_g = p_{\text{data}}$. End of proof.
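To make the last step concrete, here is a small sketch (my own illustration with made-up discrete distributions, not part of the original derivation) that computes $C(G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$ and confirms it bottoms out at $-\log 4$ exactly when $p_g = p_{\text{data}}$:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability bins of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def c_of_g(p_data, p_g):
    """C(G) = -log 4 + 2 * JSD(p_data || p_g) for discrete p_data, p_g."""
    m = 0.5 * (p_data + p_g)                       # the mixture (p_data + p_g) / 2
    jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
    return -np.log(4) + 2 * jsd

p_data = np.array([0.2, 0.5, 0.3])                 # made-up "real" distribution
print(c_of_g(p_data, p_data))                      # -log 4 ~ -1.3863 (the minimum)
print(c_of_g(p_data, np.array([0.6, 0.2, 0.2])))   # strictly greater than -log 4
```

Any $p_g$ that differs from $p_{\text{data}}$ on a set of positive probability yields a strictly positive JSD, and hence a value of $C(G)$ strictly above $-\log 4$.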