Wasserstein GANs

Martin Arjovsky, Soumith Chintala, and Léon Bottou
Courant Institute of Mathematical Sciences / Facebook AI Research

Outline

GANs

GAN stands for Generative Adversarial Network.

[Figure: GAN diagram]

  • GANs are inspired by game theory: two networks (a Generator and a Discriminator) play against each other and get stronger at every round.
  • A GAN is an implicit generative model: the Generator uses the signal (loss) from the Discriminator (a classifier) to implicitly approximate its otherwise intractable objective.

$\begin{aligned}\min_G \max_D L(D, G) & = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]\end{aligned}$
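
A minimal PyTorch-style sketch of one training step of this minimax game, assuming hypothetical networks `G` (noise → sample) and `D` (sample → probability of being real) and their optimizers; the architectures are not specified here.

```python
import torch

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=100):
    """One alternating update of D and G (assumes D outputs a probability in (0, 1))."""
    z = torch.randn(x_real.size(0), z_dim)

    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))]
    # (implemented as minimizing the negative; G(z) is detached so only D is updated).
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1 - D(G(z).detach())).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step (GAN0): minimize E[log(1 - D(G(z)))]
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```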

Review GANs

  • $x$: real data example.
  • $\hat{x}$: fake data example produced by $G(z)$.
  • $z$: noise input usually from a uniform distribution.
  • $y$: a label $\in \{\text{Real}: 1, \text{Fake}: 0\}$.
  • $D$: a discriminator net to estimate $p(y|x)$.
  • $G$: a generator net that outputs a fake example $\hat{x}$.
  • $P_z$: assumed data distribution over noise input $z$.
  • $P_g$: generator distribution over sample $\hat{x}$.
  • $P_r$: ‘real’ data distribution over real sample $x$.
  • Discriminator:
    • $\max_D \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))]$
  • Generator:
    • GAN0: $\min_G \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))]$
    • GAN1: $\min_G \mathbb{E}_{z \sim p_z(z)} [-\log D(G(z))]$ (a short sketch contrasting the two losses follows this list)
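
A minimal sketch, assuming `d_fake` holds the discriminator's probability output $D(G(z))$ for a batch of fakes; the small autograd check illustrates why GAN0 is called the saturating loss (almost no gradient once D confidently rejects fakes) while GAN1 still provides a useful gradient.

```python
import torch

def g_loss_gan0(d_fake):
    """Saturating generator loss: min_G E[log(1 - D(G(z)))]."""
    return torch.log(1 - d_fake).mean()

def g_loss_gan1(d_fake):
    """Non-saturating generator loss: min_G E[-log D(G(z))]."""
    return -torch.log(d_fake).mean()

# Gradient w.r.t. the discriminator logit when D confidently rejects the fake:
logit = torch.tensor([-6.0], requires_grad=True)   # D(G(z)) = sigmoid(-6) ~ 0.0025
d_fake = torch.sigmoid(logit)
g_loss_gan0(d_fake).backward()
print(logit.grad)   # ~ -0.0025 -> GAN0 gradient nearly vanishes

logit.grad = None
d_fake = torch.sigmoid(logit)
g_loss_gan1(d_fake).backward()
print(logit.grad)   # ~ -0.9975 -> GAN1 keeps a useful gradient
```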

KL divergence and JS divergence

$D_{KL}(p_g \| p_r) = \int_x p_g(x) \log \frac{p_g(x)}{p_r(x)} dx$

$D_{KL}(p_r \| p_g) = \int_x p_r(x) \log \frac{p_r(x)}{p_g(x)} dx$

$D_{JS}(p_g \| p_r) = \frac{1}{2} D_{KL}(p_g \| \frac{p_r + p_g}{2}) + \frac{1}{2} D_{KL}(p_r \| \frac{p_r + p_g}{2})$

  • KL divergence measures how one probability distribution diverges from a second, reference probability distribution.
  • KL $\in [0,\infty]$ and is not symmetric: the forward $D_{KL}(p_g \| p_r)$ generally differs from the reversed $D_{KL}(p_r \| p_g)$.
  • JS $\in [0, \log 2]$ (with natural logarithms), is symmetric, and is a smoother measure between two probability distributions (see the NumPy sketch after this list).
  • [Figure: KL divergence illustration]
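
A small NumPy sketch of these definitions for two hypothetical discrete distributions on a shared support (natural logarithms, matching the formulas above).

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """D_JS(p || q) = 0.5*KL(p || m) + 0.5*KL(q || m), with m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_r = np.array([0.1, 0.4, 0.5])
p_g = np.array([0.3, 0.3, 0.4])
print(kl(p_g, p_r), kl(p_r, p_g))   # forward and reversed KL differ (asymmetry)
print(js(p_g, p_r), js(p_r, p_g))   # JS is symmetric and bounded by log 2
```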

Analyze loss function 1

$\begin{aligned}\min_G \max_D L(D, G) & = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))] \\& = \mathbb{E}_{x \sim p_{r}(x)} [\log D(x)] + \mathbb{E}_{x \sim p_g(x)} [\log(1 - D(x))]\end{aligned}$

  • When $G$ is fixed, what is the optimal $D$?
    • Take the derivative of the loss function with respect to $D(x)$ and set it to zero (spelled out below).
    • We get $D^*(x) = \frac{p_{r}(x)}{p_{r}(x) + p_g(x)} \in [0, 1]$.
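
Spelled out: for a fixed $x$, the integrand of $L(D, G)$ as a function of $\tilde{D} = D(x)$ is $f(\tilde{D}) = p_r(x) \log \tilde{D} + p_g(x) \log(1 - \tilde{D})$. Setting $f'(\tilde{D}) = \frac{p_r(x)}{\tilde{D}} - \frac{p_g(x)}{1 - \tilde{D}} = 0$ and solving for $\tilde{D}$ gives $D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}$.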

Analyze loss function 2

  • When we have the optimal $D^*$, what is the loss that $G$ minimizes?

$\begin{aligned} L(G, D^*) & = \int_x \bigg( p_{r}(x) \log(D^*(x)) + p_g(x) \log(1 - D^*(x)) \bigg) dx \\ & = \int_x \bigg( p_{r}(x) \log\frac{p_{r}(x)}{p_{r}(x) + p_g(x)} + p_g(x) \log\Big(1 - \frac{p_{r}(x)}{p_{r}(x) + p_g(x)}\Big) \bigg) dx \\ & = \int_x \bigg( p_{r}(x) \log\frac{p_{r}(x)}{p_{r}(x) + p_g(x)} + p_g(x) \log\frac{p_{g}(x)}{p_{r}(x) + p_g(x)} \bigg) dx \\ & = \int_x \bigg( p_{r}(x) \log\frac{p_{r}(x)}{2\cdot\frac{1}{2}(p_{r}(x) + p_g(x))} + p_g(x) \log\frac{p_{g}(x)}{2\cdot\frac{1}{2}(p_{r}(x) + p_g(x))} \bigg) dx \\ & = \int_x \bigg( p_{r}(x) \Big(\log\frac{p_{r}(x)}{\frac{1}{2}(p_{r}(x) + p_g(x))} - \log 2\Big) + p_g(x) \Big(\log\frac{p_{g}(x)}{\frac{1}{2}(p_{r}(x) + p_g(x))} - \log 2\Big) \bigg) dx \\ & = \int_x \bigg( p_{r}(x) \log\frac{p_{r}(x)}{\frac{1}{2}(p_{r}(x) + p_g(x))} + p_g(x) \log\frac{p_{g}(x)}{\frac{1}{2}(p_{r}(x) + p_g(x))} \bigg) dx - 2\log 2 \\ & = 2 D_{JS}(p_{r} \| p_g) - 2\log 2 \end{aligned}$
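
A quick numerical sanity check of this identity for two hypothetical discrete distributions (natural logarithms):

```python
import numpy as np

p_r = np.array([0.1, 0.4, 0.5])
p_g = np.array([0.3, 0.3, 0.4])

d_star = p_r / (p_r + p_g)                       # optimal discriminator D*(x)
loss = np.sum(p_r * np.log(d_star) + p_g * np.log(1 - d_star))

m = 0.5 * (p_r + p_g)
js = 0.5 * np.sum(p_r * np.log(p_r / m)) + 0.5 * np.sum(p_g * np.log(p_g / m))
print(loss, 2 * js - 2 * np.log(2))              # the two values match
```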

Problem 1: Gradient vanishing

$D_{KL}(p_g \| p_r) = \int_x p_g(x) \log \frac{p_g(x)}{p_r(x)} dx$

$D_{KL}(p_r \| p_g) = \int_x p_r(x) \log \frac{p_r(x)}{p_g(x)} dx$

$D_{JS}(p_g \| p_r) = \frac{1}{2} D_{KL}(p_g \| \frac{p_r + p_g}{2}) + \frac{1}{2} D_{KL}(p_r \| \frac{p_r + p_g}{2})$

  • Now we know that with the optimal $D^*$, minimizing over $G$ is the same as minimizing $D_{JS}(p_r \| p_g)$.

  • There are three different cases to consider when we plug the densities into the JS measure:

  • $p_r(x) = 0$ and $p_g(x) = 0$: the contributions to $D_{KL}$ and $D_{JS}$ are both $0$.

  • $p_r(x) \neq 0, p_g(x) = 0$ or $p_r(x) = 0, p_g(x) \neq 0$: $D_{KL} \to \infty$, while the contribution to $D_{JS}$ is $\log 2$.

  • $p_r(x) \neq 0$ and $p_g(x) \neq 0$: under the manifold assumption below, this overlap barely happens and is negligible. Hence $D_{JS}(p_r \| p_g)$ is stuck at the constant $\log 2$, which gives the generator (almost) no gradient (see the small NumPy illustration after this list).
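
A tiny NumPy illustration of the middle case: when the supports of $p_r$ and $p_g$ are disjoint, $D_{JS}$ stays at $\log 2$ no matter how far apart the two distributions are, so it carries no information about how to move $p_g$ toward $p_r$ (the histograms here are made up for illustration).

```python
import numpy as np

def js(p, q):
    """Jensen-Shannon divergence for discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# p_r sits on bins 0-1; p_g sits on two bins that we slide further and further away.
bins = 10
p_r = np.zeros(bins)
p_r[0:2] = 0.5
for shift in (2, 5, 8):
    p_g = np.zeros(bins)
    p_g[shift:shift + 2] = 0.5
    print(shift, js(p_r, p_g))   # always log 2 ~ 0.6931, independent of the distance
```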

Manifold Assumption

The data distribution lies close to a low-dimensional manifold. Example: consider image data

  • Very high dimensional (1,000,000D)
  • A randomly generated image will almost certainly not look like any real world scene
  • The space of images that occur in nature is almost completely empty
  • Hypothesis: real world images lie on a smooth, low-dimensional manifold
  • [Figures: manifold illustrations]

Assumption: the supports of $P_r$ and $P_g$ lie on low-dimensional manifolds

Support: the support of a real-valued function $f$ is the subset of the domain containing those elements which are not mapped to zero.

  • We now assume the support of $P_r$ lives on a low-dimensional manifold embedded in a higher-dimensional space (the input space).
  • Now think about what the generator net does (a tiny sketch of this mapping follows this list):
    • we first randomly sample $z$, with $\dim(z) \ll \dim(x)$
    • we use $G(z)$ as a non-linear mapping from the $\dim(z)$-dimensional space into the $\dim(x)$-dimensional space
    • so what does $p_g$ eventually represent?
    • since manifold learning is an approach to non-linear dimensionality reduction,
    • $p_g$ is what you get by running such a reduction in reverse: its support is the image of a low-dimensional space, hence itself (at most) $\dim(z)$-dimensional
  • We now assume the supports of $P_r$ and $P_g$ both lie on low-dimensional manifolds:
    • each manifold hardly fills up the whole high-dimensional space
    • so they are almost certainly disjoint; the case where they overlap is negligible
  • [Figure: two low-dimensional manifolds with negligible overlap]
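
A minimal PyTorch sketch (the dimensions and architecture are made up) of why the support of $p_g$ is at most $\dim(z)$-dimensional: every generated sample is the image of a low-dimensional code under one deterministic map $G$.

```python
import torch
import torch.nn as nn

dim_z, dim_x = 2, 1000      # hypothetical: 2-D codes mapped into a 1000-D input space

# A generic non-linear map G: R^{dim_z} -> R^{dim_x}.
G = nn.Sequential(
    nn.Linear(dim_z, 128), nn.ReLU(),
    nn.Linear(128, dim_x),
)

z = torch.randn(10_000, dim_z)      # samples from p_z
x_hat = G(z)                        # samples from p_g

# Every generated point lies on (at most) a 2-D surface embedded in R^1000,
# so the support of p_g cannot fill the ambient space.
print(x_hat.shape)                  # torch.Size([10000, 1000])
```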

Problem 2: Mode collapse and unstable gradient updates

Alternative D loss for min G