Theory behind GAN
codeflysafe

Generation

Target: find the data distribution $P_{data}(x)$.
$x$ is an image (a high-dimensional vector).

Before GAN

Target: use a distribution $P_G(x;\theta)$ to fit the fixed distribution $P_{data}(x)$, making the two as close as possible. For example, $P_G$ can be a Gaussian mixture model, in which case $\theta$ is the means and variances of this model.

  • Given a data distribution $P_{data}(x)$ (we can sample from it)

    $P_{data}(x)$ is unknown, but we can sample from it (i.e., draw samples from an existing database).

  • We have a data distribution $P_G(x;\theta)$ parameterized by $\theta$

    We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$.

  • Sample $\{x^1, x^2, \dots, x^m\}$ from $P_{data}(x)$ and compute $P_G(x^i;\theta)$
  • Find $\theta^*$ maximizing the objective function

Objective Function: Maximum Likelihood Estimation

$$F(\theta) = \prod_{i=1}^{m} P_G(x^i;\theta) \tag{1}$$

To maximize $F(\theta)$, we solve for the $\theta^*$ that attains the optimum:

$$\begin{aligned}
\theta^* &= \arg\mathop{\max}\limits_{\theta}F(\theta) \\
&= \arg\mathop{\max}\limits_{\theta}\sum_{i=1}^{m}\log P_G(x^i;\theta) \\
&\approx \arg\mathop{\max}\limits_{\theta}E_{x\sim P_{data}(x)}[\log P_G(x;\theta)] \\
&= \arg\mathop{\max}\limits_{\theta}\left(\int_x P_{data}(x)\log P_G(x;\theta)dx - \int_x P_{data}(x)\log P_{data}(x)dx\right) \\
&= \arg\mathop{\max}\limits_{\theta}\int_x P_{data}(x)\log\frac{P_G(x;\theta)}{P_{data}(x)}dx \\
&= \arg\mathop{\min}\limits_{\theta} KL(P_{data}\|P_G)
\end{aligned} \tag{2}$$

(In the fourth line we subtract $\int_x P_{data}(x)\log P_{data}(x)dx$, which is a constant independent of $\theta$ and therefore does not change the $\arg\max$.)
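A small numerical sketch of this equivalence (assuming a unit-variance Gaussian family for $P_G$, which is not in the post): scanning $\theta$, the average log-likelihood peaks at the sample mean, the same point that minimizes $KL(P_{data}\|P_G)$ within this family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the (unknown) data distribution P_data = N(2, 1).
data = rng.normal(loc=2.0, scale=1.0, size=5000)

# Candidate parameters theta: the mean of a unit-variance Gaussian P_G.
thetas = np.linspace(0.0, 4.0, 401)

def avg_log_likelihood(theta, x):
    # Average of log P_G(x; theta) for P_G = N(theta, 1).
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

lls = np.array([avg_log_likelihood(t, data) for t in thetas])
theta_star = thetas[np.argmax(lls)]

# The MLE coincides with the sample mean, which is also the
# minimizer of KL(P_data || P_G) within this Gaussian family.
print(theta_star, data.mean())
```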

That is: Maximum Likelihood Estimation is equivalent to minimizing the KL Divergence.

Therefore, the Generator's goal is $G^* = \arg\mathop{\min}\limits_G Div(P_G, P_{data})$, i.e., to minimize the divergence between $P_G$ and $P_{data}$.

This raises a question: how do we define a general $P_G$?

For a more general distribution $P_G$ (e.g., one defined by a neural network), the likelihood may be intractable, i.e., $P_G(x;\theta)$ cannot be computed.

Using GAN

How does GAN deal with this problem? (Generator)

GAN uses a neural network to fit $P_G$: the Generator is a network, and we use it to define the distribution $P_G$.

To learn the generator’s distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z;\theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$.

$z$ is sampled from a prior distribution (e.g., a Gaussian); $x = G(z)$ then follows the distribution $P_G$.
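As a sketch (the layer sizes and weights below are made-up assumptions, not from the post), a generator is just a network that pushes prior noise through deterministic layers; the distribution of its outputs is the implicit $P_G$.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W1, b1, W2, b2):
    """Toy two-layer perceptron G(z; theta): maps noise z to data space."""
    h = np.maximum(0.0, z @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # linear output layer

# Hypothetical sizes: 10-dim noise, 16 hidden units, 784-dim output
# (e.g. a flattened 28x28 image).
z_dim, h_dim, x_dim = 10, 16, 784
theta = (rng.normal(size=(z_dim, h_dim)), np.zeros(h_dim),
         rng.normal(size=(h_dim, x_dim)), np.zeros(x_dim))

# Sampling z from the prior P_z = N(0, I) and pushing it through G
# defines P_G implicitly, with no closed-form likelihood.
z = rng.normal(size=(5, z_dim))
x = generator(z, *theta)
print(x.shape)  # (5, 784)
```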

From the above, the objective function (loss) is some divergence $Div(P_G, P_{data})$,
and the goal is to make this divergence as small as possible, i.e., $G^* = \arg\mathop{\min}\limits_G Div(P_G, P_{data})$.
The key question is: how do we compute this divergence?

How to Compute Divergence (Discriminator) ?

Since $P_{data}$ and $P_G$ are unknown, we cannot compute the divergence directly, but we can sample from them.

We also define a second multilayer perceptron $D(x;\theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $P_G$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$.

Object Function For D

When training $D$, $G$ is fixed:

$$\begin{aligned}
V(G,D) &= E_{x\sim P_{data}(x)}[\log D(x)] + E_{z\sim P_z(z)}[\log(1-D(G(z)))] \\
&= E_{x\sim P_{data}(x)}[\log D(x)] + E_{x\sim P_G(x)}[\log(1-D(x))]
\end{aligned} \tag{3}$$

How should Eq. (3) be interpreted?

  1. When $x$ is sampled from $P_{data}$, the scalar $D(x)$ should be as large as possible (because $x$ is real).
  2. When $x$ is sampled from $P_G$, the scalar $D(x)$ should be as small as possible, i.e., $1-D(x)$ as large as possible (because $x$ comes from the Generator).
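A Monte Carlo sketch of Eq. (3): with samples from both distributions and a hypothetical logistic discriminator (the distributions and discriminator below are illustrative assumptions), $V(G,D)$ is estimated by sample averages. A blind discriminator $D\equiv 0.5$ scores $2\log 0.5 \approx -1.386$; one that separates the two distributions scores higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Hypothetical discriminator: logistic in x, with real data
    # centered at 2 and generated data centered at 0.
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

x_real = rng.normal(2.0, 1.0, size=10000)  # samples from P_data
x_fake = rng.normal(0.0, 1.0, size=10000)  # samples from P_G

# Monte Carlo estimate of V(G, D) from Eq. (3).
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(x_fake)))

V_blind = 2 * np.log(0.5)  # baseline: D(x) = 0.5 everywhere
print(V, V_blind)          # V is <= 0 but above the blind baseline
```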

As discussed above, the Generator aims to make some divergence between $P_G$ and $P_{data}$ as small as possible; in fact, $\mathop{\max}\limits_D V(G,D)$ from Eq. (3) is equivalent to such a divergence.

Proof: $\mathop{\max}\limits_D V(G,D)$ is equivalent to the JS-Divergence

  1. Fix $G$ and solve for $D^*$:

    $$\begin{aligned}
    D^* &= \arg\mathop{\max}\limits_D V(D,G) \\
    &= \arg\mathop{\max}\limits_D\left[\int_x P_{data}(x)\log D(x)dx + \int_x P_G(x)\log(1-D(x))dx\right] \\
    &= \arg\mathop{\max}\limits_D\int_x\left[P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))\right]dx
    \end{aligned} \tag{5}$$

  2. Given $x$, maximize the integrand $P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))$.

    Assume $a = P_{data}(x)$ and $b = P_G(x)$; the problem simplifies to:

    $$\begin{aligned}
    L(D) &= a\log D + b\log(1-D) \\
    \frac{dL(D)}{dD} &= \frac{a}{D} - \frac{b}{1-D} = 0 \\
    D^* &= \frac{a}{a+b}
    \end{aligned} \tag{6}$$

    For any $(a,b)\in\mathbb{R}^2\setminus\{(0,0)\}$, the function $y\mapsto a\log y + b\log(1-y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a+b}$.

  3. Substitute $D^*(x) = \frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$ back into Eq. (3):

$$\begin{aligned}
\mathop{\max}\limits_D V(G,D) &= V(G,D^*) \\
&= \int_x P_{data}(x)\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx + \int_x P_G(x)\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx \\
&= -2\log 2 + \int_x P_{data}(x)\log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx + \int_x P_G(x)\log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx
\end{aligned}$$

The above is equivalent to:

$$\mathop{\max}\limits_D V(G,D) = -2\log 2 + KL\left(P_{data}\Big\|\frac{P_{data}+P_G}{2}\right) + KL\left(P_G\Big\|\frac{P_{data}+P_G}{2}\right) = -2\log 2 + 2\,JSD(P_{data}\|P_G)$$

where $JSD(P\|Q) = \frac{1}{2}KL\left(P\Big\|\frac{P+Q}{2}\right) + \frac{1}{2}KL\left(Q\Big\|\frac{P+Q}{2}\right)$ is the Jensen–Shannon divergence.
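Both the closed-form maximizer in (6) and the resulting JS-divergence identity can be checked numerically on a pair of discrete distributions (a sketch; the specific distributions below are arbitrary assumptions):

```python
import numpy as np

# Two discrete distributions standing in for P_data and P_G.
p = np.array([0.5, 0.3, 0.2])  # P_data
q = np.array([0.1, 0.4, 0.5])  # P_G
m = (p + q) / 2

def kl(u, v):
    return np.sum(u * np.log(u / v))

# 1) Check that D* = a/(a+b) maximizes a*log(y) + b*log(1-y) on (0, 1).
a, b = p[0], q[0]
ys = np.linspace(1e-3, 1 - 1e-3, 9999)
y_star = ys[np.argmax(a * np.log(ys) + b * np.log(1 - ys))]
print(y_star, a / (a + b))  # grid argmax matches a/(a+b)

# 2) V(G, D*) with the optimal discriminator D*(x) = p/(p+q).
v_star = (np.sum(p * np.log(p / (p + q)))
          + np.sum(q * np.log(q / (p + q))))

# Jensen-Shannon divergence of p and q.
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identity: max_D V(G, D) = -2 log 2 + 2 JSD(P_data || P_G).
print(v_star, -2 * np.log(2) + 2 * jsd)
```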

Algorithm

  1. Initialize generator and discriminator
  2. In each training iteration:
    1. Fix G, update D
    2. Fix D, update G

Process

In practice:

Initialize $\theta_d$ for $D$ and $\theta_g$ for $G$

  • In each training iteration

  • Training D, repeat k times

    • Sample m examples $\{x^1, x^2, \dots, x^m\}$ from the database
    • Sample m noise samples $\{z^1, z^2, \dots, z^m\}$ from the prior distribution
    • Obtain generated data $\{\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m\}$ via $\tilde{x}^i = G(z^i)$
    • Fix $\theta_g$ and update $\theta_d$ to maximize $\tilde{V} = \frac{1}{m}\sum_{i=1}^{m}\log D(x^i) + \frac{1}{m}\sum_{i=1}^{m}\log(1-D(\tilde{x}^i))$, i.e., $\theta_d \leftarrow \theta_d + \eta\nabla_{\theta_d}\tilde{V}$
  • Training G, only once

    • Sample m noise samples $\{z^1, z^2, \dots, z^m\}$ from the prior distribution
    • Fix $\theta_d$ and update $\theta_g$ to minimize $\tilde{V} = \frac{1}{m}\sum_{i=1}^{m}\log(1-D(G(z^i)))$, i.e., $\theta_g \leftarrow \theta_g - \eta\nabla_{\theta_g}\tilde{V}$

In practice, when training $G$, the non-saturating objective is used instead: maximize $\frac{1}{m}\sum_{i=1}^{m}\log D(G(z^i))$, which provides much stronger gradients early in training, when $D$ easily rejects generated samples.
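The alternating procedure above can be sketched end-to-end on a 1-D toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the generator is a pure shift $G(z)=z+\theta_g$, the discriminator a single logistic unit, and backpropagation is replaced by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy setup (illustrative): P_data = N(3, 1), and the generator
# G(z; theta_g) = z + theta_g shifts the prior P_z = N(0, 1).
def G(z, theta_g):
    return z + theta_g

def D(x, theta_d):
    w, c = theta_d  # a single logistic unit as discriminator
    return sigmoid(w * x + c)

def V_hat(x_real, x_fake, theta_d):
    eps = 1e-8      # avoid log(0)
    return (np.mean(np.log(D(x_real, theta_d) + eps))
            + np.mean(np.log(1.0 - D(x_fake, theta_d) + eps)))

def num_grad(f, params, h=1e-4):
    """Central finite differences, standing in for backprop here."""
    params = np.asarray(params, dtype=float)
    g = np.zeros_like(params)
    for i in range(params.size):
        up, dn = params.copy(), params.copy()
        up[i] += h
        dn[i] -= h
        g[i] = (f(up) - f(dn)) / (2 * h)
    return g

theta_g, theta_d = 0.0, np.array([0.0, 0.0])
m, lr, k = 256, 0.2, 3

for it in range(300):
    # Training D, repeat k times: gradient ASCENT on V_hat.
    for _ in range(k):
        x_real = rng.normal(3.0, 1.0, size=m)
        x_fake = G(rng.normal(size=m), theta_g)
        theta_d = theta_d + lr * num_grad(
            lambda td: V_hat(x_real, x_fake, td), theta_d)
    # Training G, once: non-saturating trick, ASCENT on mean log D(G(z)).
    z = rng.normal(size=m)
    grad_g = num_grad(
        lambda tg: np.mean(np.log(D(G(z, tg[0]), theta_d) + 1e-8)),
        np.array([theta_g]))
    theta_g = theta_g + lr * grad_g[0]

print(theta_g)  # theta_g moves from 0 toward 3, the mean of P_data
```

Finite differences keep the sketch dependency-free; a real implementation would use an autodiff framework and richer parameterizations for $G$ and $D$.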

  • Title: Theory behind GAN
  • Author: codeflysafe
  • Created: 2020-06-15 12:56:39
  • Link: https://codeflysafe.github.io/2020/06/15/2020-06-15-Theory-behind-GAN/
  • Copyright: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!