Theory behind GAN
codeflysafe

Generation

Target: find the data distribution $P_{data}(x)$.
$x$ is an image (a high-dimensional vector).

Before GAN

Target: use a distribution $P_G(x;\theta)$ to fit the fixed distribution $P_{data}(x)$, making the two as close as possible. For example, $P_G$ can be a Gaussian mixture model, in which case $\theta$ is the means and variances of this model.

  • Given a data distribution $P_{data}(x)$ (we can sample from it)

    $P_{data}(x)$ is unknown, but we can sample from it (i.e., draw samples from an existing database).

  • We have a data distribution $P_G(x;\theta)$ parameterized by $\theta$

    We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$.

  • Sample $\{x^1, x^2, \dots, x^m\}$ from $P_{data}(x)$ and compute $P_G(x^i;\theta)$
  • Find $\theta^*$ maximizing the objective function

Objective Function: Maximum Likelihood Estimation

$$F(\theta) = \prod_{i=1}^{m} P_G(x^i;\theta) \tag{1}$$

To maximize $F(\theta)$, we solve for the $\theta^*$ that attains the optimum:

$$\begin{aligned}
\theta^* &= \arg\mathop{\max}\limits_{\theta}F(\theta) \\
&= \arg\mathop{\max}\limits_{\theta}\sum_{i=1}^{m}\log P_G(x^i;\theta) \\
&\approx \arg\mathop{\max}\limits_{\theta}E_{x\sim P_{data}(x)}[\log P_G(x;\theta)] \\
&= \arg\mathop{\max}\limits_{\theta}\left(\int_x P_{data}(x)\log P_G(x;\theta)dx - \int_x P_{data}(x)\log P_{data}(x)dx\right) \\
&= \arg\mathop{\max}\limits_{\theta}\int_x P_{data}(x)\log\frac{P_G(x;\theta)}{P_{data}(x)}dx \\
&= \arg\mathop{\min}\limits_{\theta} KL(P_{data}\|P_G)
\end{aligned} \tag{2}$$

(In the fourth line we subtract $\int_x P_{data}(x)\log P_{data}(x)dx$, which is a constant independent of $\theta$ and therefore does not change the $\arg\max$.)
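A small numerical sketch of this equivalence (assuming a unit-variance Gaussian family for $P_G$, which is not in the post): scanning $\theta$, the average log-likelihood peaks at the sample mean, the same point that minimizes $KL(P_{data}\|P_G)$ within this family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the (unknown) data distribution P_data = N(2, 1).
data = rng.normal(loc=2.0, scale=1.0, size=5000)

# Candidate parameters theta: the mean of a unit-variance Gaussian P_G.
thetas = np.linspace(0.0, 4.0, 401)

def avg_log_likelihood(theta, x):
    # Average of log P_G(x; theta) for P_G = N(theta, 1).
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

lls = np.array([avg_log_likelihood(t, data) for t in thetas])
theta_star = thetas[np.argmax(lls)]

# The MLE coincides with the sample mean, which is also the
# minimizer of KL(P_data || P_G) within this Gaussian family.
print(theta_star, data.mean())
```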

That is: Maximum Likelihood Estimation is equivalent to minimizing the KL Divergence.

Therefore, the Generator's goal is $G^* = \arg\mathop{\min}\limits_G Div(P_G, P_{data})$, i.e., to minimize the divergence between $P_G$ and $P_{data}$.

This raises a question: how do we define a general $P_G$?

For a more general distribution $P_G$ (e.g., one defined by a neural network), the likelihood may be intractable, i.e., $P_G(x;\theta)$ cannot be computed.

Using GAN

How does GAN deal with this problem? (Generator)

GAN uses a neural network to fit $P_G$: the Generator is a network, and we use it to define the distribution $P_G$.

To learn the generator’s distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z;\theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$.

$z$ is sampled from a prior distribution (e.g., a Gaussian); $x = G(z)$ then follows the distribution $P_G$.
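As a sketch (the layer sizes and weights below are made-up assumptions, not from the post), a generator is just a network that pushes prior noise through deterministic layers; the distribution of its outputs is the implicit $P_G$.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W1, b1, W2, b2):
    """Toy two-layer perceptron G(z; theta): maps noise z to data space."""
    h = np.maximum(0.0, z @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                # linear output layer

# Hypothetical sizes: 10-dim noise, 16 hidden units, 784-dim output
# (e.g. a flattened 28x28 image).
z_dim, h_dim, x_dim = 10, 16, 784
theta = (rng.normal(size=(z_dim, h_dim)), np.zeros(h_dim),
         rng.normal(size=(h_dim, x_dim)), np.zeros(x_dim))

# Sampling z from the prior P_z = N(0, I) and pushing it through G
# defines P_G implicitly, with no closed-form likelihood.
z = rng.normal(size=(5, z_dim))
x = generator(z, *theta)
print(x.shape)  # (5, 784)
```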

From the above, the objective function (loss) is some divergence $Div(P_G, P_{data})$,
and the goal is to make this divergence as small as possible, i.e., $G^* = \arg\mathop{\min}\limits_G Div(P_G, P_{data})$.
The key question is: how do we compute this divergence?

How to Compute Divergence (Discriminator) ?

Since $P_{data}$ and $P_G$ are unknown, we cannot compute the divergence directly, but we can sample from them.

We also define a second multilayer perceptron $D(x;\theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $P_G$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$.

Object Function For D

When training $D$, $G$ is fixed:

$$\begin{aligned}
V(G,D) &= E_{x\sim P_{data}(x)}[\log D(x)] + E_{z\sim P_z(z)}[\log(1-D(G(z)))] \\
&= E_{x\sim P_{data}(x)}[\log D(x)] + E_{x\sim P_G(x)}[\log(1-D(x))]
\end{aligned} \tag{3}$$

How should Eq. (3) be interpreted?

  1. When $x$ is sampled from $P_{data}$, the scalar $D(x)$ should be as large as possible (because $x$ is real).
  2. When $x$ is sampled from $P_G$, the scalar $D(x)$ should be as small as possible, i.e., $1-D(x)$ as large as possible (because $x$ comes from the Generator).
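A Monte Carlo sketch of Eq. (3): with samples from both distributions and a hypothetical logistic discriminator (the distributions and discriminator below are illustrative assumptions), $V(G,D)$ is estimated by sample averages. A blind discriminator $D\equiv 0.5$ scores $2\log 0.5 \approx -1.386$; one that separates the two distributions scores higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Hypothetical discriminator: logistic in x, with real data
    # centered at 2 and generated data centered at 0.
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

x_real = rng.normal(2.0, 1.0, size=10000)  # samples from P_data
x_fake = rng.normal(0.0, 1.0, size=10000)  # samples from P_G

# Monte Carlo estimate of V(G, D) from Eq. (3).
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(x_fake)))

V_blind = 2 * np.log(0.5)  # baseline: D(x) = 0.5 everywhere
print(V, V_blind)          # V is <= 0 but above the blind baseline
```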

As discussed above, the Generator aims to make some divergence between $P_G$ and $P_{data}$ as small as possible; in fact, $\mathop{\max}\limits_D V(G,D)$ from Eq. (3) is equivalent to such a divergence.

Proof: $\mathop{\max}\limits_D V(G,D)$ is equivalent to the JS-Divergence

  1. Fix $G$ and solve for $D^*$:

    $$\begin{aligned}
    D^* &= \arg\mathop{\max}\limits_D V(D,G) \\
    &= \arg\mathop{\max}\limits_D\left[\int_x P_{data}(x)\log D(x)dx + \int_x P_G(x)\log(1-D(x))dx\right] \\
    &= \arg\mathop{\max}\limits_D\int_x\left[P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))\right]dx
    \end{aligned} \tag{5}$$

  2. Given $x$, maximize the integrand $P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))$.

    Assume $a = P_{data}(x)$ and $b = P_G(x)$; the problem simplifies to:

    $$\begin{aligned}
    L(D) &= a\log D + b\log(1-D) \\
    \frac{dL(D)}{dD} &= \frac{a}{D} - \frac{b}{1-D} = 0 \\
    D^* &= \frac{a}{a+b}
    \end{aligned} \tag{6}$$

    For any $(a,b)\in\mathbb{R}^2\setminus\{(0,0)\}$, the function $y\mapsto a\log y + b\log(1-y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a+b}$.

  3. Substitute $D^*(x) = \frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$ back into Eq. (3):

$$\begin{aligned}
\mathop{\max}\limits_D V(G,D) &= V(G,D^*) \\
&= \int_x P_{data}(x)\log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx + \int_x P_G(x)\log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx \\
&= -2\log 2 + \int_x P_{data}(x)\log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx + \int_x P_G(x)\log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx
\end{aligned}$$

The above is equivalent to:

$$\mathop{\max}\limits_D V(G,D) = -2\log 2 + KL\left(P_{data}\Big\|\frac{P_{data}+P_G}{2}\right) + KL\left(P_G\Big\|\frac{P_{data}+P_G}{2}\right) = -2\log 2 + 2\,JSD(P_{data}\|P_G)$$

where $JSD(P\|Q) = \frac{1}{2}KL\left(P\Big\|\frac{P+Q}{2}\right) + \frac{1}{2}KL\left(Q\Big\|\frac{P+Q}{2}\right)$ is the Jensen–Shannon divergence.
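Both the closed-form maximizer in (6) and the resulting JS-divergence identity can be checked numerically on a pair of discrete distributions (a sketch; the specific distributions below are arbitrary assumptions):

```python
import numpy as np

# Two discrete distributions standing in for P_data and P_G.
p = np.array([0.5, 0.3, 0.2])  # P_data
q = np.array([0.1, 0.4, 0.5])  # P_G
m = (p + q) / 2

def kl(u, v):
    return np.sum(u * np.log(u / v))

# 1) Check that D* = a/(a+b) maximizes a*log(y) + b*log(1-y) on (0, 1).
a, b = p[0], q[0]
ys = np.linspace(1e-3, 1 - 1e-3, 9999)
y_star = ys[np.argmax(a * np.log(ys) + b * np.log(1 - ys))]
print(y_star, a / (a + b))  # grid argmax matches a/(a+b)

# 2) V(G, D*) with the optimal discriminator D*(x) = p/(p+q).
v_star = (np.sum(p * np.log(p / (p + q)))
          + np.sum(q * np.log(q / (p + q))))

# Jensen-Shannon divergence of p and q.
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identity: max_D V(G, D) = -2 log 2 + 2 JSD(P_data || P_G).
print(v_star, -2 * np.log(2) + 2 * jsd)
```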

Algorithm

  1. Initialize generator and discriminator
  2. In each training iteration:
    1. Fix G, update D
    2. Fix D, update G

Process

In practice:

Initialize $\theta_d$ for $D$ and $\theta_g$ for $G$

  • In each training iteration

  • Training D, repeat k times

    • Sample m examples $\{x^1, x^2, \dots, x^m\}$ from the database
    • Sample m noise samples $\{z^1, z^2, \dots, z^m\}$ from the prior distribution
    • Obtain generated data $\{\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m\}$ via $\tilde{x}^i = G(z^i)$
    • Fix $\theta_g$ and update $\theta_d$ to maximize $\tilde{V} = \frac{1}{m}\sum_{i=1}^{m}\log D(x^i) + \frac{1}{m}\sum_{i=1}^{m}\log(1-D(\tilde{x}^i))$, i.e., $\theta_d \leftarrow \theta_d + \eta\nabla_{\theta_d}\tilde{V}$
  • Training G, only once

    • Sample m noise samples $\{z^1, z^2, \dots, z^m\}$ from the prior distribution
    • Fix $\theta_d$ and update $\theta_g$ to minimize $\tilde{V} = \frac{1}{m}\sum_{i=1}^{m}\log(1-D(G(z^i)))$, i.e., $\theta_g \leftarrow \theta_g - \eta\nabla_{\theta_g}\tilde{V}$

In practice, when training $G$, the non-saturating objective is used instead: maximize $\frac{1}{m}\sum_{i=1}^{m}\log D(G(z^i))$, which provides much stronger gradients early in training, when $D$ easily rejects generated samples.
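The alternating procedure above can be sketched end-to-end on a 1-D toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the generator is a pure shift $G(z)=z+\theta_g$, the discriminator a single logistic unit, and backpropagation is replaced by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy setup (illustrative): P_data = N(3, 1), and the generator
# G(z; theta_g) = z + theta_g shifts the prior P_z = N(0, 1).
def G(z, theta_g):
    return z + theta_g

def D(x, theta_d):
    w, c = theta_d  # a single logistic unit as discriminator
    return sigmoid(w * x + c)

def V_hat(x_real, x_fake, theta_d):
    eps = 1e-8      # avoid log(0)
    return (np.mean(np.log(D(x_real, theta_d) + eps))
            + np.mean(np.log(1.0 - D(x_fake, theta_d) + eps)))

def num_grad(f, params, h=1e-4):
    """Central finite differences, standing in for backprop here."""
    params = np.asarray(params, dtype=float)
    g = np.zeros_like(params)
    for i in range(params.size):
        up, dn = params.copy(), params.copy()
        up[i] += h
        dn[i] -= h
        g[i] = (f(up) - f(dn)) / (2 * h)
    return g

theta_g, theta_d = 0.0, np.array([0.0, 0.0])
m, lr, k = 256, 0.2, 3

for it in range(300):
    # Training D, repeat k times: gradient ASCENT on V_hat.
    for _ in range(k):
        x_real = rng.normal(3.0, 1.0, size=m)
        x_fake = G(rng.normal(size=m), theta_g)
        theta_d = theta_d + lr * num_grad(
            lambda td: V_hat(x_real, x_fake, td), theta_d)
    # Training G, once: non-saturating trick, ASCENT on mean log D(G(z)).
    z = rng.normal(size=m)
    grad_g = num_grad(
        lambda tg: np.mean(np.log(D(G(z, tg[0]), theta_d) + 1e-8)),
        np.array([theta_g]))
    theta_g = theta_g + lr * grad_g[0]

print(theta_g)  # theta_g moves from 0 toward 3, the mean of P_data
```

Finite differences keep the sketch dependency-free; a real implementation would use an autodiff framework and richer parameterizations for $G$ and $D$.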

  • Title: Theory behind GAN
  • Author: codeflysafe
  • Created: 2020-06-15 12:56:39
  • Link: https://codeflysafe.github.io/2020/06/15/2020-06-15-Theory-behind-GAN/
  • Copyright: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!