Generative Adversarial Networks

Github repo: GANs-Gallery
[Figure: Generator vs Discriminator]
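FID: Fréchet Inception Distance
measures the distance between the distributions of real and generated images in Inception feature space, each approximated by a Gaussian (lower is better)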
$$P_r \sim \mathcal{N}(\mu_r, \Sigma_r), \quad P_g \sim \mathcal{N}(\mu_g, \Sigma_g)$$
$$\mathrm{FID}(P_r, P_g) = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$
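A minimal NumPy sketch of the FID formula, assuming the Inception features have already been extracted into `(N, d)` arrays. The function names `feature_stats` and `fid_from_stats` are made up for illustration. The trace of the matrix square root is computed from the eigenvalues of the product, which is one of several possible ways to evaluate it.

```python
import numpy as np

def feature_stats(features):
    """Mean vector and covariance matrix of an (N, d) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def fid_from_stats(mu_r, sigma_r, mu_g, sigma_g):
    """FID between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Tr((Sigma_r Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of the product; for PSD covariances they are real and
    # non-negative up to round-off, hence the real part and the clip.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

With identical statistics the score is zero, and with equal covariances it reduces to the squared distance between the means.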
KID: Kernel Inception Distance
$$\mathrm{KID}(P_r, P_f) = \mathbb{E}_{x,x'}[k(x,x')] + \mathbb{E}_{y,y'}[k(y,y')] - 2\,\mathbb{E}_{x,y}[k(x,y)]$$
where $x, x' \sim P_r$, $y, y' \sim P_f$ and $k$ is a kernel on Inception features (i.e. the squared MMD between the two feature distributions)
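A rough sketch of an unbiased KID estimate on precomputed feature arrays. The cubic polynomial kernel $k(a, b) = (a \cdot b / d + 1)^3$ is the usual choice for KID but is an assumption here, since the notes do not fix the kernel; `poly_kernel` and `kid` are illustrative names.

```python
import numpy as np

def poly_kernel(a, b, degree=3, coef0=1.0):
    """Polynomial kernel matrix between the rows of a and the rows of b."""
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree

def kid(x, y):
    """Unbiased squared-MMD estimate between feature sets x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    k_xx, k_yy, k_xy = poly_kernel(x, x), poly_kernel(y, y), poly_kernel(x, y)
    # E[k(x, x')] and E[k(y, y')] use distinct pairs only, so the diagonal is dropped.
    e_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    e_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    e_xy = k_xy.mean()
    return float(e_xx + e_yy - 2.0 * e_xy)
```

Because the estimator is unbiased, it can come out slightly negative when both sets are drawn from the same distribution.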
IS: Inception Score
given a set of generated images $\{x_1, x_2, \dots, x_N\}$, the inception score is defined as:
$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim G}\big[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big)$$
where $p(y) = \frac{1}{N}\sum_{i=1}^N p(y \mid x_i)$
high score ⇒ individual images are sharply classified (peaky $p(y \mid x)$) + classes are diverse (flat $p(y)$)
low score ⇒ blurry images (flat $p(y \mid x)$) / low diversity (peaky $p(y)$)
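As a small illustration, the score can be computed from a matrix of class probabilities, one row per generated image. This sketch assumes the rows are already softmax outputs of the classifier; the `eps` guard inside the logarithms is an implementation detail added here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array, row i holds p(y | x_i)."""
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)          # marginal p(y)
    # KL(p(y|x) || p(y)) for every image, then exp of the average.
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The two extremes match the intuition above: uniform predictions give a score of 1, while confident predictions spread evenly over K classes give a score of about K.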
LPIPS: Learned Perceptual Image Patch Similarity
weighted average of distances between feature map outputs from a vision network
 
$$x, x' \in \mathbb{R}^{3, h, w}, \quad y_l = f_l(x), \quad \hat{y}_l = \frac{y_l}{\| y_l \|_2}$$
$$d_l = \frac{1}{H_l W_l} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \| \hat{y}_{l,h,w} - \hat{y}'_{l,h,w} \|_2^2$$
$$\mathrm{LPIPS}(x, x') = \sum_{l \in L} w_l \times d_l$$
G_EMA: 
exponential moving average of the Generator weights
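A minimal sketch of the EMA update, assuming the weights are stored as a dict of NumPy arrays keyed by parameter name. The decay `beta=0.999` is a typical value rather than one taken from these notes.

```python
import numpy as np

def ema_update(ema_weights, weights, beta=0.999):
    """Blend the current generator weights into the running average.

    ema <- beta * ema + (1 - beta) * w, applied per parameter.
    The entries of ema_weights are replaced and the dict is returned.
    """
    for name, w in weights.items():
        ema_weights[name] = beta * ema_weights[name] + (1.0 - beta) * w
    return ema_weights
```

The averaged copy is only used for evaluation and sampling; training keeps updating the raw generator weights.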
overfitting heuristics:
$$r_v = \frac{\mathbb{E}[D_\text{train}] - \mathbb{E}[D_\text{val}]}{\mathbb{E}[D_\text{train}] - \mathbb{E}[D_\text{gen}]}$$
$$r_t = \mathbb{E}[\operatorname{sign}(D_\text{train})]$$
0 ⇒ not overfitting
1 ⇒ overfitting
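Both heuristics are cheap to compute from raw discriminator outputs. The sketch below names them `r_v` (validation-based) and `r_t` (sign-based) and assumes the outputs are unbounded logits collected over a few batches.

```python
import numpy as np

def overfitting_heuristics(d_train, d_val, d_gen):
    """d_*: discriminator outputs on training, validation and generated images."""
    e_train, e_val, e_gen = np.mean(d_train), np.mean(d_val), np.mean(d_gen)
    # Where the validation outputs sit between train (0) and generated (1).
    r_v = (e_train - e_val) / (e_train - e_gen)
    # Fraction-like measure of training outputs that are positive.
    r_t = np.mean(np.sign(d_train))
    return float(r_v), float(r_t)
```

Only `r_t` is usable when no validation split is held out, since it needs nothing but the training outputs.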
[Figures: GAN FID evaluation; GAN training loss]
Losses
non-saturating loss:
$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_\text{data}}[\log D(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
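In code these losses are usually written on logits rather than probabilities. The sketch below assumes $D(x) = \sigma(\text{logit})$, so that $-\log D = \mathrm{softplus}(-\text{logit})$ and $-\log(1 - D) = \mathrm{softplus}(\text{logit})$; the function names are illustrative.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def d_loss(real_logits, fake_logits):
    """-E[log D(x)] - E[log(1 - D(G(z)))] computed from logits."""
    return float(softplus(-real_logits).mean() + softplus(fake_logits).mean())

def g_loss_nonsaturating(fake_logits):
    """-E[log D(G(z))] computed from logits."""
    return float(softplus(-fake_logits).mean())
```

Working on logits avoids taking the log of a probability that has underflowed to zero.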
LSGAN: Least Squares GAN
minimizes Pearson’s $\chi^2$ divergence (for a suitable choice of target labels)
$$\mathcal{L}_D = \frac{1}{2}\,\mathbb{E}_{x \sim p_\text{data}}\big[(D(x)-1)^2\big] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z))-0)^2\big]$$
$$\mathcal{L}_G = \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z))-1)^2\big]$$
Wasserstein GAN
minimizes the 1-Wasserstein distance
$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_\text{data}}\big[D(x)\big] + \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$
$$\mathcal{L}_G = -\mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$
enforcing $D$ to be 1-Lipschitz through weight clipping
RaGAN
$$\mathcal{L}_D = -\mathbb{E}_{x}\left[\log \sigma\left(C(x) - \mathbb{E}_{z}[C(G(z))]\right)\right] - \mathbb{E}_{z}\left[\log \sigma\left(\mathbb{E}_{x}[C(x)] - C(G(z))\right)\right]$$
$$\mathcal{L}_G = -\mathbb{E}_{z}\left[\log \sigma\left(C(G(z)) - \mathbb{E}_{x}[C(x)]\right)\right] - \mathbb{E}_{x}\left[\log \sigma\left(\mathbb{E}_{z}[C(G(z))] - C(x)\right)\right]$$
Regularizers
WGAN-GP
WGAN + a gradient penalty (a soft way of enforcing 1-Lipschitz)
$\hat x$ is sampled by interpolating along the straight line between a real and a fake sample
$$\lambda \times \mathbb{E}_{\hat x \sim P_{\hat x}} \big[\big(\|\nabla_{\hat x} f_w(\hat x) \|_2 - 1\big)^2\big]$$
R1, R2
$$\mathcal{L}_\text{R1} = \frac{\gamma}{2}\, \mathbb{E}_{x \sim P_\text{real}}\big[\|\nabla_x D(x)\|^2\big]$$
$$\mathcal{L}_\text{R2} = \frac{\gamma}{2}\, \mathbb{E}_{z \sim P_z}\big[\|\nabla_x D(G(z))\|^2\big]$$
in practice R1 has stuck around, while R2 turned out to be less stable
path length penalty
$$\mathbf{v} \sim \mathcal{N}(0, 1), \quad \mathcal{J} = \frac{\partial G(w)}{\partial w}$$
$$\mathcal{L}_\text{PLP} = \mathbb{E}_{w, \mathbf{v}}\big[(\|\mathcal{J} \mathbf{v}\|_2 - a)^2\big]$$
measures how strongly the generated image changes under a perturbation of the intermediate latent $w$, and pushes that magnitude towards a constant $a$, tracked as an EMA of the observed path lengths
efficient estimate / alternative: directional derivative
$$s = \big\langle G(w), y_\text{hat} \big\rangle = \sum_{i=1}^{n} G(w)_i \, y_{\text{hat}, i}$$
$$\mathcal{L} = \Big\|\frac{\partial s}{\partial w}\Big\|_2 = \|\mathcal{J}^T y_\text{hat}\|_2$$
$$\mathcal{L}_\text{path length} = \lambda_\text{path length} \times (\mathcal{L} - \mathrm{EMA})^2$$
Training GANs with limited data (ADA): paper
overfitting in GANs:
always augment (with probability p) both the real and the fake images
invertible augmentations: invertible in the sense that the underlying distribution is still learnable
p < 0.8 ⇒ augmentation leaks are unlikely to happen
best observed transformations for small datasets:
pixel blitting
geometric transforms
color transforms
Adaptive Discriminator Augmentation
r_t & r_v: measuring overfitting ⇒ used to adapt p during training
target 0.6 gave consistently good results
evaluate every N steps ⇒ 
define p update speed
update p
clamp to [0, 1]
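The update loop above can be sketched as a single function. The fixed increment `step` is a placeholder for the "p update speed" and is an assumption of this sketch, as is the name `update_p`; the heuristic value passed in may be either `r_t` or `r_v`.

```python
def update_p(p, r, target=0.6, step=0.01):
    """One adaptive update of the augmentation probability p.

    r is the overfitting heuristic measured over the last N steps.
    """
    if r > target:        # too much overfitting: augment more
        p += step
    elif r < target:      # too little: augment less
        p -= step
    return min(max(p, 0.0), 1.0)   # clamp to [0, 1]
```

For example, calling `update_p(0.5, 0.9)` nudges p upward, while a p that is already at 0 or 1 stays inside the valid range.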
Evaluation
PA-GANs: progressive augmentation
WGANs: using the Wasserstein distance + gradient penalty ⇒ enforcing a Lipschitz constraint on D
KID is more informative than FID when training on a small dataset
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium: paper
main points
Generator lr = a(n), Discriminator lr = b(n)
$$\sum_{n=0}^{\infty} a(n) = \infty, \quad \sum_{n=0}^{\infty} b(n) = \infty, \quad \sum_{n=0}^{\infty} a(n)^2 < \infty, \quad \sum_{n=0}^{\infty} b(n)^2 < \infty, \quad \frac{a(n)}{b(n)} \to 0 \quad \text{as } n \to \infty$$
note: D should be updated more frequently, and G should take careful steps: G learns through D’s gradient, thus D should be “near-optimal”
evaluate FID every 1K Discriminator steps
Lipschitz continuity assumed (use ELU or other smooth variants of ReLU, or rely on weight decay for smoothing)
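One concrete pair of step-size sequences that satisfies the conditions above is polynomial decay with a faster-decaying $a(n)$. This is an illustrative choice for checking the conditions by hand, not a schedule taken from the paper:

```latex
a(n) = \frac{1}{n+1}, \qquad b(n) = \frac{1}{(n+1)^{3/4}}

\sum_{n=0}^{\infty} \frac{1}{n+1} = \infty, \qquad
\sum_{n=0}^{\infty} \frac{1}{(n+1)^{3/4}} = \infty
\quad \text{(p-series with } p \le 1\text{)}

\sum_{n=0}^{\infty} \frac{1}{(n+1)^{2}} < \infty, \qquad
\sum_{n=0}^{\infty} \frac{1}{(n+1)^{3/2}} < \infty
\quad \text{(p-series with } p > 1\text{)}

\frac{a(n)}{b(n)} = (n+1)^{-1/4} \to 0 \quad \text{as } n \to \infty
```

In practice constant learning rates are typically used instead, with the two networks simply given different values.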
Wasserstein GAN w/ Gradient Penalty
Wasserstein distance:
1-Wasserstein distance (a.k.a. earth mover’s, how dramatic)
$$\mathcal{W}_1(P_r, P_f) = \inf_{\gamma \in \Pi(P_r, P_f)} \mathbb{E}_{(x, y)\sim\gamma}\big[\|x - y\|\big]$$
Kantorovich-Rubinstein dual form
$$\mathcal{W}_1(P_r, P_f) = \sup_{\|f\|_L\le1}\big\{\mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{y\sim P_f}[f(y)]\big\}$$
parametrizing f as a neural net (the critic maximizes this objective)
$$\mathcal{L}_\text{critic}(w) = \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))] \quad \text{subject to } \|f_w\|_L \le 1$$
enforcing 1-Lipschitz on f
clamping the weights to [-c, c]
gradient penalty
$$\lambda \times \mathbb{E}_{\hat x \sim P_{\hat x}} \big[\big(\|\nabla_{\hat x} f_w(\hat x) \|_2 - 1\big)^2\big]$$
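A framework-free sketch of the gradient penalty, which penalizes the squared deviation of the critic's gradient norm from 1 at points interpolated between real and fake samples. NumPy has no autograd, so the critic's input gradient is passed in as a callable `grad_fn`; that callable, the flattened `(batch, features)` layout and the default `lam=10.0` are assumptions of this sketch.

```python
import numpy as np

def gradient_penalty(x_real, x_fake, grad_fn, lam=10.0, rng=None):
    """x_real, x_fake: (batch, features); grad_fn(x) returns d f_w / d x per sample."""
    rng = np.random.default_rng() if rng is None else rng
    # One interpolation coefficient per sample, broadcast over the features.
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake
    grad_norm = np.linalg.norm(grad_fn(x_hat), axis=1)
    return float(lam * np.mean((grad_norm - 1.0) ** 2))
```

In an autograd framework `grad_fn` would be replaced by differentiating the critic output with respect to `x_hat`, keeping the graph so the penalty itself can be backpropagated.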
Generator objective function
$$\mathcal{L}_G(\theta) = -\mathbb{E}_{z \sim P_z}[f_w(G_\theta(z))]$$
D is called the critic here (making it sound fancy)
unpaired image-to-image translation using CycleGAN: paper
StyleGANs core innovations (super duper cool):
Generator:
mapping network: latent space disentanglement 
$$\mathcal{W} = \mathrm{MLP}(Z)$$
starting from a learned initial `canvas`
Noise injection in style blocks
$$\mathrm{noise} \sim \mathcal{N}(0, 1), \quad x' = x + \alpha \times \mathrm{noise}$$
Modulated Convolution & Style vectors
$$W \in \mathbb{R}^{C_\text{out} \times C_\text{in} \times k \times k}, \quad s \in \mathbb{R}^{1 \times C_\text{in} \times 1 \times 1}$$
$$\tilde{W} = W \times s, \quad \hat{W}_{i,j} = \frac{\tilde{W}_{i,j}}{\sqrt{\sum_{j,k,l} \tilde{W}_{i,j,k,l}^2 + \epsilon}}$$
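The modulation and demodulation steps amount to a broadcasted multiply followed by a per-output-filter normalization. A small NumPy sketch for a single style vector, with the shapes given above (the function name is illustrative):

```python
import numpy as np

def modulate_demodulate(weight, style, eps=1e-8):
    """weight: (C_out, C_in, k, k); style: (1, C_in, 1, 1)."""
    # Modulation: scale every input channel by its style coefficient.
    w_mod = weight * style
    # Demodulation: normalize each output filter over (C_in, k, k).
    norm = np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return w_mod / norm
```

After demodulation every output filter has approximately unit L2 norm, whatever the style values were.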
Equalized Linear layer
$$\mathrm{X} \in \mathbb{R}^{n, \mathrm{fan}_\text{in}}, \quad \mathrm{scale} = \frac{\mathrm{gain}}{\sqrt{\mathrm{fan}_\text{in}}}$$
$$\hat{\mathrm{W}} = \mathrm{W} \times \mathrm{scale}, \quad \mathrm{X}_\text{out} = \mathrm{X} \hat{\mathrm{W}}^T + \mathrm{b}$$
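A minimal NumPy sketch of an equalized linear layer: the weights are stored at unit variance and rescaled on every forward pass. The class name, the standard-normal initialization and the zero bias are assumptions made for illustration.

```python
import numpy as np

class EqualizedLinear:
    def __init__(self, fan_in, fan_out, gain=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.weight = rng.normal(size=(fan_out, fan_in))   # stored unscaled
        self.bias = np.zeros(fan_out)
        self.scale = gain / np.sqrt(fan_in)

    def __call__(self, x):
        # The scale is applied at run time, not baked into the stored weights.
        return x @ (self.weight * self.scale).T + self.bias
```

Applying the scale at run time means every layer sees weights of the same dynamic range, so an adaptive optimizer updates all layers at a comparable effective rate.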
Equalized Convolution
$$\mathrm{X} \in \mathbb{R}^{b, C_\text{in}, h, w}, \quad \mathrm{fan}_\text{in} = C_\text{in} \times k \times k, \quad \mathrm{scale} = \frac{\mathrm{gain}}{\sqrt{\mathrm{fan}_\text{in}}}$$
where $k$ is the kernel size: the fan-in counts the weights feeding one output unit, not the input resolution
Discriminator
batch std
$$x \in \mathbb{R}^{B \times C \times H \times W}, \quad \mu = \frac{1}{B} \sum_{b=1}^B x_b$$
$$\tilde{\sigma} = \sqrt{\frac{1}{B} \sum_{b=1}^B (x_b - \mu)^2 + \epsilon}, \quad \sigma = \frac{1}{C \cdot H \cdot W} \sum_{c=1}^C \sum_{h=1}^H \sum_{w=1}^W \tilde{\sigma}_{c,h,w}$$
$$\sigma \rightarrow \mathbb{R}^{B \times 1 \times H \times W}, \quad x' = x \,\|\, \sigma$$
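The batch-std layer can be sketched in a few lines of NumPy, reading $\|$ as concatenation along the channel axis. This version uses the whole batch as one group, as in the formulas; the function name and the `eps` value are illustrative.

```python
import numpy as np

def minibatch_std(x, eps=1e-8):
    """x: (B, C, H, W) -> (B, C + 1, H, W) with one extra statistics channel."""
    b, _, h, w = x.shape
    mu = x.mean(axis=0, keepdims=True)
    # Standard deviation over the batch for every (c, h, w) position.
    sigma_tilde = np.sqrt(((x - mu) ** 2).mean(axis=0) + eps)
    # Average to a single scalar, then broadcast it to one feature map per sample.
    sigma = sigma_tilde.mean()
    sigma_map = np.full((b, 1, h, w), sigma, dtype=x.dtype)
    return np.concatenate([x, sigma_map], axis=1)
```

The extra channel gives the discriminator a direct signal about batch diversity, which makes a collapsed generator easier to spot.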
Concurrent CUDA streams during training:
the aim is to maximize device usage: luckily, the various penalties / losses can be computed largely independently, with few dependencies between them
To compute:
G loss
D loss
Gradient Penalty
R1 Penalty
Path Length Penalty
Stream 1
fake images →
fake logits x
→ G loss →
D loss
Stream 2
real logits x
R1 penalty →
→ Gradient Penalty
→ Path Length Penalty
→ METRIC: METRIC waits for an event before it is computed
METRIC →: another computation is waiting on METRIC
METRIC x: METRIC is needed to compute another metric