microGAN — From Adversarial Games to StyleGAN

Chapter 1: Generator & Discriminator

The generator takes random noise z (sampled from a simple distribution like a Gaussian) and transforms it into a data sample (e.g., an image). It's a neural network mapping from noise space to data space: G(z) → fake image.

The discriminator takes a data sample (real or generated) and outputs a single number: the probability that the input is real. It's a binary classifier: D(x) → [0, 1]. Architecturally, it's typically a CNN that downsamples the image through strided convolutions until a final dense layer produces the scalar output.

Random Noise z

z ~ N(0, I), typically 128-512 dims

↓ Generator G

Fake Image G(z)

Same dimensions as real data

↓ Discriminator D

Real or Fake?

D(x) ∈ [0, 1]

Noise to Data Mapping

The generator learns a function from noise (left, uniform dots) to data (right, structured distribution). Each orange dot is a noise sample mapped through G. Click "Retrain" to see a different mapping.

Component	Input	Output	Goal
Generator G	Random noise z	Fake data G(z)	Fool D into saying "real"
Discriminator D	Data sample x	P(real) ∈ [0,1]	Correctly classify real vs fake

Concrete Data Flow (64×64 Image GAN)

Let's trace the exact tensor shapes through a DCGAN generating 64×64 RGB images:

Generator path: z [B, 128] sampled from N(0, I) → Dense(128, 4×4×512) → reshape to [B, 512, 4, 4] → TransConv(512→256) → [B, 256, 8, 8] → TransConv(256→128) → [B, 128, 16, 16] → TransConv(128→64) → [B, 64, 32, 32] → TransConv(64→3) → [B, 3, 64, 64] → tanh → pixels in [−1, 1]. Each transposed convolution doubles spatial resolution: 4→8→16→32→64.

Discriminator path: Image [B, 3, 64, 64] → Conv(3→64) → [B, 64, 32, 32] → Conv(64→128) → [B, 128, 16, 16] → Conv(128→256) → [B, 256, 8, 8] → Conv(256→512) → [B, 512, 4, 4] → flatten → Dense → [B, 1] → sigmoid → real/fake probability. The discriminator is the mirror image of the generator.
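The doubling and halving in the two paths follows from the standard output-size formulas for convolutions. Below is a minimal sketch that assumes the kernel=4, stride=2, padding=1 settings this chapter uses; the helper names `conv_out` and `tconv_out` are made up for illustration.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def tconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Generator: start at 4x4 and apply four transposed convolutions.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(tconv_out(g_sizes[-1]))

# Discriminator: start at 64x64 and apply four strided convolutions.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out(d_sizes[-1]))

print(g_sizes)  # [4, 8, 16, 32, 64]
print(d_sizes)  # [64, 32, 16, 8, 4]
```

With these settings the transposed-convolution formula reduces to 2 × size and the convolution formula to size / 2, which is why the two paths mirror each other.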

Each hidden layer in the generator uses: TransposedConv2d (stride=2, padding=1, kernel=4) → BatchNorm → ReLU. Each hidden layer in the discriminator uses: Conv2d (stride=2, padding=1, kernel=4) → BatchNorm → LeakyReLU(0.2). Following the DCGAN recipe, BatchNorm is left off the generator's output layer and the discriminator's first layer. The final generator layer uses tanh (output range [−1, 1]), and the final discriminator layer uses sigmoid (output range [0, 1]). Real images are normalized to [−1, 1] to match.
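The layer recipe can be sketched as two small PyTorch modules. This is one possible reading of the description, not a reference implementation: the class names and the `up`/`down` helpers are made up here, and skipping BatchNorm on the discriminator's first layer follows the usual DCGAN convention rather than anything stated above.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.fc = nn.Linear(z_dim, 4 * 4 * 512)

        def up(c_in, c_out):
            # TransposedConv -> BatchNorm -> ReLU, doubling resolution
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.net = nn.Sequential(
            up(512, 256), up(256, 128), up(128, 64),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # pixels in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 512, 4, 4)
        return self.net(x)


class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()

        def down(c_in, c_out, norm=True):
            # Conv -> (BatchNorm) -> LeakyReLU, halving resolution
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return nn.Sequential(*layers)

        self.net = nn.Sequential(
            down(3, 64, norm=False), down(64, 128), down(128, 256), down(256, 512),
        )
        self.fc = nn.Linear(512 * 4 * 4, 1)

    def forward(self, x):
        h = self.net(x).flatten(1)
        return torch.sigmoid(self.fc(h))  # probability that x is real


G, D = Generator(), Discriminator()
fake = G(torch.randn(8, 128))
scores = D(fake)
print(fake.shape, scores.shape)
```

The final shapes should be [8, 3, 64, 64] for the fake batch and [8, 1] for the scores, matching the data flow traced above.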

Check: What does the generator take as input?

- Random noise z from a simple distribution
- Real images
- Text descriptions

Chapter 2: The Min-Max Game

The GAN training objective is a minimax game. The discriminator tries to maximize V(D,G), the generator tries to minimize it. At the Nash equilibrium, the generator produces perfect fakes and the discriminator outputs 0.5 for everything ("I can't tell").

min_G max_D V(D,G) = E[log D(x)] + E[log(1 − D(G(z)))]

Breaking this down: D wants to maximize log D(x) (output 1 for real data) AND maximize log(1 − D(G(z))) (output 0 for fakes). G wants to minimize log(1 − D(G(z))) (make D output 1 for fakes).

D's perspective: Maximize V. "I want D(real) = 1 and D(fake) = 0." This is just binary cross-entropy.

G's perspective: Minimize V. "I want D(G(z)) = 1." Make the discriminator think fakes are real.
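A quick numerical check of the claim that D's side of the game is binary cross-entropy: for a single real sample and a single fake sample, the value V is the negative of the summed BCE losses. The scores 0.9 and 0.2 are made-up values for illustration.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


d_real = 0.9   # D's output on a real sample (illustrative)
d_fake = 0.2   # D's output on a generated sample (illustrative)

value = math.log(d_real) + math.log(1 - d_fake)   # one-sample estimate of V(D, G)
d_loss = bce(d_real, 1) + bce(d_fake, 0)          # what a BCE-trained D minimizes

print(value, -d_loss)  # the two numbers agree
```

Maximizing V over D and minimizing the BCE loss are therefore the same optimization.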

Min-Max Landscape

D(G(z)) is the discriminator's output on fake data. See how the loss changes for each player. The teal curve is D's loss; the orange curve is G's loss.

In practice: Instead of G minimizing log(1 − D(G(z))), we use the non-saturating loss: G maximizes log D(G(z)). This gives stronger gradients early in training when D easily rejects fakes.

Why the Non-Saturating Loss

Early in training, D easily rejects all fakes: D(G(z)) ≈ 0. D(G(z)) is the sigmoid of a logit a, and the gradient of log(1 − D(G(z))) with respect to a is −D(G(z)), which is nearly zero here, so G gets almost no learning signal. The gradient of −log(D(G(z))) with respect to a is −(1 − D(G(z))) ≈ −1, and with respect to D's output it is −1/D(G(z)), which is very large, so G gets a strong push to improve. Same equilibrium, much better gradient landscape.

G's loss (saturating): log(1 − D(G(z))) | G's loss (non-saturating): −log(D(G(z)))

The non-saturating loss is the default in virtually every modern GAN implementation. It's such a universal trick that many papers don't even mention it — it's just assumed.
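The difference between the two generator losses is easiest to see numerically. The sketch below assumes D's output is the sigmoid of a logit a and measures each gradient with respect to that logit; the function names are made up for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da log(1 - sigmoid(a)): the loss G minimizes in the original game."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da [-log(sigmoid(a))]: the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# Very negative logits mean D confidently rejects the fake.
for a in (-8.0, -4.0, 0.0):
    print(
        f"D(G(z))={sigmoid(a):.4f}  "
        f"saturating={grad_saturating(a):+.4f}  "
        f"non-saturating={grad_non_saturating(a):+.4f}"
    )
```

When D(G(z)) is near 0 the saturating gradient is close to zero while the non-saturating gradient stays close to −1; at D(G(z)) = 0.5 the two have the same magnitude.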

Check: At the Nash equilibrium of the GAN game, what does D output?

- Always 1 (everything is real)
- 0.5 for everything (can't distinguish real from fake)
- Always 0 (everything is fake)

🔨 Derivation The Optimal Discriminator is a Density Ratio

For a fixed generator G, the discriminator D is trained to maximize V(D,G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))].

Your task: Find the optimal D*(x) by taking the derivative of the integrand with respect to D(x) at each point x, setting it to zero, and solving. Show that D*(x) = p_data(x) / (p_data(x) + p_g(x)).

At any fixed x, the contribution to V is: p_data(x) · log(D(x)) + p_g(x) · log(1 − D(x)). This is because E_z[log(1-D(G(z)))] = E_{x~p_g}[log(1-D(x))] by the change of variables from z to G(z).

d/dy [a·log(y) + b·log(1-y)] = a/y − b/(1-y). Set this to zero: a/y = b/(1-y), so a(1-y) = by, giving y = a/(a+b).

With a = p_data(x) and b = p_g(x), we get D*(x) = p_data(x) / (p_data(x) + p_g(x)). Verify: when p_g = p_data, D* = 1/2 everywhere (Nash equilibrium).

Full derivation:

V(D,G) = ∫ p_data(x) log D(x) + p_g(x) log(1-D(x)) dx

For fixed G, maximize over D pointwise. At each x, we maximize f(y) = a·log(y) + b·log(1-y) where a = p_data(x), b = p_g(x), y = D(x).

f'(y) = a/y − b/(1-y) = 0

a(1-y) = by → a − ay = by → a = y(a+b) → y = a/(a+b)

Therefore: D*(x) = p_data(x) / (p_data(x) + p_g(x))

The key insight: The optimal discriminator is a function of the density ratio r(x) = p_data(x) / p_g(x), namely D* = r / (1 + r). It outputs the probability that a sample came from the real distribution rather than the generator, given an equal mix of the two. At equilibrium (p_g = p_data), D* = 1/2 everywhere: it truly cannot distinguish real from fake.
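The closed form can be sanity-checked by brute force. The sketch below maximizes the pointwise objective a·log(y) + b·log(1−y) over a fine grid and compares the result with a/(a+b); the density values 0.3 and 0.1 are arbitrary illustrative numbers.

```python
import math


def pointwise_value(y, a, b):
    """Contribution to V at one point x, with a = p_data(x), b = p_g(x), y = D(x)."""
    return a * math.log(y) + b * math.log(1 - y)


def best_d_by_grid(a, b, steps=100_000):
    """Brute-force argmax of pointwise_value over y in (0, 1)."""
    grid = (i / steps for i in range(1, steps))
    return max(grid, key=lambda y: pointwise_value(y, a, b))


a, b = 0.3, 0.1                      # illustrative densities at some x
closed_form = a / (a + b)            # D*(x) from the derivation
numeric = best_d_by_grid(a, b)

print(closed_form, numeric)
print(best_d_by_grid(0.2, 0.2))      # equal densities give D* = 1/2
```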

Checkpoint — Before you move on

Explain in your own words: why does G minimizing log(1 − D(G(z))) lead to the same equilibrium as G maximizing log(D(G(z))), but with different gradient behavior early in training?

Model Answer

Both losses have the same Nash equilibrium (p_g = p_data, D* = 1/2). But their gradient landscapes differ dramatically early in training:

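Saturating loss log(1-D(G(z))): When D(G(z)) ≈ 0 (D easily rejects fakes), the gradient is ∂/∂D(G(z)) [log(1-D(G(z)))] = −1/(1-D(G(z))) ≈ −1. Its magnitude stays bounded near 1, and once it is backpropagated through D's sigmoid it shrinks to about D(G(z)) ≈ 0, so G barely learns.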

Non-saturating loss −log(D(G(z))): When D(G(z)) ≈ 0, the gradient is ∂/∂D(G(z)) [−log(D(G(z)))] = −1/D(G(z)) → −∞. This is huge — G gets a massive push to improve.

Same destination, vastly different journey. The non-saturating loss is preferred because it gives G useful gradients when it needs them most (early training when fakes are bad).

Chapter 3: Training Dynamics

GANs are trained by alternating updates. One step for D, one step for G, repeat. On each D step, we show it a batch of real data (label: 1) and a batch of fake data from G (label: 0). On each G step, we generate fakes and update G to make D's output on them closer to 1.

This works — sometimes. GAN training is notoriously unstable. If D gets too good too fast, G gets no useful gradient signal. If G gets too good, D can't learn. The balance is delicate.

Step 1: Train D

D sees real (label 1) + fake (label 0). Update D only.

↓

Step 2: Train G

G generates fakes. D scores them. Update G to fool D.

↓ repeat

Training Loss Curves

Simulated D loss (teal) and G loss (orange). Healthy training: both losses oscillate around a stable value. Watch for divergence (D wins) or collapse (G wins too easily).

The Training Loop in Code

One iteration looks like this: (1) sample a real batch from your dataset, (2) sample z and generate a fake batch through G, (3) train D on both (real label=1, fake label=0) with binary cross-entropy, (4) sample new z, generate fakes, and train G to make D output 1. Two separate Adam optimizers — one for D, one for G — with learning rate balance being critical (typically 2e-4 for both, β₁=0.5).

python
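# One training iteration. Assumes G, D, opt_G, opt_D, dataloader and B already exist,
# and that D ends in a sigmoid (so its output is a probability).
import torch
import torch.nn.functional as F

real_batch = next(iter(dataloader))    # [B, 3, 64, 64]
z = torch.randn(B, 128)                # noise vector
fake_batch = G(z)                      # [B, 3, 64, 64]
ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)

# Train D: maximize log(D(real)) + log(1-D(fake))
opt_D.zero_grad()
loss_D = (F.binary_cross_entropy(D(real_batch), ones)
          + F.binary_cross_entropy(D(fake_batch.detach()), zeros))
loss_D.backward()
opt_D.step()

# Train G: maximize log(D(G(z)))
opt_G.zero_grad()
z2 = torch.randn(B, 128)
loss_G = F.binary_cross_entropy(D(G(z2)), ones)   # fool D
loss_G.backward()
opt_G.step()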

Warning signs: D loss goes to 0 (D is too strong, G can't learn). G loss oscillates wildly. Generated samples all look the same (mode collapse). GAN training requires babysitting.

Check: Why is GAN training unstable?

- The learning rate is always wrong
- The data is too noisy
- The two networks must stay in delicate balance: if one dominates, the other can't learn

💥 Break-It Lab What Dies When You Remove Components?

A working GAN has three essential ingredients: discriminator training, balanced D/G updates, and stochastic noise input z. Toggle each off and watch the loss curves collapse into pathology.

Remove D Training

Failure mode: D never learns, outputs random scores. G receives meaningless gradients — random noise as a learning signal. G's outputs are random noise forever. Without a critic, there is no learning signal. This is why the discriminator is the teacher.

Over-train D (10x steps)

Failure mode: D becomes perfect: D(real)=1, D(fake)=0 with certainty. The gradient of log(1-D(G(z))) at D(G(z))=0 vanishes — G receives zero gradient. D loss flatlines at 0, G loss flatlines at a high value. This is "vanishing gradients for G" — the discriminator got too strong.

Remove Noise Input z

Failure mode: Without stochastic input, G is a deterministic function with no source of randomness. It can only produce a single fixed output. Immediate mode collapse to one point. The noise vector z is the source of all diversity — it parameterizes the latent space of possible outputs.

💻 Build It Implement the GAN Training Loop

You've seen the alternating optimization. Now write the complete inner loop: one D step and one G step, including loss computation, gradient computation, and optimizer updates. Assume PyTorch with BCE loss.

signature
def train_step(G, D, opt_G, opt_D, real_batch, z_dim=128):
    """One GAN training iteration.

    Args:
        G: Generator network, takes [B, z_dim] -> [B, 3, 64, 64]
        D: Discriminator network, takes [B, 3, 64, 64] -> [B, 1]
        opt_G, opt_D: Adam optimizers
        real_batch: [B, 3, 64, 64] tensor of real images
        z_dim: latent dimension

    Returns:
        (d_loss, g_loss): scalar loss values
    """

Test case

Given real_batch of shape [32, 3, 64, 64], after one call:
- d_loss should be around 1.38 (= -log(0.5)*2, initial random D)
- g_loss should be around 0.69 (= -log(0.5), initial random D)
- G and D parameters should both be updated

When training D, we don't want gradients flowing back through G (we're only updating D). Use fake_batch.detach() or torch.no_grad() around G. When training G, we DO want gradients through G(z) into D, so don't detach.

python
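import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, real_batch, z_dim=128):
    B = real_batch.size(0)
    ones = torch.ones(B, 1).to(real_batch.device)
    zeros = torch.zeros(B, 1).to(real_batch.device)
    # BCEWithLogitsLoss applies the sigmoid internally, so D must return raw
    # logits here (no final sigmoid layer).
    criterion = nn.BCEWithLogitsLoss()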

    # ---- Train D ----
    opt_D.zero_grad()
    # Real
    d_real = D(real_batch)
    loss_real = criterion(d_real, ones)
    # Fake (detach! don't update G)
    z = torch.randn(B, z_dim).to(real_batch.device)
    fake = G(z).detach()
    d_fake = D(fake)
    loss_fake = criterion(d_fake, zeros)
    d_loss = loss_real + loss_fake
    d_loss.backward()
    opt_D.step()

    # ---- Train G ----
    opt_G.zero_grad()
    z2 = torch.randn(B, z_dim).to(real_batch.device)
    fake2 = G(z2)           # fresh fakes, no detach
    d_fake2 = D(fake2)
    g_loss = criterion(d_fake2, ones)  # non-saturating: fool D
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()

Bonus challenge: Modify this to use WGAN-GP loss: replace BCE with Wasserstein loss (E[D(fake)] - E[D(real)]) and add the gradient penalty. You'll need torch.autograd.grad for the penalty term.

🔗 Pattern Recognition

Adversarial Training = Self-Play

This Lesson (GANs)

G improves by playing against D. D improves by playing against G. Neither has a fixed target — the opponent IS the curriculum.

Reinforcement Learning

In self-play (AlphaGo, OpenAI Five), an agent improves by playing against copies of itself. The opponent's improving strength IS the curriculum. → RL Algorithms

The deep pattern: when you can't write an explicit loss function for "good behavior," create an adversary whose job is to find flaws. The protagonist learns by fixing what the adversary exploits. This is minimax optimization in both cases — the difference is whether the two players share parameters (self-play) or are separate networks (GANs).

Where else do you see adversarial training? Think about adversarial examples in robustness research, discriminators in domain adaptation, and critics in actor-critic RL.

Chapter 4: Mode Collapse

The generator's worst failure mode: it finds one thing that fools the discriminator and repeats it forever. If the real data has many modes (faces with glasses, without glasses, smiling, frowning), the generator might only learn to produce one type of face. This is mode collapse.

Why does it happen? The generator optimizes to fool D, not to cover all modes. If one particular output consistently gets high D scores, G concentrates there. D eventually catches on, but G just jumps to a new single mode — creating a cycle called mode hopping.

The minibatch tells the story. Sample a batch of 64 images from G. Compute pairwise L2 distances between all 64 images. In healthy training, the mean distance should match the diversity of real data. In mode collapse, the mean distance drops toward zero — all 64 images are nearly identical. This is a dead-simple diagnostic you can compute every 100 training steps.
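The pairwise-distance diagnostic takes only a few lines. The sketch below uses plain Python lists as stand-ins for flattened images; the 16-dimensional "images", the noise scale 0.01, and the function name `mean_pairwise_l2` are all illustrative assumptions rather than anything prescribed above.

```python
import math
import random
from itertools import combinations


def mean_pairwise_l2(batch):
    """Mean L2 distance over all unordered pairs in a batch of flat vectors."""
    pairs = list(combinations(batch, 2))
    return sum(math.dist(x, y) for x, y in pairs) / len(pairs)


random.seed(0)
dim = 16  # stand-in for a flattened image

# "Healthy" batch: 64 independent samples.
healthy = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

# "Collapsed" batch: 64 tiny perturbations of one sample.
base = [random.gauss(0, 1) for _ in range(dim)]
collapsed = [[v + random.gauss(0, 0.01) for v in base] for _ in range(64)]

print(mean_pairwise_l2(healthy))
print(mean_pairwise_l2(collapsed))
```

In practice the number to compare against is the same statistic computed on a batch of real images, since "healthy" diversity depends on the dataset.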

Mode Collapse Simulation

Teal clusters = real data modes. Orange dots = generator outputs. In healthy training, orange covers all clusters. In collapse, it concentrates on one.

Partial collapse is when some modes are covered but others are missing. Full collapse is when all generated samples are nearly identical. Both are common in vanilla GANs.

Detection and Fixes

Detection: Generate 1000 samples, compute pairwise distances. If the mean pairwise distance is near zero, or far below the same statistic on real data, you have collapse. You can also track the Fréchet Inception Distance (FID), which is sensitive to both quality and diversity, so collapse usually shows up as a worse (higher) FID. Individually sharp samples that are nearly identical to one another scream mode collapse.

Fixes: Spectral normalization constrains D's Lipschitz constant per-layer, preventing it from becoming too sharp. Progressive growing starts training at 4×4 resolution and slowly adds layers, giving G time to learn at each scale. Minibatch discrimination gives D statistics about the whole batch, so it can detect when all samples look identical.
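To make the spectral normalization fix concrete: the spectral norm of a weight matrix is its largest singular value, usually estimated by power iteration, and the layer's weights are divided by it. The sketch below does this for a tiny matrix in plain Python; the helper names and the 2×2 example are illustrative, and production code would rely on a library implementation instead.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    n = l2_norm(v)
    return [x / n for x in v]


def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W by power iteration."""
    W_t = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = normalize(matvec(W, v))      # left singular vector estimate
        v = normalize(matvec(W_t, u))    # right singular vector estimate
    return l2_norm(matvec(W, v))


W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]   # normalized weights

print(round(sigma, 4), round(spectral_norm(W_sn), 4))  # 3.0 1.0
```

After normalization the layer cannot stretch any input direction by more than a factor of 1, which is what bounds its Lipschitz constant.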

Symptom	What you see
Full collapse	All generated samples look identical
Partial collapse	Some categories of data are never generated
Mode hopping	Generator cycles between modes without covering all at once

Check: What is mode collapse?

- The generator produces only one or a few types of output, ignoring the diversity of real data
- The discriminator can't learn
- The training loss becomes zero

⚔ Adversarial: Your GAN generates only 3 distinct faces regardless of the input z

You're training a face GAN on CelebA (200K images, huge diversity). After 50K iterations, you sample 1000 images with different z vectors. Visual inspection shows only 3 distinct faces (with minor pixel-level noise variations). The D loss is 0.6, G loss is 0.7. The FID score is 45 (reasonable quality). What is happening, and what structural fix addresses it?

- The model is underfitting: increase capacity of G
- Mode collapse: add minibatch discrimination or unrolled GAN steps so D can detect lack of diversity
- Overfitting: the model memorized 3 training images
- The latent dimension is too small: increase z_dim

🔨 Derivation Why JS Divergence Rewards Mode Collapse

The original GAN loss, when the optimal D* is substituted back, equals 2·JSD(p_data || p_g) − log 4. The JS divergence is symmetric and bounded: 0 ≤ JSD ≤ log 2.
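The identity V(D*, G) = 2·JSD(p_data || p_g) − log 4 can be checked numerically on small discrete distributions. The three-bin distributions below are arbitrary illustrative values, and the function names are made up for this sketch.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass bins of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_d(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1 - d_star)
    return total


p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2 * jsd(p_data, p_g) - math.log(4)
print(lhs, rhs)  # the two sides agree
```

When p_g equals p_data the JSD term vanishes and the value is exactly −log 4, the game's equilibrium value.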

Your task: Show that a generator covering only ONE mode of a multi-modal p_data can achieve a low JSD, and explain why this makes mode collapse a local minimum of the GAN objective.

JSD(P||Q) = (KL(P||M) + KL(Q||M))/2 where M = (P+Q)/2. It penalizes both: (1) places where P has mass but Q doesn't, and (2) places where Q has mass but P doesn't. But these two penalties are averaged — a generator covering one mode perfectly has zero penalty on that mode.

Strategy A: G spreads thin across all modes (imperfect coverage of each). Strategy B: G perfectly covers one mode, ignores others. JSD is bounded by log 2. Under Strategy B, Q and M overlap on the covered mode, so KL(Q||M) stays small (log(4/3) ≈ 0.288 in the two-mode example that follows), and the penalty from uncovered modes is bounded. Strategy A might have higher KL everywhere because imperfect coverage of each mode means Q deviates from P at every point.

G is trained by gradient descent. If G currently covers one mode well, the gradient to "spread out" to another mode requires crossing a region where p_data = 0. In this region, D* = p_data/(p_data + p_g) → 0, so any mass G places there is confidently rejected and the non-saturating loss −log D* → +∞. The loss surface has a steep local minimum around each mode.

The argument:

Consider p_data = 0.5·δ(x−a) + 0.5·δ(x−b) (two modes at a and b).

Case 1: p_g = δ(x−a) (covers one mode perfectly). Then M = 0.75·δ(x−a) + 0.25·δ(x−b). KL(p_data||M) = 0.5·log(0.5/0.75) + 0.5·log(0.5/0.25) = 0.5·(−0.405 + 0.693) = 0.144. KL(p_g||M) = log(1/0.75) = 0.288. JSD = (0.144 + 0.288)/2 = 0.216.

Case 2: p_g = 0.5·δ(x−a) + 0.5·δ(x−b) (covers both perfectly). JSD = 0.

So the global optimum (Case 2) has JSD=0, but Case 1 (mode collapse) has JSD=0.216, which is a local minimum. The problem: to go from Case 1 to Case 2, G must move probability mass through regions of zero p_data, where the gradient signal from D is adversarial (D says "that's fake!" in empty regions). The loss landscape has a valley around each mode.

The key insight: JSD doesn't penalize mode collapse as harshly as you'd expect because it averages two KL terms. Covering one mode perfectly (half the job) gets you to JSD = 0.216 out of max 0.693 — already 69% of the way to optimal. The last 31% requires crossing adversarial territory, making mode collapse a stable attractor.
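The numbers in the two-mode argument can be reproduced with a short script. The sketch represents each delta mixture as a list of bin masses; the third bin and the "disjoint" generator are additions for illustration, included only to show where the log 2 ceiling comes from.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass bins of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Bins: [mode a, mode b, empty region]
p_data = [0.5, 0.5, 0.0]
collapsed = [1.0, 0.0, 0.0]   # Case 1: all mass on mode a
perfect = [0.5, 0.5, 0.0]     # Case 2: both modes covered
disjoint = [0.0, 0.0, 1.0]    # all mass where p_data is zero

jsd_collapsed = jsd(p_data, collapsed)
jsd_perfect = jsd(p_data, perfect)
jsd_disjoint = jsd(p_data, disjoint)

print(round(jsd_collapsed, 3))   # 0.216
print(jsd_perfect)               # 0.0
print(round(jsd_disjoint, 3))    # 0.693
```

The collapsed generator sits much closer to the optimum than to the worst case, which is the quantitative content of the argument above.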

Chapter 5: WGAN & Stabilization

The original GAN loss has a fundamental problem: when D is perfect, the gradients for G vanish. The Wasserstein GAN (WGAN) fixes this by replacing the JS divergence with the Wasserstein distance (Earth Mover's Distance) — a metric that provides useful gradients even when distributions don't overlap.

The WGAN critic (no longer called a "discriminator" since it doesn't classify) outputs an unbounded score, not a probability. No sigmoid on the final layer. The score difference between real and fake data estimates the Wasserstein distance. But the critic must satisfy a Lipschitz constraint: its output can't change too fast as the input changes. Without this constraint, the critic could assign arbitrarily large scores, making the optimization meaningless.

W(p_r, p_g) = sup_{||f||_L≤1} E[f(x)] − E[f(G(z))]

Wasserstein vs JS Distance

Two 1D distributions. Move them apart. Wasserstein provides a smooth gradient everywhere. JS divergence saturates when distributions don't overlap.
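The contrast can be reproduced on a discrete 1D grid. The sketch below slides one box-shaped histogram away from another and computes both quantities; the box width, the grid size, and the use of the CDF-difference formula for the 1D Wasserstein distance are choices made for this illustration.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def box(shift, width=5, n_bins=40):
    """Uniform histogram over bins [shift, shift + width)."""
    return [1.0 / width if shift <= i < shift + width else 0.0 for i in range(n_bins)]


def wasserstein_1d(p, q):
    """W1 between histograms on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total = 0.0
    cdf_gap = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


p = box(0)
for shift in (0, 2, 5, 10, 20):
    q = box(shift)
    print(shift, round(wasserstein_1d(p, q), 3), round(jsd(p, q), 3))
```

Once the boxes stop overlapping (shift of 5 or more), JSD is stuck at log 2 ≈ 0.693 no matter how far apart they are, while the Wasserstein distance keeps growing in proportion to the shift.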

Technique	How	Purpose
Weight clipping	Clamp D weights to [-c, c]	Enforce Lipschitz (crude)
Gradient penalty (GP)	Penalize ||∇D|| ≠ 1	Better Lipschitz enforcement
Spectral normalization	Normalize weight matrices by spectral norm	Control Lipschitz constant per-layer
R1 penalty	Penalize ||∇D(x_real)||²	Stabilize D on real data

Gradient Penalty: The Implementation

Weight clipping is crude — it pushes weights to the edges of [−c, c], wasting capacity. The gradient penalty (WGAN-GP) is much better. It samples random interpolations between real and fake data, then penalizes deviations from unit gradient norm:

L_GP = λ · E[(||∇_x D(x̂)||₂ − 1)²] where x̂ = α · x_real + (1−α) · x_fake, α ~ U(0,1)

This enforces the 1-Lipschitz constraint: D's output can change by at most 1 unit per unit change in input. The interpolation trick samples points between the real and fake distributions, exactly where the gradient matters most. Typical λ = 10. The total critic loss becomes: E[D(fake)] - E[D(real)] + L_GP (λ is already folded into L_GP).
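A minimal PyTorch sketch of the penalty term follows. The function name `gradient_penalty`, the toy linear critic, and the 8×8 image size are assumptions made to keep the example small; only `torch.autograd.grad` and the formula itself come from the text above.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake, lam=10.0):
    """lam * E[(||grad_x critic(x_hat)||_2 - 1)^2] on random interpolates."""
    B = real.size(0)
    alpha = torch.rand(B, 1, 1, 1, device=real.device)        # one alpha per sample
    x_hat = (alpha * real + (1 - alpha) * fake.detach()).requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True so the penalty itself can be backpropagated
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)               # per-sample gradient norm
    return lam * ((grad_norm - 1) ** 2).mean()


# Toy critic (hypothetical): no sigmoid, unbounded score.
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
real = torch.randn(4, 3, 8, 8)
fake = torch.randn(4, 3, 8, 8)

gp = gradient_penalty(critic, real, fake)
critic_loss = critic(fake).mean() - critic(real).mean() + gp
critic_loss.backward()
print(gp.item())
```

For this linear critic the input gradient is the weight vector itself, so the penalty reduces to 10·(||w|| − 1)², which makes the function easy to verify by hand.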

Impact: WGAN + gradient penalty made GAN training dramatically more stable. It reduced mode collapse and made the loss curves meaningful: a smaller estimated Wasserstein distance, E[D(real)] − E[D(fake)], actually correlates with better image quality.

Check: Why is Wasserstein distance better than JS divergence for GANs?

- It provides useful gradients even when distributions don't overlap
- It's faster to compute
- It doesn't require a discriminator

🔨 Derivation From Earth Mover's Distance to the 1-Lipschitz Constraint

The Wasserstein-1 distance is: W(P, Q) = inf_{γ ∈ Π(P,Q)} E_(x,y)~γ[||x − y||] where Π(P,Q) is the set of all joint distributions with marginals P and Q.

Your task: Using the Kantorovich-Rubinstein duality, show that W(P, Q) = sup_{||f||_L≤1} E_x~P[f(x)] − E_y~Q[f(y)], and explain why the supremum must be taken over 1-Lipschitz functions (not arbitrary functions).

A function f is 1-Lipschitz if |f(x) − f(y)| ≤ ||x − y|| for all x, y. The function's output can change by at most 1 unit per unit change in input. This bounds its gradient: ||∇f|| ≤ 1 wherever f is differentiable.

If f is unrestricted, we could choose f(x) = +∞ on the support of P and f(x) = −∞ on the support of Q. The difference E[f(x)] − E[f(y)] would be infinite. The Lipschitz constraint prevents f from being too "steep" — it can only assign high scores to P and low scores to Q at a rate proportional to the actual distance between the distributions.

Think of it this way: the primal (transport plan γ) asks "what's the cheapest way to move mass from P to Q?" The dual (Lipschitz function f) asks "what's the maximum profit a 1-Lipschitz pricing function can extract from the difference between P and Q?" Strong duality tells us these are equal. The critic in WGAN is learning this optimal pricing function f.

The Kantorovich-Rubinstein duality (sketch):

The primal problem: W(P,Q) = min_γ ∫∫ ||x-y|| dγ(x,y) subject to γ having marginals P and Q.

This is a linear program (linear objective, linear constraints on marginals). By LP duality, the dual is: W(P,Q) = max_f,g E_P[f(x)] + E_Q[g(y)] subject to f(x) + g(y) ≤ ||x-y|| for all x,y.

The constraint f(x) + g(y) ≤ ||x-y|| implies (setting y=x): f(x) + g(x) ≤ 0 for all x. The optimal solution has g = −f (complementary slackness), so the constraint becomes: f(x) − f(y) ≤ ||x-y||, i.e., f is 1-Lipschitz.

Substituting g = −f: W(P,Q) = max_{||f||_L≤1} E_P[f(x)] − E_Q[f(y)]

The key insight: The WGAN critic approximates this optimal f using a neural network. The 1-Lipschitz constraint is enforced via weight clipping (crude), gradient penalty (better), or spectral normalization (best). Without the constraint, the critic could assign ±∞ scores and the optimization would be meaningless. The Lipschitz bound makes the critic's "opinions" proportional to actual distributional distance.

⚔ Adversarial: Your WGAN critic loss is -500 and keeps decreasing

You're training a WGAN-GP. After 10K iterations, the critic loss (E[D(fake)] - E[D(real)]) is -500 and steadily decreasing. The gradient penalty is 0.01 (nearly zero). Generated images look like noise. What went wrong?

- The learning rate is too high
- Mode collapse is occurring
- The gradient penalty lambda is too low: the critic isn't 1-Lipschitz, so "Wasserstein distance" is meaningless
- The generator is too small

Chapter 6: StyleGAN

StyleGAN (Karras et al., 2019) revolutionized image generation with two key ideas: style mixing (inject different style vectors at different layers to control coarse vs fine features) and progressive growing (start training at low resolution and gradually increase), a training schedule inherited from Progressive GAN. The result: photorealistic 1024×1024 faces that fooled humans ~50% of the time.

The generator architecture uses a mapping network (8-layer MLP) to transform z into a style vector w, which is then injected into each layer via adaptive instance normalization (AdaIN). Different layers control different scales: early layers = pose, face shape; later layers = hair color, fine texture.
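Adaptive instance normalization is compact enough to sketch directly: normalize each feature map per sample and per channel, then apply a scale and bias predicted from w. The module below is an illustrative reading, not StyleGAN's actual code; the class name, the single linear layer producing both scale and bias, and the `1 + scale` parameterization are assumptions.

```python
import torch
import torch.nn as nn


class AdaIN(nn.Module):
    def __init__(self, w_dim, channels):
        super().__init__()
        # Per-sample, per-channel normalization with no learned affine terms
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # The style vector w predicts one scale and one bias per channel
        self.to_scale_bias = nn.Linear(w_dim, 2 * channels)

    def forward(self, x, w):
        scale, bias = self.to_scale_bias(w).chunk(2, dim=1)   # each [B, C]
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias


layer = AdaIN(w_dim=512, channels=64)
x = torch.randn(2, 64, 8, 8) * 3 + 5    # features with arbitrary statistics
w = torch.randn(2, 512)                 # style vectors from the mapping network
out = layer(x, w)
print(out.shape)
```

Because the normalization wipes out the incoming per-channel mean and variance, the statistics of each layer's output are set by w alone, which is what lets different layers carry different styles.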

Noise z

512-dim random vector

↓ Mapping Network (8 MLP layers)

Style w

512-dim disentangled style

↓ Inject via AdaIN at each layer

Synthesis Network

4×4 → 8×8 → ... → 1024×1024

Style Mixing Simulator

Two style vectors: Style A and Style B. The crossover point determines which layers use A vs B. Early layers = structure, late layers = details.
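The crossover logic amounts to choosing, per synthesis layer, which style to feed in. In the sketch below the styles are string labels standing in for 512-dim w vectors, and the 18-layer count assumes a 1024×1024 synthesis network with two layers per resolution; the function name `mix_styles` is made up.

```python
def mix_styles(style_a, style_b, crossover, num_layers=18):
    """Per-layer style list: A before the crossover layer, B from it onward."""
    return [style_a if layer < crossover else style_b for layer in range(num_layers)]


per_layer = mix_styles("A", "B", crossover=4)
print("".join(per_layer))  # AAAABBBBBBBBBBBBBB
```

With a crossover of 4, the coarse layers (structure) take Style A and all later layers (details) take Style B; moving the crossover later hands more of the image over to A.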

StyleGAN by the Numbers

In StyleGAN2 (1024×1024 faces): z is 512-dim, the mapping network is 8 fully-connected layers (512→512 each with LeakyReLU), producing w (also 512-dim). The synthesis network has 18 layers (2 per resolution from 4×4 to 1024×1024). Style w is injected at every layer via weight demodulation (StyleGAN2 replaced AdaIN). Total parameters: about 2M for the mapping network (8 × 512 × 512 ≈ 2.1M weights), with the bulk of the generator's roughly 30M parameters in the synthesis network.

Why a mapping network? The raw z space is entangled: moving in one direction changes multiple features. The mapping network learns a less entangled w space where directions correspond more closely to single semantic attributes (smile, age, glasses).

Check: In StyleGAN, early layers of the synthesis network control...

- Coarse features like pose and face shape
- Fine details like skin texture
- The background only

🏗 Design Challenge You're the Architect: 1024×1024 Face Generation

Your team needs a GAN that generates photorealistic 1024×1024 faces for a synthetic data company. The model must produce diverse outputs (no mode collapse), train stably on 8 A100 GPUs in under 2 weeks, and support attribute control (age, pose, expression).

Resolution

1024×1024 RGB

Hardware

8× A100 (80GB each)

Training budget

14 days max

Dataset

FFHQ (70K images, 1024×1024)

Diversity requirement

FID < 5, all attributes represented

Inference latency

<200ms per image

1. Progressive growing (start 4×4, add layers) vs direct 1024×1024 training: which do you choose and why?

2. Discriminator: PatchGAN, standard CNN, or multi-scale? What stabilization (GP, spectral norm, R1)?

3. How do you enable attribute control? Conditional labels, disentangled latent space, or both?

4. What's your training stability strategy? How do you detect and recover from mode collapse mid-training?

The answer is StyleGAN2/3:

Architecture: Skip progressive growing (it causes phase artifacts). Use StyleGAN2's direct training with skip connections and residual discriminator. 8-layer mapping network (z→w), weight demodulation instead of AdaIN, path length regularization for smooth latent space.

Discriminator: Multi-scale residual CNN with R1 regularization (λ=10, applied every 16 steps for efficiency). R1 is simpler than GP and works better in practice: just penalize ||∇D(x_real)||². No spectral norm needed with R1.

Attribute control: The disentangled W space gives this for free. After training, find linear directions in W space for each attribute (age, pose, smile) using labeled probes. No conditional training needed — the mapping network naturally disentangles attributes.

Stability: Exponential moving average of G weights (for inference), lazy regularization (R1 every 16 steps, path length penalty every 4 steps), learning rate = 2.5e-3 for G, 2.5e-3 for D with equalized learning rate. Monitor FID every 5K steps, track PPL (perceptual path length) for smoothness. On 8 A100s with batch size 32: ~7 days to convergence at 25M images seen.
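The R1 penalty and its lazy schedule can be sketched in a few lines of PyTorch. The function names, the toy linear discriminator, and the convention of multiplying the lazily applied term by the interval are assumptions for illustration; the γ/2 factor follows the usual statement of R1.

```python
import torch
import torch.nn as nn

R1_INTERVAL = 16  # apply the penalty every 16 discriminator steps


def r1_penalty(D, real, gamma=10.0):
    """(gamma / 2) * E[||grad_x D(x_real)||^2], evaluated on real images only."""
    real = real.detach().requires_grad_(True)
    scores = D(real)
    grads, = torch.autograd.grad(scores.sum(), real, create_graph=True)
    return (gamma / 2) * grads.flatten(1).pow(2).sum(dim=1).mean()


def lazy_r1(D, real, step):
    """Return the scaled penalty on regularization steps, otherwise None."""
    if step % R1_INTERVAL != 0:
        return None
    # Scale by the interval so the average strength matches applying it every step
    return r1_penalty(D, real) * R1_INTERVAL


# Toy discriminator (hypothetical) returning raw logits.
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))
real = torch.randn(4, 3, 8, 8)

penalty = r1_penalty(D, real)
print(penalty.item(), lazy_r1(D, real, step=1), lazy_r1(D, real, step=16).item())
```

Unlike the WGAN gradient penalty, there is no interpolation and no target norm of 1: the gradient on real data is simply pushed toward zero.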

Chapter 7: Conditional GANs

Vanilla GANs generate random samples. Conditional GANs generate samples conditioned on some input: a class label, a text prompt, or even another image. The condition is fed to both G and D.

pix2pix translates one image to another (sketch → photo, satellite → map). CycleGAN does unpaired translation (horses ↔ zebras) using a cycle consistency loss. The PatchGAN discriminator classifies each N×N patch as real or fake, rather than the whole image.

Condition c

Class label, text, source image, ...

↓

G(z, c)

Generate sample matching condition

↓

D(x, c)

Real sample matching condition? Or fake?

PatchGAN Discriminator

Instead of one real/fake score for the whole image, PatchGAN gives a score per patch. Green = real, red = fake. This preserves high-frequency details.
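How large a patch each score "sees" follows from the stack of convolutions. The sketch below assumes the commonly used five-layer layout (4×4 kernels, strides 2, 2, 2, 1, 1, padding 1) and a 256×256 input; both are assumptions, since the text above does not fix an architecture.

```python
# (kernel, stride) per conv layer, ordered from input to output (assumed layout)
LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers):
    """Input pixels seen by one output unit, computed from the last layer back."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(size, layers, padding=1):
    """Side length of the score map for a square input."""
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


print(receptive_field(LAYERS))    # 70
print(output_size(256, LAYERS))   # 30
```

Under these assumptions a 256×256 image yields a 30×30 grid of scores, each judging a 70×70 patch, and the patch-level losses are averaged into the discriminator loss.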

Model	Task	Key trick
pix2pix	Paired image translation	L1 loss + PatchGAN
CycleGAN	Unpaired image translation	Cycle consistency loss
SPADE	Semantic → photo	Spatially-adaptive normalization
GauGAN	Landscape painting tool	SPADE + style encoder

CycleGAN's insight: If you translate horse → zebra → horse, you should get back the original horse. This "cycle consistency" constraint makes unpaired translation possible.
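The cycle consistency term is an L1 reconstruction loss applied in both directions. The toy sketch below replaces the two generators with simple arithmetic functions on lists of pixel values, purely to show the shape of the loss; `G_toy`, `F_good`, and `F_bad` are hypothetical stand-ins, not real translation networks.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(x, y, G, F):
    """L1(F(G(x)), x) + L1(G(F(y)), y): translate, translate back, compare."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)


def G_toy(img):      # stand-in for the "horse -> zebra" generator
    return [2 * p for p in img]


def F_good(img):     # stand-in "zebra -> horse" generator that inverts G_toy
    return [p / 2 for p in img]


def F_bad(img):      # stand-in that does not invert G_toy
    return [p / 2 + 0.1 for p in img]


x = [0.1, 0.5, 0.9]  # "horse" pixels
y = [0.2, 1.0, 1.8]  # "zebra" pixels

print(cycle_loss(x, y, G_toy, F_good))   # 0.0
print(cycle_loss(x, y, G_toy, F_bad))    # about 0.3
```

In CycleGAN this term is added to the two adversarial losses, so each generator must both fool its discriminator and remain invertible by its partner.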

How Conditioning Works

In a class-conditional GAN, the condition (e.g., "cat" = class 5) is embedded into a vector and concatenated with z before entering G. In D, the class embedding is concatenated with the feature vector before the final classification. For image-conditional GANs like pix2pix, the source image is literally concatenated channel-wise with the input: D receives a 6-channel input (3 from the source, 3 from the generated/real target).

G: (z, c) → image D: (image, c) → real/fake

The projection discriminator (Miyato & Koyama, 2018) is more elegant: instead of concatenation, it takes the inner product between the class embedding and D's feature vector. This gives a per-class real/fake score and trains more stably than concatenation, especially with many classes.
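The projection idea fits in a small output head. The sketch below assumes D has already reduced the image to a feature vector; the class name `ProjectionHead`, the feature size 512, and the 10 classes are illustrative choices.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)                # class-independent real/fake term
        self.embed = nn.Embedding(num_classes, feat_dim)    # one embedding per class

    def forward(self, features, labels):
        uncond = self.linear(features)                                  # [B, 1]
        # Inner product between the class embedding and D's feature vector
        cond = (self.embed(labels) * features).sum(dim=1, keepdim=True) # [B, 1]
        return uncond + cond


head = ProjectionHead(feat_dim=512, num_classes=10)
features = torch.randn(4, 512)            # stand-in for D's pooled features
labels = torch.tensor([0, 1, 2, 3])
logits = head(features, labels)
print(logits.shape)
```

The same features produce a different logit for each label, so the head scores "is this a realistic sample of class c" rather than just "is this realistic".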

Check: What makes a PatchGAN discriminator different from a regular one?

- It classifies each local patch independently rather than the whole image
- It uses patches of real data for training
- It works on smaller images

Chapter 8: GANs Today

For pure image quality and diversity, diffusion models have overtaken GANs. Models like Stable Diffusion, DALL-E 3, and Imagen produce higher-quality, more diverse outputs with more stable training. GANs had their era — roughly 2014-2021 — as the undisputed kings of generative modeling.

But GANs aren't dead. They remain dominant where speed matters: real-time face filters (Snapchat, TikTok), game asset generation, super-resolution (Real-ESRGAN), video prediction, and fast inference. A GAN generates an image in one forward pass; diffusion models need 20–50 denoising steps. For applications requiring <100ms latency, GANs are still the only game in town.

GAN vs Diffusion: Tradeoff Space

Quality vs speed. GANs are fast but less diverse. Diffusion is slow but higher quality. Hybrid approaches try to get both.

Dimension	GANs	Diffusion
Speed	1 forward pass (~50ms)	20-50 steps (~seconds)
Quality	Good but mode collapse risk	Excellent diversity
Training	Unstable, requires tricks	Stable, simple loss
Control	Conditional, but limited	Text guidance, inpainting, etc.
Still used for	Real-time apps, super-resolution	Everything else

The Speed Gap

A StyleGAN2 generator produces a 1024×1024 image in ~50ms on a single GPU. Stable Diffusion takes ~3 seconds (50 steps at ~60ms each) for a 512×512 image. That's a 60x speed gap. For real-time applications (face filters at 30fps, game asset generation, live video), GANs remain the only viable option. Diffusion models need consistency distillation or adversarial distillation to close this gap.

Distillation: A growing trend is distilling diffusion models into few-step generators, sometimes with a GAN-style discriminator: train a student to match the teacher diffusion model's output in a single step or a handful of steps. This aims for GAN speed with diffusion quality. SDXL-Turbo does this with adversarial diffusion distillation, and LCM uses consistency distillation; both achieve near-diffusion quality in 1–4 steps.

🔗 Pattern Recognition

Noise → Image: Three Different Roads

GANs (This Lesson)

z ~ N(0,I) → G(z) → image. One-shot transformation. Implicit density (never computes p(x)). Trained via adversarial game.

Diffusion Models

x_T ~ N(0,I) → denoise → x_T-1 → ... → x₀. Iterative refinement (50 steps). Explicit density. Trained via denoising score matching. → Diffusion

GANs (This Lesson)

No encoder. Cannot compute p(x) or reconstruct. Generator is a one-way mapping from noise to data.

VAEs

Encoder q(z|x) + decoder p(x|z). Explicit ELBO objective. Can reconstruct and interpolate. Trained via variational inference. → VAE & VQ-VAE

All three start from noise and produce images. The difference is the mapping: GANs learn it in one shot (fast but unstable), diffusion iterates (slow but stable), VAEs encode/decode (fast but blurry). Modern approaches combine them: diffusion backbone with GAN-based distillation for speed, or VAE encoder with GAN discriminator for sharpness.

The adversarial loss appears in many non-GAN models (e.g., perceptual loss in super-resolution, discriminator in VAE-GAN). Why does adding a discriminator to any generative model tend to sharpen outputs?

"The most interesting idea in the last 10 years in ML."

— Yann LeCun, on GANs (2016)

You now understand the adversarial game, its dynamics, its failures, and how it was stabilized. GANs may no longer be the state of the art, but their ideas echo in every generative model.

Understand Generative
Adversarial Networks

Chapter 0: The Adversarial Idea

Chapter 1: Generator & Discriminator

Concrete Data Flow (64×64 Image GAN)

Chapter 2: The Min-Max Game

Why the Non-Saturating Loss

Chapter 3: Training Dynamics

The Training Loop in Code

Chapter 4: Mode Collapse

Detection and Fixes

Chapter 5: WGAN & Stabilization

Gradient Penalty: The Implementation

Chapter 6: StyleGAN

StyleGAN by the Numbers

Chapter 7: Conditional GANs

How Conditioning Works

Chapter 8: GANs Today

The Speed Gap
