The Complete Beginner's Path

Understand Generative Adversarial Networks

Two neural networks locked in a game that conjures photorealistic images from pure noise. The idea that launched a thousand deepfakes.

Prerequisites: Neural network basics + What a loss function is. That's it.
9 Chapters · 8+ Interactives · 0 Assumed Knowledge

Chapter 0: The Adversarial Idea

Imagine a counterfeiter (the generator G) trying to produce fake banknotes, and a detective (the discriminator D) trying to spot them. The counterfeiter gets better by studying what the detective catches. The detective gets better by studying fakes that slip through.

Over time, both improve. Eventually, the counterfeiter's fakes are indistinguishable from real banknotes. That's the GAN idea: two networks competing until the generator produces data so realistic that no discriminator can tell the difference.

The core insight: You don't need to define "what makes a good image" explicitly. You just need a discriminator that can tell real from fake. The generator learns quality implicitly by fooling the discriminator.
Counterfeiter vs Detective

Watch the game unfold. Teal = real data distribution. Orange = generator's current output. Purple line = discriminator boundary. Click "Train Step" to advance.

Check: What are the two players in a GAN?

Chapter 1: Generator & Discriminator

The generator takes random noise z (sampled from a simple distribution like a Gaussian) and transforms it into a data sample (e.g., an image). It's a neural network mapping from noise space to data space: G(z) → fake image.

The discriminator takes a data sample (real or generated) and outputs a single number: the probability that the input is real. It's a binary classifier: D(x) → [0, 1]. Architecturally, it's typically a CNN that downsamples the image through strided convolutions until a final dense layer produces the scalar output.

Random Noise z
z ~ N(0, I), typically 128-512 dims
↓ Generator G
Fake Image G(z)
Same dimensions as real data
↓ Discriminator D
Real or Fake?
D(x) ∈ [0, 1]
Noise to Data Mapping

The generator learns a function from noise (left, uniform dots) to data (right, structured distribution). Each orange dot is a noise sample mapped through G. Click "Retrain" to see a different mapping.

Component | Input | Output | Goal
Generator G | Random noise z | Fake data G(z) | Fool D into saying "real"
Discriminator D | Data sample x | P(real) ∈ [0,1] | Correctly classify real vs fake

Concrete Data Flow (64×64 Image GAN)

Let's trace the exact tensor shapes through a DCGAN generating 64×64 RGB images:

Generator path: z [B, 128] sampled from N(0, I) → Dense(128, 4×4×512) → reshape to [B, 512, 4, 4] → TransConv(512→256) → [B, 256, 8, 8] → TransConv(256→128) → [B, 128, 16, 16] → TransConv(128→64) → [B, 64, 32, 32] → TransConv(64→3) → [B, 3, 64, 64] → tanh → pixels in [−1, 1]. Each transposed convolution doubles spatial resolution: 4→8→16→32→64.
Discriminator path: Image [B, 3, 64, 64] → Conv(3→64) → [B, 64, 32, 32] → Conv(64→128) → [B, 128, 16, 16] → Conv(128→256) → [B, 256, 8, 8] → Conv(256→512) → [B, 512, 4, 4] → flatten → Dense → [B, 1] → sigmoid → real/fake probability. The discriminator is the mirror image of the generator.

Each hidden layer in the generator uses: TransposedConv2d (stride=2, padding=1, kernel=4) → BatchNorm → ReLU. Each hidden layer in the discriminator uses: Conv2d (stride=2, padding=1, kernel=4) → BatchNorm → LeakyReLU(0.2). Following DCGAN, BatchNorm is left off the generator's output layer and the discriminator's first layer. The final generator layer uses tanh (output range [−1, 1]), and the final discriminator layer uses sigmoid (output range [0, 1]). Real images are normalized to [−1, 1] to match.
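
The doubling and halving in the two paths follow from the standard output-size formulas for strided and transposed convolutions. The sketch below is only a sanity check of that arithmetic, assuming kernel=4, stride=2, padding=1, no output padding, and dilation 1; the helper names `conv_out` and `tconv_out` are made up for this example.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a strided convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Generator: four transposed convolutions starting from the 4x4 seed.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(tconv_out(g_sizes[-1]))

# Discriminator: four strided convolutions starting from the 64x64 image.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out(d_sizes[-1]))

print(g_sizes)  # [4, 8, 16, 32, 64]
print(d_sizes)  # [64, 32, 16, 8, 4]
```

With these settings each transposed convolution exactly doubles the side length and each convolution exactly halves it, which is why the two shape traces mirror each other.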

Check: What does the generator take as input?

Chapter 2: The Min-Max Game

The GAN training objective is a minimax game. The discriminator tries to maximize V(D,G), the generator tries to minimize it. At the Nash equilibrium, the generator produces perfect fakes and the discriminator outputs 0.5 for everything ("I can't tell").

min_G max_D V(D,G) = E_{x~pdata}[log D(x)] + E_{z~pz}[log(1 − D(G(z)))]

Breaking this down: D wants to maximize log D(x) (output 1 for real data) AND maximize log(1 − D(G(z))) (output 0 for fakes). G wants to minimize log(1 − D(G(z))) (make D output 1 for fakes).

D's perspective: Maximize V. "I want D(real) = 1 and D(fake) = 0." This is just binary cross-entropy.
G's perspective: Minimize V. "I want D(G(z)) = 1." Make the discriminator think fakes are real.
Min-Max Landscape

D(G(z)) is the discriminator's output on fake data. See how the loss changes for each player. The teal curve is D's loss; the orange curve is G's loss.

In practice: Instead of G minimizing log(1 − D(G(z))), we use the non-saturating loss: G maximizes log D(G(z)). This gives stronger gradients early in training when D easily rejects fakes.

Why the Non-Saturating Loss

Early in training, D easily rejects all fakes: D(G(z)) ≈ 0. Measured at D's pre-sigmoid logit a, with D = σ(a), the gradient of log(1 − D(G(z))) is −D(G(z)) ≈ 0, so G gets almost no learning signal. The gradient of −log(D(G(z))) at the same logit is −(1 − D(G(z))) ≈ −1, so G still gets a strong push to improve. Same equilibrium, much better gradient landscape.

G's loss (saturating): log(1 − D(G(z)))   |   G's loss (non-saturating): −log(D(G(z)))

The non-saturating loss is the default in virtually every modern GAN implementation. It's such a universal trick that many papers don't even mention it — it's just assumed.
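
To make the difference concrete, here is a small sketch that evaluates both generator gradients at D's pre-sigmoid logit a, where D(G(z)) = σ(a). The closed forms follow from the chain rule; the function names are illustrative and not taken from any library.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss G minimizes in the minimax game."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# Very negative logits mean D confidently rejects the fake.
for a in (-8.0, -4.0, 0.0):
    print(
        f"D(G(z))={sigmoid(a):.4f}  "
        f"saturating={grad_saturating(a):+.4f}  "
        f"non-saturating={grad_non_saturating(a):+.4f}"
    )
```

When D is confident (a very negative), the saturating gradient is close to zero while the non-saturating gradient stays close to −1. At a = 0 the two coincide, which fits the claim that they share an equilibrium.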

Check: At the Nash equilibrium of the GAN game, what does D output?
🔨 Derivation The Optimal Discriminator is a Density Ratio

For a fixed generator G, the discriminator D is trained to maximize V(D,G) = E_{x~pdata}[log D(x)] + E_{z~pz}[log(1 − D(G(z)))].

Your task: Find the optimal D*(x) by taking the derivative of the integrand with respect to D(x) at each point x, setting it to zero, and solving. Show that D*(x) = pdata(x) / (pdata(x) + pg(x)).

At any fixed x, the contribution to V is: pdata(x) · log(D(x)) + pg(x) · log(1 − D(x)). This is because E_{z~pz}[log(1 − D(G(z)))] = E_{x~pg}[log(1 − D(x))] by the change of variables from z to G(z).
d/dy [a·log(y) + b·log(1-y)] = a/y − b/(1-y). Set this to zero: a/y = b/(1-y), so a(1-y) = by, giving y = a/(a+b).
With a = pdata(x) and b = pg(x), we get D*(x) = pdata(x) / (pdata(x) + pg(x)). Verify: when pg = pdata, D* = 1/2 everywhere (Nash equilibrium).

Full derivation:

V(D,G) = ∫ [pdata(x) log D(x) + pg(x) log(1 − D(x))] dx

For fixed G, maximize over D pointwise. At each x, we maximize f(y) = a·log(y) + b·log(1-y) where a = pdata(x), b = pg(x), y = D(x).

f'(y) = a/y − b/(1-y) = 0

a(1-y) = by → a − ay = by → a = y(a+b) → y = a/(a+b)

Therefore: D*(x) = pdata(x) / (pdata(x) + pg(x))

The key insight: The optimal discriminator is literally computing a density ratio. It outputs the probability that a sample came from the real distribution vs the mixture of real+fake. At equilibrium (pg = pdata), D* = 1/2 everywhere — it truly cannot distinguish real from fake.
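
The closed form can be checked numerically. The sketch below brute-forces the pointwise maximizer of a·log(y) + b·log(1−y) on a grid and compares it with a/(a+b). The density values are arbitrary illustrative numbers, not taken from any dataset.

```python
import math


def pointwise_value(y, a, b):
    """Contribution to V at one x: a*log(y) + b*log(1-y), a=pdata(x), b=pg(x)."""
    return a * math.log(y) + b * math.log(1.0 - y)


def best_y_by_grid(a, b, steps=9999):
    """Brute-force maximizer of pointwise_value over a grid inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: pointwise_value(y, a, b))


for a, b in [(0.3, 0.1), (0.2, 0.2), (0.05, 0.5)]:
    closed_form = a / (a + b)
    print(a, b, round(best_y_by_grid(a, b), 3), round(closed_form, 3))
```

The middle case has pdata(x) = pg(x), and both the grid search and the formula land on 0.5, matching the equilibrium value.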

Checkpoint — Before you move on
Explain in your own words: why does G minimizing log(1 − D(G(z))) lead to the same equilibrium as G maximizing log(D(G(z))), but with different gradient behavior early in training?
Model Answer

Both losses have the same Nash equilibrium (pg = pdata, D* = 1/2). But their gradient landscapes differ dramatically early in training:

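Saturating loss log(1 − D(G(z))): When D(G(z)) ≈ 0 (D easily rejects fakes), the gradient is ∂/∂D(G(z)) [log(1 − D(G(z)))] = −1/(1 − D(G(z))) ≈ −1. Its magnitude stays bounded, and after the chain rule through D's sigmoid it shrinks to −D(G(z)) ≈ 0 at the logit, so G barely learns.

Non-saturating loss −log(D(G(z))): When D(G(z)) ≈ 0, the gradient is ∂/∂D(G(z)) [−log(D(G(z)))] = −1/D(G(z)) → −∞. After the sigmoid this becomes −(1 − D(G(z))) ≈ −1 at the logit, a strong signal exactly where the saturating loss gives almost none.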

Same destination, vastly different journey. The non-saturating loss is preferred because it gives G useful gradients when it needs them most (early training when fakes are bad).

Chapter 3: Training Dynamics

GANs are trained by alternating updates. One step for D, one step for G, repeat. On each D step, we show it a batch of real data (label: 1) and a batch of fake data from G (label: 0). On each G step, we generate fakes and update G to make D's output on them closer to 1.

This works — sometimes. GAN training is notoriously unstable. If D gets too good too fast, G gets no useful gradient signal. If G gets too good, D can't learn. The balance is delicate.

Step 1: Train D
D sees real (label 1) + fake (label 0). Update D only.
Step 2: Train G
G generates fakes. D scores them. Update G to fool D.
↓ repeat
Training Loss Curves

Simulated D loss (teal) and G loss (orange). Healthy training: both losses oscillate around a stable value. Watch for divergence (D wins) or collapse (G wins too easily).

The Training Loop in Code

One iteration looks like this: (1) sample a real batch from your dataset, (2) sample z and generate a fake batch through G, (3) train D on both (real label=1, fake label=0) with binary cross-entropy, (4) sample new z, generate fakes, and train G to make D output 1. Two separate Adam optimizers — one for D, one for G — with learning rate balance being critical (typically 2e-4 for both, β1=0.5).

python
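# One training iteration. Assumes G, D, opt_G, opt_D and dataloader already exist,
# the loader yields image tensors, and D ends in a sigmoid (so nn.BCELoss applies).
import torch
import torch.nn as nn

bce = nn.BCELoss()
real_batch = next(iter(dataloader))     # [B, 3, 64, 64]
B = real_batch.size(0)
ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)
z = torch.randn(B, 128)                 # noise vector
fake_batch = G(z)                       # [B, 3, 64, 64]

# Train D: maximize log(D(real)) + log(1-D(fake))
opt_D.zero_grad()
loss_D = bce(D(real_batch), ones) + bce(D(fake_batch.detach()), zeros)
loss_D.backward()
opt_D.step()

# Train G: maximize log(D(G(z)))
opt_G.zero_grad()
z2 = torch.randn(B, 128)
loss_G = bce(D(G(z2)), ones)            # fool D
loss_G.backward()
opt_G.step()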
Warning signs: D loss goes to 0 (D is too strong, G can't learn). G loss oscillates wildly. Generated samples all look the same (mode collapse). GAN training requires babysitting.
Check: Why is GAN training unstable?
💥 Break-It Lab What Dies When You Remove Components?
A working GAN has three essential ingredients: discriminator training, balanced D/G updates, and stochastic noise input z. Toggle each off and watch the loss curves collapse into pathology.
Remove D Training
Failure mode: D never learns, outputs random scores. G receives meaningless gradients — random noise as a learning signal. G's outputs are random noise forever. Without a critic, there is no learning signal. This is why the discriminator is the teacher.
Over-train D (10x steps)
Failure mode: D becomes perfect: D(real)=1, D(fake)=0 with certainty. The gradient of log(1-D(G(z))) at D(G(z))=0 vanishes — G receives zero gradient. D loss flatlines at 0, G loss flatlines at a high value. This is "vanishing gradients for G" — the discriminator got too strong.
Remove Noise Input z
Failure mode: Without stochastic input, G is a deterministic function with no source of randomness. It can only produce a single fixed output. Immediate mode collapse to one point. The noise vector z is the source of all diversity — it parameterizes the latent space of possible outputs.
💻 Build It Implement the GAN Training Loop
You've seen the alternating optimization. Now write the complete inner loop: one D step and one G step, including loss computation, gradient computation, and optimizer updates. Assume PyTorch with BCE loss.
Signature
python
def train_step(G, D, opt_G, opt_D, real_batch, z_dim=128):
    """One GAN training iteration.

    Args:
        G: Generator network, takes [B, z_dim] -> [B, 3, 64, 64]
        D: Discriminator network, takes [B, 3, 64, 64] -> [B, 1]
        opt_G, opt_D: Adam optimizers
        real_batch: [B, 3, 64, 64] tensor of real images
        z_dim: latent dimension

    Returns:
        (d_loss, g_loss): scalar loss values
    """
Test case
Given real_batch of shape [32, 3, 64, 64], after one call:
- d_loss should be around 1.38 (= -log(0.5)*2, initial random D)
- g_loss should be around 0.69 (= -log(0.5), initial random D)
- G and D parameters should both be updated
When training D, we don't want gradients flowing back through G (we're only updating D). Use fake_batch.detach() or torch.no_grad() around G. When training G, we DO want gradients through G(z) into D, so don't detach.
python
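import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, real_batch, z_dim=128):
    B = real_batch.size(0)
    device = real_batch.device
    ones = torch.ones(B, 1, device=device)
    zeros = torch.zeros(B, 1, device=device)
    # BCEWithLogitsLoss applies the sigmoid itself, so D must return raw logits here.
    # If D already ends in a sigmoid (as in Chapter 1), use nn.BCELoss() instead.
    criterion = nn.BCEWithLogitsLoss()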

    # ---- Train D ----
    opt_D.zero_grad()
    # Real
    d_real = D(real_batch)
    loss_real = criterion(d_real, ones)
    # Fake (detach! don't update G)
    z = torch.randn(B, z_dim).to(real_batch.device)
    fake = G(z).detach()
    d_fake = D(fake)
    loss_fake = criterion(d_fake, zeros)
    d_loss = loss_real + loss_fake
    d_loss.backward()
    opt_D.step()

    # ---- Train G ----
    opt_G.zero_grad()
    z2 = torch.randn(B, z_dim).to(real_batch.device)
    fake2 = G(z2)           # fresh fakes, no detach
    d_fake2 = D(fake2)
    g_loss = criterion(d_fake2, ones)  # non-saturating: fool D
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
Bonus challenge: Modify this to use WGAN-GP loss: replace BCE with Wasserstein loss (E[D(fake)] - E[D(real)]) and add the gradient penalty. You'll need torch.autograd.grad for the penalty term.
🔗 Pattern Recognition
Adversarial Training = Self-Play
This Lesson (GANs)
G improves by playing against D. D improves by playing against G. Neither has a fixed target — the opponent IS the curriculum.
Reinforcement Learning
In self-play (AlphaGo, OpenAI Five), an agent improves by playing against copies of itself. The opponent's improving strength IS the curriculum. → RL Algorithms

The deep pattern: when you can't write an explicit loss function for "good behavior," create an adversary whose job is to find flaws. The protagonist learns by fixing what the adversary exploits. This is minimax optimization in both cases — the difference is whether the two players share parameters (self-play) or are separate networks (GANs).

Where else do you see adversarial training? Think about adversarial examples in robustness research, discriminators in domain adaptation, and critics in actor-critic RL.

Chapter 4: Mode Collapse

The generator's worst failure mode: it finds one thing that fools the discriminator and repeats it forever. If the real data has many modes (faces with glasses, without glasses, smiling, frowning), the generator might only learn to produce one type of face. This is mode collapse.

Why does it happen? The generator optimizes to fool D, not to cover all modes. If one particular output consistently gets high D scores, G concentrates there. D eventually catches on, but G just jumps to a new single mode — creating a cycle called mode hopping.

The minibatch tells the story. Sample a batch of 64 images from G. Compute pairwise L2 distances between all 64 images. In healthy training, the mean distance should match the diversity of real data. In mode collapse, the mean distance drops toward zero — all 64 images are nearly identical. This is a dead-simple diagnostic you can compute every 100 training steps.
Mode Collapse Simulation

Teal clusters = real data modes. Orange dots = generator outputs. In healthy training, orange covers all clusters. In collapse, it concentrates on one.

Partial collapse is when some modes are covered but others are missing. Full collapse is when all generated samples are nearly identical. Both are common in vanilla GANs.

Detection and Fixes

Detection: Generate 1000 samples and compute pairwise distances. If the mean distance is near zero, or far below the same statistic on real data, you have collapse. You can also track the Fréchet Inception Distance (FID), which is sensitive to both quality and diversity. FID alone can miss the problem, though: a passable FID alongside near-duplicate samples still points to mode collapse.
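
A minimal sketch of the pairwise-distance diagnostic, written in plain Python so it runs without a trained model. The "images" here are random 16-dimensional vectors standing in for flattened generator outputs; in real use the samples would come from G and the threshold would be calibrated against real data.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(samples):
    """Mean L2 distance over all unordered pairs of flattened samples."""
    pairs = list(combinations(samples, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


random.seed(0)
dim = 16  # stand-in for a flattened image

# Healthy generator: every sample is drawn independently.
diverse = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]

# Collapsed generator: one fixed output plus tiny pixel-level noise.
base = [random.gauss(0, 1) for _ in range(dim)]
collapsed = [[v + random.gauss(0, 0.01) for v in base] for _ in range(64)]

print(mean_pairwise_distance(diverse))    # large: samples differ
print(mean_pairwise_distance(collapsed))  # near zero: near-duplicates
```

For batches larger than a few hundred samples the quadratic number of pairs gets expensive, so a random subset of pairs is usually enough for monitoring.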

Fixes: Spectral normalization constrains D's Lipschitz constant per-layer, preventing it from becoming too sharp. Progressive growing starts training at 4×4 resolution and slowly adds layers, giving G time to learn at each scale. Minibatch discrimination gives D statistics about the whole batch, so it can detect when all samples look identical.

Symptom | What you see
Full collapse | All generated samples look identical
Partial collapse | Some categories of data are never generated
Mode hopping | Generator cycles between modes without covering all at once
Check: What is mode collapse?
⚔ Adversarial: Your GAN generates only 3 distinct faces regardless of the input z
You're training a face GAN on CelebA (200K images, huge diversity). After 50K iterations, you sample 1000 images with different z vectors. Visual inspection shows only 3 distinct faces (with minor pixel-level noise variations). The D loss is 0.6, G loss is 0.7. The FID score is 45 (reasonable quality). What is happening, and what structural fix addresses it?
🔨 Derivation Why JS Divergence Rewards Mode Collapse

The original GAN loss, when the optimal D* is substituted back, equals 2·JSD(pdata || pg) − log 4. The JS divergence is symmetric and bounded: 0 ≤ JSD ≤ log 2.

Your task: Show that a generator covering only ONE mode of a multi-modal pdata can achieve a low JSD, and explain why this makes mode collapse a local minimum of the GAN objective.

JSD(P||Q) = (KL(P||M) + KL(Q||M))/2 where M = (P+Q)/2. It penalizes both: (1) places where P has mass but Q doesn't, and (2) places where Q has mass but P doesn't. But these two penalties are averaged — a generator covering one mode perfectly has zero penalty on that mode.
Strategy A: G spreads thin across all modes (imperfect coverage of each). Strategy B: G perfectly covers one mode, ignores others. JSD is bounded by log 2. Strategy B pays only a small KL(Q||M) on the covered mode (Q and M differ there by a constant factor), and the penalty from uncovered modes is bounded. Strategy A might have higher KL everywhere because imperfect coverage of each mode means Q deviates from P at every point.
G is trained by gradient descent. If G currently covers one mode well, the gradient to "spread out" to another mode requires crossing a region where pdata = 0. In this region, D* = pdata/(pdata + pg) → 0, so D confidently labels any sample there as fake and G is penalized for placing mass in the gap. The loss surface has a steep local minimum around each mode.

The argument:

Consider pdata = 0.5·δ(x−a) + 0.5·δ(x−b) (two modes at a and b).

Case 1: pg = δ(x−a) (covers one mode perfectly). Then M = 0.75·δ(x−a) + 0.25·δ(x−b). KL(pdata||M) = 0.5·log(0.5/0.75) + 0.5·log(0.5/0.25) = 0.5·(−0.405 + 0.693) = 0.144. KL(pg||M) = log(1/0.75) = 0.288. JSD = (0.144 + 0.288)/2 = 0.216.

Case 2: pg = 0.5·δ(x−a) + 0.5·δ(x−b) (covers both perfectly). JSD = 0.

So the global optimum (Case 2) has JSD=0, but Case 1 (mode collapse) has JSD=0.216, which is a local minimum. The problem: to go from Case 1 to Case 2, G must move probability mass through regions of zero pdata, where the gradient signal from D is adversarial (D says "that's fake!" in empty regions). The loss landscape has a valley around each mode.

The key insight: JSD doesn't penalize mode collapse as harshly as you'd expect because it averages two KL terms. Covering one mode perfectly (half the job) gets you to JSD = 0.216 out of max 0.693 — already 69% of the way to optimal. The last 31% requires crossing adversarial territory, making mode collapse a stable attractor.
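
The numbers in the two-mode example can be reproduced in a few lines. This sketch treats the delta functions as a discrete distribution over three points (mode a, mode b, and an off-data point c); the third point is an addition for illustration, used only to show the log 2 ceiling.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for discrete distributions; zero-mass terms add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.5, 0.5, 0.0]      # two real modes at a and b, nothing at c
collapsed = [1.0, 0.0, 0.0]   # Case 1: all mass on mode a
matched = [0.5, 0.5, 0.0]     # Case 2: both modes covered
off_data = [0.0, 0.0, 1.0]    # all mass where there is no real data

print(round(jsd(p_data, collapsed), 3))  # 0.216
print(round(jsd(p_data, matched), 3))    # 0.0
print(round(jsd(p_data, off_data), 3))   # 0.693
```

The collapsed generator sits much closer to the optimum than to the worst case, which is the quantitative version of the argument above.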

Chapter 5: WGAN & Stabilization

The original GAN loss has a fundamental problem: when D is perfect, the gradients for G vanish. The Wasserstein GAN (WGAN) fixes this by replacing the JS divergence with the Wasserstein distance (Earth Mover's Distance) — a metric that provides useful gradients even when distributions don't overlap.
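
A toy 1D comparison shows what "useful gradients even when distributions don't overlap" means in practice. The sketch assumes two equal-size samples on the real line, for which the Wasserstein-1 distance reduces to the mean gap between sorted values. The JS divergence is computed on the empirical distributions. Both helper names are made up for this example.

```python
import math
from collections import Counter


def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D samples: mean gap between sorted values."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def jsd_discrete(xs, ys):
    """JS divergence (nats) between the empirical distributions of two samples."""
    p, q = Counter(xs), Counter(ys)
    total = 0.0
    for v in set(p) | set(q):
        pv, qv = p[v] / len(xs), q[v] / len(ys)
        mv = (pv + qv) / 2
        if pv > 0:
            total += 0.5 * pv * math.log(pv / mv)
        if qv > 0:
            total += 0.5 * qv * math.log(qv / mv)
    return total


real = [0, 1, 2, 3]
for shift in (4, 8, 16):
    fake = [x + shift for x in real]  # supports never overlap
    print(shift, wasserstein_1d(real, fake), round(jsd_discrete(real, fake), 3))
```

Every shift gives the same JS value, log 2, so JS cannot tell the generator whether it is getting closer. The Wasserstein distance grows linearly with the shift, so its slope always points back toward the data.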

The WGAN critic (no longer called a "discriminator" since it doesn't classify) outputs an unbounded score, not a probability. No sigmoid on the final layer. The score difference between real and fake data estimates the Wasserstein distance. But the critic must satisfy a Lipschitz constraint: its output can't change too fast as the input changes. Without this constraint, the critic could assign arbitrarily large scores, making the optimization meaningless.

W(pr, pg) = sup_{‖f‖_L ≤ 1} E_{x~pr}[f(x)] − E_{z~pz}[f(G(z))]
Wasserstein vs JS Distance

Two 1D distributions. Move them apart. Wasserstein provides a smooth gradient everywhere. JS divergence saturates when distributions don't overlap.

Technique | How | Purpose
Weight clipping | Clamp D weights to [-c, c] | Enforce Lipschitz (crude)
Gradient penalty (GP) | Penalize ||∇D|| ≠ 1 | Better Lipschitz enforcement
Spectral normalization | Normalize weight matrices by spectral norm | Control Lipschitz constant per-layer
R1 penalty | Penalize ||∇D(xreal)||² | Stabilize D on real data

Gradient Penalty: The Implementation

Weight clipping is crude — it pushes weights to the edges of [−c, c], wasting capacity. The gradient penalty (WGAN-GP) is much better. It samples random interpolations between real and fake data, then penalizes deviations from unit gradient norm:

L_GP = λ · E_{x̂}[(‖∇_x̂ D(x̂)‖_2 − 1)²]    where   x̂ = α · xreal + (1−α) · xfake,   α ~ U(0,1)

This enforces the 1-Lipschitz constraint: D's output can change by at most 1 unit per unit change in input. The interpolation trick samples points between the real and fake distributions, exactly where the gradient matters most. Typical λ = 10. The total critic loss becomes: E[D(fake)] − E[D(real)] + LGP, where LGP already includes the factor λ.
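
One way the penalty could be written in PyTorch is sketched below. It assumes a critic that maps a batch to one unbounded score per sample. The `gradient_penalty` function, the toy linear critic, and the 8-dimensional inputs are all illustrative choices, not part of any reference implementation.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake, lam=10.0):
    """lam * mean((||grad of critic at x_hat||_2 - 1)^2) on random interpolates."""
    batch = real.size(0)
    # One alpha per sample, broadcast over all remaining dimensions.
    alpha = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = alpha * real.detach() + (1 - alpha) * fake.detach()
    x_hat.requires_grad_(True)
    scores = critic(x_hat)
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(),
        inputs=x_hat,
        create_graph=True,  # keep the graph so the penalty can be backpropagated
    )
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()


# Toy usage: a hypothetical linear critic on flat 8-dimensional "images".
torch.manual_seed(0)
critic = nn.Linear(8, 1)
real, fake = torch.randn(16, 8), torch.randn(16, 8)
gp = gradient_penalty(critic, real, fake)
critic_loss = critic(fake).mean() - critic(real).mean() + gp
critic_loss.backward()
```

The `create_graph=True` argument matters: the penalty is a function of a gradient, so updating the critic requires differentiating through that gradient a second time.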

Impact: WGAN + gradient penalty made GAN training dramatically more stable. It reduced mode collapse and made the loss curves meaningful: the critic's Wasserstein estimate, E[D(real)] − E[D(fake)], tends to shrink as image quality improves.
Check: Why is Wasserstein distance better than JS divergence for GANs?
🔨 Derivation From Earth Mover's Distance to the 1-Lipschitz Constraint

The Wasserstein-1 distance is: W(P, Q) = inf_{γ ∈ Π(P,Q)} E_{(x,y)~γ}[||x − y||] where Π(P,Q) is the set of all joint distributions with marginals P and Q.

Your task: Using the Kantorovich-Rubinstein duality, show that W(P, Q) = sup_{‖f‖_L ≤ 1} E_{x~P}[f(x)] − E_{y~Q}[f(y)], and explain why the supremum must be taken over 1-Lipschitz functions (not arbitrary functions).

A function f is 1-Lipschitz if |f(x) − f(y)| ≤ ||x − y|| for all x, y. The function's output can change by at most 1 unit per unit change in input. This bounds its gradient: ||∇f|| ≤ 1 everywhere.
If f is unrestricted, we could choose f(x) = +∞ on the support of P and f(x) = −∞ on the support of Q. The difference E[f(x)] − E[f(y)] would be infinite. The Lipschitz constraint prevents f from being too "steep" — it can only assign high scores to P and low scores to Q at a rate proportional to the actual distance between the distributions.
Think of it this way: the primal (transport plan γ) asks "what's the cheapest way to move mass from P to Q?" The dual (Lipschitz function f) asks "what's the maximum profit a 1-Lipschitz pricing function can extract from the difference between P and Q?" Strong duality tells us these are equal. The critic in WGAN is learning this optimal pricing function f.

The Kantorovich-Rubinstein duality (sketch):

The primal problem: W(P,Q) = min_γ ∫∫ ||x−y|| dγ(x,y) subject to γ having marginals P and Q.

This is a linear program (linear objective, linear constraints on marginals). By LP duality, the dual is: W(P,Q) = max_{f,g} E_P[f(x)] + E_Q[g(y)] subject to f(x) + g(y) ≤ ||x−y|| for all x,y.

The constraint f(x) + g(y) ≤ ||x−y|| implies (setting y=x): f(x) + g(x) ≤ 0 for all x. The optimal solution has g = −f (complementary slackness), so the constraint becomes: f(x) − f(y) ≤ ||x−y||, i.e., f is 1-Lipschitz.

Substituting g = −f: W(P,Q) = max_{‖f‖_L ≤ 1} E_P[f(x)] − E_Q[f(y)]

The key insight: The WGAN critic approximates this optimal f using a neural network. The 1-Lipschitz constraint is enforced via weight clipping (crude), gradient penalty (better), or spectral normalization (best). Without the constraint, the critic could assign ±∞ scores and the optimization would be meaningless. The Lipschitz bound makes the critic's "opinions" proportional to actual distributional distance.

⚔ Adversarial: Your WGAN critic loss is -500 and keeps decreasing
You're training a WGAN-GP. After 10K iterations, the critic loss (E[D(fake)] - E[D(real)]) is -500 and steadily decreasing. The gradient penalty is 0.01 (nearly zero). Generated images look like noise. What went wrong?

Chapter 6: StyleGAN

StyleGAN (Karras et al., 2019) revolutionized image generation with two key ideas: a style-based generator (inject style vectors at different layers to control coarse vs fine features, which also enables style mixing) and progressive growing, inherited from its predecessor ProGAN (start training at low resolution and gradually increase). The result: photorealistic 1024×1024 faces that fooled humans ~50% of the time.

The generator architecture uses a mapping network (8-layer MLP) to transform z into a style vector w, which is then injected into each layer via adaptive instance normalization (AdaIN). Different layers control different scales: early layers = pose, face shape; later layers = hair color, fine texture.
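
The AdaIN step itself is simple: normalize each channel of a feature map to zero mean and unit variance, then rescale and shift it with a per-channel scale and bias derived from the style. The sketch below works on plain Python lists for a single sample. In StyleGAN the scales and biases come from a learned affine transform of w; here they are made-up constants.

```python
from statistics import fmean, pstdev


def adain(channels, scales, biases, eps=1e-8):
    """Adaptive instance normalization for one sample.

    channels: list of feature maps, each flattened to a list of floats
    scales, biases: one style-derived scale and bias per channel
    """
    out = []
    for feat, scale, bias in zip(channels, scales, biases):
        mu, sigma = fmean(feat), pstdev(feat)
        out.append([scale * (v - mu) / (sigma + eps) + bias for v in feat])
    return out


# Two channels with very different statistics.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
styled = adain(features, scales=[2.0, 0.5], biases=[0.0, 1.0])

for ch in styled:
    print(round(fmean(ch), 3), round(pstdev(ch), 3))
```

After the call, each channel's mean equals its style bias and its standard deviation equals its style scale, whatever the incoming statistics were. That is how the style overrides the feature statistics at each layer.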

Noise z
512-dim random vector
↓ Mapping Network (8 MLP layers)
Style w
512-dim disentangled style
↓ Inject via AdaIN at each layer
Synthesis Network
4×4 → 8×8 → ... → 1024×1024
Style Mixing Simulator

Two style vectors: Style A and Style B. The crossover point determines which layers use A vs B. Early layers = structure, late layers = details.


StyleGAN by the Numbers

In StyleGAN2 (1024×1024 faces): z is 512-dim, the mapping network is 8 fully-connected layers (512→512 each with LeakyReLU), producing w (also 512-dim). The synthesis network has 18 layers (2 per resolution from 4×4 to 1024×1024). Style w is injected at every layer via weight demodulation (StyleGAN2 replaced AdaIN). Total parameters: about 2M for the mapping network (eight 512×512 weight matrices plus biases), out of roughly 30M for the full generator.

Why a mapping network? The raw z space is entangled: moving in one direction changes multiple features. The mapping network learns a less entangled w space where directions tend to line up with single semantic attributes (smile, age, glasses).
Check: In StyleGAN, early layers of the synthesis network control...
🏗 Design Challenge You're the Architect: 1024×1024 Face Generation
Your team needs a GAN that generates photorealistic 1024×1024 faces for a synthetic data company. The model must produce diverse outputs (no mode collapse), train stably on 8 A100 GPUs in under 2 weeks, and support attribute control (age, pose, expression).
Resolution: 1024×1024 RGB
Hardware: 8× A100 (80GB each)
Training budget: 14 days max
Dataset: FFHQ (70K images, 1024×1024)
Diversity requirement: FID < 5, all attributes represented
Inference latency: <200ms per image
1. Progressive growing (start 4×4, add layers) vs direct 1024×1024 training: which do you choose and why?
2. Discriminator: PatchGAN, standard CNN, or multi-scale? What stabilization (GP, spectral norm, R1)?
3. How do you enable attribute control? Conditional labels, disentangled latent space, or both?
4. What's your training stability strategy? How do you detect and recover from mode collapse mid-training?

The answer is StyleGAN2/3:

Architecture: Skip progressive growing (it causes phase artifacts). Use StyleGAN2's direct training with skip connections and residual discriminator. 8-layer mapping network (z→w), weight demodulation instead of AdaIN, path length regularization for smooth latent space.

Discriminator: Multi-scale residual CNN with R1 regularization (λ=10, applied every 16 steps for efficiency). R1 is simpler than GP and works better in practice: just penalize ||∇D(xreal)||². No spectral norm needed with R1.

Attribute control: The disentangled W space gives this for free. After training, find linear directions in W space for each attribute (age, pose, smile) using labeled probes. No conditional training needed — the mapping network naturally disentangles attributes.

Stability: Exponential moving average of G weights (for inference), lazy regularization (R1 every 16 steps, path length penalty every 4 steps), learning rate = 2.5e-3 for G, 2.5e-3 for D with equalized learning rate. Monitor FID every 5K steps, track PPL (perceptual path length) for smoothness. On 8 A100s with batch size 32: ~7 days to convergence at 25M images seen.

Chapter 7: Conditional GANs

Vanilla GANs generate random samples. Conditional GANs generate samples conditioned on some input: a class label, a text prompt, or even another image. The condition is fed to both G and D.

pix2pix translates one image to another (sketch → photo, satellite → map). CycleGAN does unpaired translation (horses ↔ zebras) using a cycle consistency loss. The PatchGAN discriminator classifies each N×N patch as real or fake, rather than the whole image.

Condition c
Class label, text, source image, ...
G(z, c)
Generate sample matching condition
D(x, c)
Real sample matching condition? Or fake?
PatchGAN Discriminator

Instead of one real/fake score for the whole image, PatchGAN gives a score per patch. Green = real, red = fake. This preserves high-frequency details.

Model | Task | Key trick
pix2pix | Paired image translation | L1 loss + PatchGAN
CycleGAN | Unpaired image translation | Cycle consistency loss
SPADE | Semantic → photo | Spatially-adaptive normalization
GauGAN | Landscape painting tool | SPADE + style encoder
CycleGAN's insight: If you translate horse → zebra → horse, you should get back the original horse. This "cycle consistency" constraint makes unpaired translation possible.
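
As a rough sketch, the cycle-consistency term is an L1 distance between an input and its round-trip translation, summed over both directions. The "translators" below are toy functions on lists of floats standing in for the two networks. The names `G`, `F_good`, and `F_bad` are hypothetical.

```python
def l1(xs, ys):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def cycle_loss(x, y, G, F):
    """L1 cycle consistency: F(G(x)) should return to x, and G(F(y)) to y."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)


def G(xs):
    """Toy X -> Y translator."""
    return [2.0 * v + 1.0 for v in xs]


def F_good(ys):
    """Toy Y -> X translator that exactly inverts G."""
    return [(v - 1.0) / 2.0 for v in ys]


def F_bad(ys):
    """Toy Y -> X translator that ignores its input."""
    return [0.0 for _ in ys]


x = [0.5, -1.0, 2.0]   # sample from domain X
y = [3.0, 0.0, -2.0]   # sample from domain Y
print(cycle_loss(x, y, G, F_good))  # round trips recover the inputs
print(cycle_loss(x, y, G, F_bad))   # content is lost, so the loss is large
```

A translator that throws away the input's content cannot be inverted, so this term penalizes it even though no paired examples exist.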

How Conditioning Works

In a class-conditional GAN, the condition (e.g., "cat" = class 5) is embedded into a vector and concatenated with z before entering G. In D, the class embedding is concatenated with the feature vector before the final classification. For image-conditional GANs like pix2pix, the source image is literally concatenated channel-wise with the input: D receives a 6-channel input (3 from the source, 3 from the generated/real target).

G: (z, c) → image     D: (image, c) → real/fake

The projection discriminator (Miyato & Koyama, 2018) is more elegant: instead of concatenation, it takes the inner product between the class embedding and D's feature vector. This gives a per-class real/fake score and trains more stably than concatenation, especially with many classes.
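
A minimal sketch of a projection-style output head in PyTorch, assuming D has already reduced the image to a feature vector h. The `ProjectionHead` class, its dimensions, and the random features are illustrative. A full discriminator would put a convolutional trunk in front of it.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """Output head: psi(h) + <embed(y), h>, where h is D's feature vector."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.psi = nn.Linear(feat_dim, 1)                  # unconditional term
        self.embed = nn.Embedding(num_classes, feat_dim)   # one vector per class

    def forward(self, h, y):
        uncond = self.psi(h)                                    # [B, 1]
        proj = (self.embed(y) * h).sum(dim=1, keepdim=True)     # [B, 1]
        return uncond + proj                                    # raw score


torch.manual_seed(0)
head = ProjectionHead(feat_dim=512, num_classes=10)
h = torch.randn(4, 512)           # stand-in for D's pooled features
y = torch.tensor([5, 5, 0, 9])    # class labels
logits = head(h, y)               # shape [4, 1]
```

The same features scored against a different label give a different output, so the head can judge "real and of class y" rather than only "real".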

Check: What makes a PatchGAN discriminator different from a regular one?

Chapter 8: GANs Today

For pure image quality and diversity, diffusion models have overtaken GANs. Models like Stable Diffusion, DALL-E 3, and Imagen produce higher-quality, more diverse outputs with more stable training. GANs had their era — roughly 2014-2021 — as the undisputed kings of generative modeling.

But GANs aren't dead. They remain dominant where speed matters: real-time face filters (Snapchat, TikTok), game asset generation, super-resolution (Real-ESRGAN), video prediction, and fast inference. A GAN generates an image in one forward pass; diffusion models need 20–50 denoising steps. For applications requiring <100ms latency, GANs are still the only game in town.

GAN vs Diffusion: Tradeoff Space

Quality vs speed. GANs are fast but less diverse. Diffusion is slow but higher quality. Hybrid approaches try to get both.

Dimension | GANs | Diffusion
Speed | 1 forward pass (~50ms) | 20-50 steps (~seconds)
Quality | Good but mode collapse risk | Excellent diversity
Training | Unstable, requires tricks | Stable, simple loss
Control | Conditional, but limited | Text guidance, inpainting, etc.
Still used for | Real-time apps, super-resolution | Everything else

The Speed Gap

A StyleGAN2 generator produces a 1024×1024 image in ~50ms on a single GPU. Stable Diffusion takes ~3 seconds (50 steps at ~60ms each) for a 512×512 image. That's a 60x speed gap. For real-time applications (face filters at 30fps, game asset generation, live video), GANs remain the only viable option. Diffusion models need consistency distillation or adversarial distillation to close this gap.

Distillation: A growing trend is using adversarial training to distill diffusion models: train a few-step student to match the diffusion teacher's output, with a discriminator pushing the student's samples toward realism. This gives near-GAN speed with diffusion quality. SDXL-Turbo takes this route (adversarial diffusion distillation), while LCM reaches 1–4 step sampling through consistency distillation without a discriminator.
🔗 Pattern Recognition
Noise → Image: Three Different Roads
GANs (This Lesson)
z ~ N(0,I) → G(z) → image. One-shot transformation. Implicit density (never computes p(x)). Trained via adversarial game.
Diffusion Models
xT ~ N(0,I) → denoise → xT-1 → ... → x0. Iterative refinement (50 steps). Explicit density. Trained via denoising score matching. → Diffusion
GANs (This Lesson)
No encoder. Cannot compute p(x) or reconstruct. Generator is a one-way mapping from noise to data.
VAEs
Encoder q(z|x) + decoder p(x|z). Explicit ELBO objective. Can reconstruct and interpolate. Trained via variational inference. → VAE & VQ-VAE

All three start from noise and produce images. The difference is the mapping: GANs learn it in one shot (fast but unstable), diffusion iterates (slow but stable), VAEs encode/decode (fast but blurry). Modern approaches combine them: diffusion backbone with GAN-based distillation for speed, or VAE encoder with GAN discriminator for sharpness.

The adversarial loss appears in many non-GAN models (e.g., the adversarial term added alongside perceptual loss in super-resolution, the discriminator in VAE-GAN). Why does adding a discriminator to any generative model tend to sharpen outputs?

"The most interesting idea in the last 10 years in ML."
— Yann LeCun, on GANs (2016)

You now understand the adversarial game, its dynamics, its failures, and how it was stabilized. GANs may no longer be the state of the art, but their ideas echo in every generative model.