The Complete Beginner's Path

Understand Generative
Adversarial Networks

Two neural networks locked in a game that conjures photorealistic images from pure noise. The idea that launched a thousand deepfakes.

Prerequisites: Neural network basics + What a loss function is. That's it.
9 chapters · 8+ interactives · 0 assumed knowledge

Chapter 0: The Adversarial Idea

Imagine a counterfeiter (the generator G) trying to produce fake banknotes, and a detective (the discriminator D) trying to spot them. The counterfeiter gets better by studying what the detective catches. The detective gets better by studying fakes that slip through.

Over time, both improve. Eventually, the counterfeiter's fakes are indistinguishable from real banknotes. That's the GAN idea: two networks competing until the generator produces data so realistic that no discriminator can tell the difference.

The core insight: You don't need to define "what makes a good image" explicitly. You just need a discriminator that can tell real from fake. The generator learns quality implicitly by fooling the discriminator.
Counterfeiter vs Detective

Watch the game unfold. Teal = real data distribution. Orange = generator's current output. Purple line = discriminator boundary. Click "Train Step" to advance.

Check: What are the two players in a GAN?

Chapter 1: Generator & Discriminator

The generator takes random noise z (sampled from a simple distribution like a Gaussian) and transforms it into a data sample (e.g., an image). It's a neural network mapping from noise space to data space: G(z) → fake image.

The discriminator takes a data sample (real or generated) and outputs a single number: the probability that the input is real. It's a classifier: D(x) → [0, 1].

Random Noise z
z ~ N(0, I), typically 128-512 dims
↓ Generator G
Fake Image G(z)
Same dimensions as real data
↓ Discriminator D
Real or Fake?
D(x) ∈ [0, 1]
Noise to Data Mapping

The generator learns a function from noise (left, uniform dots) to data (right, structured distribution). Each orange dot is a noise sample mapped through G. Click "Retrain" to see a different mapping.

Component | Input | Output | Goal
Generator G | Random noise z | Fake data G(z) | Fool D into saying "real"
Discriminator D | Data sample x | P(real) ∈ [0, 1] | Correctly classify real vs fake
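The shapes in the table are easy to make concrete. This is a minimal sketch in numpy with randomly initialized (untrained) weights and made-up toy dimensions — 128-dim noise mapped to a flattened 28×28 "image" — just to show the mapping G: noise space → data space and D: data space → [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 128-dim noise -> 28x28 "image" flattened to 784 values.
NOISE_DIM, HIDDEN, DATA_DIM = 128, 256, 784

# Randomly initialized weights stand in for trained parameters.
G_W1 = rng.normal(0, 0.05, (NOISE_DIM, HIDDEN))
G_W2 = rng.normal(0, 0.05, (HIDDEN, DATA_DIM))
D_W1 = rng.normal(0, 0.05, (DATA_DIM, HIDDEN))
D_w2 = rng.normal(0, 0.05, (HIDDEN, 1))

def generator(z):
    """G(z): noise space -> data space. tanh keeps pixels in [-1, 1]."""
    h = np.tanh(z @ G_W1)
    return np.tanh(h @ G_W2)

def discriminator(x):
    """D(x): data space -> probability the input is real."""
    h = np.tanh(x @ D_W1)
    logit = h @ D_w2
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> [0, 1]

z = rng.normal(size=(16, NOISE_DIM))   # z ~ N(0, I), a batch of 16
fake = generator(z)                    # shape (16, 784): same dims as real data
scores = discriminator(fake)           # shape (16, 1): P(real) per sample
print(fake.shape, scores.shape)
```

Note how nothing about "image quality" appears anywhere: G is just a function from noise to data, and D is just a classifier whose output is always between 0 and 1.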
Check: What does the generator take as input?

Chapter 2: The Min-Max Game

The GAN training objective is a minimax game: the discriminator tries to maximize V(D,G), while the generator tries to minimize it. At the Nash equilibrium, the generator produces perfect fakes and the discriminator outputs 0.5 for everything ("I can't tell").

min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]

Breaking this down: D wants to maximize log D(x) (output 1 for real data) AND maximize log(1 − D(G(z))) (output 0 for fakes). G wants to minimize log(1 − D(G(z))) (make D output 1 for fakes).

D's perspective: Maximize V. "I want D(real) = 1 and D(fake) = 0." This is just binary cross-entropy.
G's perspective: Minimize V. "I want D(G(z)) = 1." Make the discriminator think fakes are real.
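At the Nash equilibrium, D(x) = 0.5 on every input, so each expectation collapses to log(0.5) and the value of the game is −2 log 2 ≈ −1.386. A one-line numeric check of that claim:

```python
import numpy as np

# At the Nash equilibrium the discriminator is maximally confused:
# D(x) = 0.5 for every input, real or fake.
d_real = 0.5   # D(x) on real samples
d_fake = 0.5   # D(G(z)) on generated samples

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
V = np.log(d_real) + np.log(1 - d_fake)

print(V, -2 * np.log(2))  # both equal -1.3862...
```

Seeing −2 log 2 in your D loss curve (or log 4 when the loss is negated) is one sign training is near this balanced point.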
Min-Max Landscape

D(G(z)) is the discriminator's output on fake data. See how the loss changes for each player. The teal curve is D's loss; the orange curve is G's loss.

In practice: Instead of G minimizing log(1 − D(G(z))), we use the non-saturating loss: G maximizes log D(G(z)). This gives stronger gradients early in training when D easily rejects fakes.
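The gradient asymmetry behind the non-saturating trick is easy to check numerically. A sketch treating s = D(G(z)) as a scalar, at a point early in training where D confidently rejects fakes (s near 0):

```python
import numpy as np

s = 0.01  # D(G(z)): early in training, D confidently rejects fakes

# Saturating loss: G minimizes log(1 - s).  d/ds = -1/(1 - s) -> bounded near s = 0.
grad_saturating = abs(-1.0 / (1.0 - s))

# Non-saturating loss: G maximizes log(s).  d/ds = 1/s -> large near s = 0.
grad_non_saturating = abs(1.0 / s)

print(grad_saturating, grad_non_saturating)  # ~1.01 vs 100.0
```

The saturating objective gives G a gradient of magnitude ~1 exactly when it needs the strongest signal, while the non-saturating objective gives ~100: that factor is why the swap matters in practice.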
Check: At the Nash equilibrium of the GAN game, what does D output?

Chapter 3: Training Dynamics

GANs are trained by alternating updates. One step for D, one step for G, repeat. On each D step, we show it a batch of real data (label: 1) and a batch of fake data from G (label: 0). On each G step, we generate fakes and update G to make D's output on them closer to 1.

This works — sometimes. GAN training is notoriously unstable. If D gets too good too fast, G gets no useful gradient signal. If G gets too good, D can't learn. The balance is delicate.

Step 1: Train D
D sees real (label 1) + fake (label 0). Update D only.
Step 2: Train G
G generates fakes. D scores them. Update G to fool D.
↓ repeat
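The two alternating steps above can be sketched end to end on a toy 1D problem. Everything here is illustrative, not a real GAN: real data is a Gaussian at mean 3, the "generator" is a single shift parameter theta, the "discriminator" is one logistic unit, and the gradients are written out by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D GAN: real data ~ N(3, 0.5); generator shifts its noise by theta.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
a, b = 0.0, 0.0     # discriminator D(x) = sigmoid(a*x + b)
theta = 0.0         # generator G(z) = z + theta, so fakes ~ N(theta, 0.5)
lr_d, lr_g, batch = 0.05, 0.05, 64

for step in range(2000):
    real = rng.normal(3.0, 0.5, batch)              # label 1
    fake = rng.normal(0.0, 0.5, batch) + theta      # label 0

    # Step 1: train D (gradient ascent on log D(real) + log(1 - D(fake)))
    p_real, p_fake = sigmoid(a * real + b), sigmoid(a * fake + b)
    a += lr_d * (np.mean((1 - p_real) * real) - np.mean(p_fake * fake))
    b += lr_d * (np.mean(1 - p_real) - np.mean(p_fake))

    # Step 2: train G (non-saturating: ascend log D(G(z)) with D frozen)
    fake = rng.normal(0.0, 0.5, batch) + theta
    p_fake = sigmoid(a * fake + b)
    theta += lr_g * np.mean((1 - p_fake) * a)

print(round(theta, 2))  # theta has moved from 0 toward the real mean, 3
```

Even in one dimension the delicacy shows up: theta does not settle exactly at 3 but oscillates around it as D keeps re-adapting — a miniature version of the loss oscillation described above.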
Training Loss Curves

Simulated D loss (teal) and G loss (orange). Healthy training: both losses oscillate around a stable value. Watch for divergence (D wins) or collapse (G wins too easily).

Warning signs: D loss goes to 0 (D is too strong, G can't learn). G loss oscillates wildly. Generated samples all look the same (mode collapse). GAN training requires babysitting.
Check: Why is GAN training unstable?

Chapter 4: Mode Collapse

The generator's worst failure mode: it finds one thing that fools the discriminator and repeats it forever. If the real data has many modes (faces with glasses, without glasses, smiling, frowning), the generator might only learn to produce one type of face. This is mode collapse.

Why does it happen? The generator optimizes to fool D, not to cover all modes. If one particular output consistently gets high D scores, G concentrates there. D eventually catches on, but G just jumps to a new single mode.

Mode Collapse Simulation

Teal clusters = real data modes. Orange dots = generator outputs. In healthy training, orange covers all clusters. In collapse, it concentrates on one.

In partial collapse, some modes are covered but others are missing; in full collapse, all generated samples are nearly identical. Both are common in vanilla GANs.
Symptom | What you see
Full collapse | All generated samples look identical
Partial collapse | Some categories of data are never generated
Mode hopping | Generator cycles between modes without covering all at once
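A crude way to detect the symptoms in the table is to count how many known data modes receive at least one generated sample. This is a toy 1D sketch — `covered_modes`, the mode centers, and the radius threshold are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def covered_modes(samples, centers, radius=1.0):
    """Count how many real-data modes receive at least one generated sample."""
    dists = np.abs(samples[:, None] - centers[None, :])  # 1D distances
    return int(np.sum(np.any(dists < radius, axis=0)))

centers = np.array([-6.0, 0.0, 6.0])            # three real-data modes

# Healthy generator: outputs spread over all three modes.
healthy = np.concatenate([rng.normal(c, 0.3, 50) for c in centers])
# Collapsed generator: every sample lands on the middle mode.
collapsed = rng.normal(0.0, 0.3, 150)

print(covered_modes(healthy, centers),    # 3: all modes covered
      covered_modes(collapsed, centers))  # 1: full collapse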
Check: What is mode collapse?

Chapter 5: WGAN & Stabilization

The original GAN loss has a fundamental problem: when D is perfect, the gradients for G vanish. The Wasserstein GAN (WGAN) fixes this by replacing the JS divergence with the Wasserstein distance (Earth Mover's Distance) — a metric that provides useful gradients even when distributions don't overlap.

The WGAN critic (no longer a "discriminator") outputs an unbounded score, not a probability. It must satisfy a Lipschitz constraint: its output can't change too fast as the input changes.

W(p_r, p_g) = sup_{||f||_L ≤ 1} E_{x~p_r}[f(x)] − E_{z~p_z}[f(G(z))]
Wasserstein vs JS Distance

Two 1D distributions. Move them apart. Wasserstein provides a smooth gradient everywhere. JS divergence saturates when distributions don't overlap.

Technique | How | Purpose
Weight clipping | Clamp D weights to [−c, c] | Enforce Lipschitz (crude)
Gradient penalty (GP) | Penalize (||∇D|| − 1)² | Better Lipschitz enforcement
Spectral normalization | Normalize weight matrices by spectral norm | Control Lipschitz constant per layer
R1 penalty | Penalize ||∇D(x_real)||² | Stabilize D on real data
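The saturation problem the interactive illustrates can be reproduced with two point masses, all probability at position 0 vs all at position d. A small sketch (the helper functions are ad hoc, written just for this comparison):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_point_masses(x1, x2):
    """Wasserstein-1 between two unit point masses is just their distance."""
    return abs(x1 - x2)

# All mass at position 0 vs all mass at position d, for growing separation d.
p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for d in (1.0, 5.0, 10.0):
    print(d, round(js_divergence(p, q), 4), w1_point_masses(0.0, d))
# JS is stuck at log(2) ~ 0.6931 no matter how far apart the masses are;
# W1 grows with the separation, so it still provides a training signal.
```

JS is constant in d, so its gradient with respect to the generator's position is zero exactly when the fake and real distributions don't overlap — the regime where early training lives. W1 is linear in d, which is the "useful gradients" claim above made concrete.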
Impact: WGAN + gradient penalty made GAN training dramatically more stable. It reduced mode collapse and made the loss curves meaningful — lower critic loss actually correlates with better image quality.
Check: Why is Wasserstein distance better than JS divergence for GANs?

Chapter 6: StyleGAN

StyleGAN revolutionized image generation with a style-based generator: instead of feeding noise straight into the synthesis network, it first maps z to an intermediate style vector that is injected at every layer. Two supporting techniques round it out: style mixing (inject different style vectors at different layers to control coarse vs fine features) and progressive growing (inherited from ProGAN: start training at low resolution and gradually increase).

The generator architecture uses a mapping network (8-layer MLP) to transform z into a style vector w, which is then injected into each layer via adaptive instance normalization (AdaIN). Different layers control different scales: early layers = pose, face shape; later layers = hair color, fine texture.

Noise z
512-dim random vector
↓ Mapping Network (8 MLP layers)
Style w
512-dim disentangled style
↓ Inject via AdaIN at each layer
Synthesis Network
4×4 → 8×8 → ... → 1024×1024
Style Mixing Simulator

Two style vectors: Style A and Style B. The crossover point determines which layers use A vs B. Early layers = structure, late layers = details.

Why a mapping network? The raw z space is entangled: moving in one direction changes multiple features. The mapping network learns a disentangled w space where each direction corresponds to one semantic attribute (smile, age, glasses).
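The crossover logic in the style-mixing interactive is simple enough to sketch directly. This is a toy sketch: `mix_styles`, the 8-layer depth, and the "style A/B" framing are illustrative stand-ins for StyleGAN's actual per-layer style injection:

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS = 8  # toy synthesis-network depth, chosen for illustration

def mix_styles(w_a, w_b, crossover):
    """Per-layer styles: layers below the crossover use A (coarse structure),
    layers at or above it use B (fine details)."""
    return [w_a if layer < crossover else w_b for layer in range(N_LAYERS)]

w_a = rng.normal(size=512)  # style vector from source image A
w_b = rng.normal(size=512)  # style vector from source image B

styles = mix_styles(w_a, w_b, crossover=4)
# First 4 layers carry A's pose/shape; last 4 carry B's texture/color.
print(sum(s is w_a for s in styles), sum(s is w_b for s in styles))  # 4 4
```

Sliding the crossover toward the late layers keeps more of A's identity and borrows only B's fine texture; sliding it early does the reverse.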
Check: In StyleGAN, early layers of the synthesis network control...

Chapter 7: Conditional GANs

Vanilla GANs generate random samples. Conditional GANs generate samples conditioned on some input: a class label, a text prompt, or even another image. The condition is fed to both G and D.

pix2pix translates one image to another (sketch → photo, satellite → map). CycleGAN does unpaired translation (horses ↔ zebras) using a cycle consistency loss. The PatchGAN discriminator classifies each N×N patch as real or fake, rather than the whole image.

Condition c
Class label, text, source image, ...
G(z, c)
Generate sample matching condition
D(x, c)
Real sample matching condition? Or fake?
PatchGAN Discriminator

Instead of one real/fake score for the whole image, PatchGAN gives a score per patch. Green = real, red = fake. This preserves high-frequency details.

Model | Task | Key trick
pix2pix | Paired image translation | L1 loss + PatchGAN
CycleGAN | Unpaired image translation | Cycle consistency loss
SPADE | Semantic map → photo | Spatially-adaptive normalization
GauGAN | Landscape painting tool | SPADE + style encoder
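The PatchGAN idea — a score map instead of a single scalar — comes down to the output shape. A toy sketch where each patch's "score" is just its mean (a stand-in for the real convolutional stack, which I'm not reproducing here):

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_scores(image, patch=4):
    """Toy PatchGAN head: one real/fake score per non-overlapping NxN patch.
    Each score here is simply the patch mean, standing in for a conv stack."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    grid = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return grid.mean(axis=(1, 3))  # shape (gh, gw): a score map, not a scalar

img = rng.random((16, 16))
scores = patch_scores(img, patch=4)
print(scores.shape)  # (4, 4): 16 local decisions instead of one global one
```

Because every patch is judged separately, a blurry corner can't hide behind a sharp center — which is why the per-patch design preserves high-frequency detail.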
CycleGAN's insight: If you translate horse → zebra → horse, you should get back the original horse. This "cycle consistency" constraint makes unpaired translation possible.
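The cycle-consistency constraint is just a reconstruction error on the round trip. A toy sketch with hypothetical scalar "translators" — a perfect inverse pair vs a lossy one — in place of CycleGAN's actual image networks:

```python
import numpy as np

# Hypothetical toy translators: G maps domain X -> Y, F maps Y -> X.
G = lambda x: 2.0 * x + 1.0         # "horse -> zebra"
F_good = lambda y: (y - 1.0) / 2.0  # exact inverse: "zebra -> horse"
F_bad = lambda y: y / 2.0           # lossy: forgets the offset

def cycle_loss(x, F, G):
    """L_cyc = ||F(G(x)) - x||_1, averaged over the batch."""
    return float(np.mean(np.abs(F(G(x)) - x)))

x = np.linspace(-1, 1, 5)
print(cycle_loss(x, F_good, G), cycle_loss(x, F_bad, G))  # 0.0 vs 0.5
```

Minimizing this loss pressures G and F to be (approximate) inverses of each other, which is what pins the translation down even though no paired horse/zebra examples exist.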
Check: What makes a PatchGAN discriminator different from a regular one?

Chapter 8: GANs Today

For pure image quality and diversity, diffusion models have overtaken GANs. Models like Stable Diffusion, DALL-E 3, and Imagen produce higher-quality, more diverse outputs with more stable training. GANs had their era — roughly 2014-2021 — as the undisputed kings of generative modeling.

But GANs aren't dead. They remain dominant where speed matters: real-time face filters, game asset generation, super-resolution, and fast inference. A GAN generates an image in one forward pass; diffusion models need 20-50 steps.

GAN vs Diffusion: Tradeoff Space

Quality vs speed. GANs are fast but less diverse. Diffusion is slow but higher quality. Hybrid approaches try to get both.

Dimension | GANs | Diffusion
Speed | 1 forward pass (~50 ms) | 20-50 steps (~seconds)
Quality | Good, but mode collapse risk | Excellent diversity
Training | Unstable, requires tricks | Stable, simple loss
Control | Conditional, but limited | Text guidance, inpainting, etc.
Still used for | Real-time apps, super-resolution | Everything else
Distillation: A growing trend is using GANs to distill diffusion models: train a GAN to match the diffusion model's output in a single step. This gives GAN speed with diffusion quality.
"The most interesting idea in the last 10 years in ML."
— Yann LeCun, on GANs (2016)

You now understand the adversarial game, its dynamics, its failures, and how it was stabilized. GANs may no longer be the state of the art, but their ideas echo in every generative model.