Two neural networks locked in a game that conjures photorealistic images from pure noise. The idea that launched a thousand deepfakes.
Imagine a counterfeiter (the generator G) trying to produce fake banknotes, and a detective (the discriminator D) trying to spot them. The counterfeiter gets better by studying what the detective catches. The detective gets better by studying fakes that slip through.
Over time, both improve. Eventually, the counterfeiter's fakes are indistinguishable from real banknotes. That's the GAN idea: two networks competing until the generator produces data so realistic that no discriminator can tell the difference.
Watch the game unfold. Teal = real data distribution. Orange = generator's current output. Purple line = discriminator boundary. Click "Train Step" to advance.
The generator takes random noise z (sampled from a simple distribution like a Gaussian) and transforms it into a data sample (e.g., an image). It's a neural network mapping from noise space to data space: G(z) → fake image.
The discriminator takes a data sample (real or generated) and outputs a single number: the probability that the input is real. It's a classifier: D(x) → [0, 1].
The generator learns a function from noise (left, uniform dots) to data (right, structured distribution). Each orange dot is a noise sample mapped through G. Click "Retrain" to see a different mapping.
| Component | Input | Output | Goal |
|---|---|---|---|
| Generator G | Random noise z | Fake data G(z) | Fool D into saying "real" |
| Discriminator D | Data sample x | P(real) ∈ [0,1] | Correctly classify real vs fake |
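To make the table concrete, here is a minimal sketch of both networks as tiny NumPy MLPs. This is illustrative only: real GANs use deep convolutional networks and a framework with autograd, and every weight and size here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny generator: maps 2-D noise to 2-D "data".
G_W1 = rng.normal(size=(2, 16)) * 0.5
G_W2 = rng.normal(size=(16, 2)) * 0.5

def generator(z):
    """G(z): noise -> fake sample."""
    h = np.tanh(z @ G_W1)
    return h @ G_W2

# Hypothetical tiny discriminator: maps a 2-D sample to P(real).
D_W1 = rng.normal(size=(2, 16)) * 0.5
D_W2 = rng.normal(size=(16, 1)) * 0.5

def discriminator(x):
    """D(x): sample -> probability in [0, 1]."""
    h = np.tanh(x @ D_W1)
    logit = h @ D_W2
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid squashes to (0, 1)

z = rng.normal(size=(8, 2))     # batch of noise
fake = generator(z)             # G(z): batch of fake samples
p_real = discriminator(fake)    # D(G(z)): one probability per sample
```

Note the shapes: G maps noise space to data space, while D maps data space to a single scalar per sample.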
The GAN training objective is a minimax game over a value function V(D, G) = E_x[log D(x)] + E_z[log(1 − D(G(z)))]. The discriminator tries to maximize V(D, G), while the generator tries to minimize it. At the Nash equilibrium, the generator produces perfect fakes and the discriminator outputs 0.5 for everything ("I can't tell").
Breaking this down: D wants to maximize log D(x) (output 1 for real data) AND maximize log(1 − D(G(z))) (output 0 for fakes). G wants to minimize log(1 − D(G(z))) (make D output 1 for fakes).
D(G(z)) is the discriminator's output on fake data. See how the loss changes for each player. The teal curve is D's loss; the orange curve is G's loss.
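In code, both players' losses fall straight out of the objective. A sketch in NumPy, treating the discriminator outputs as plain arrays; `g_loss_nonsaturating` is the standard practical variant (maximize log D(G(z)) instead of minimizing log(1 − D(G(z)))) that keeps G's gradient alive when D is winning:

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    """D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))

def g_loss_minimax(d_fake, eps=1e-8):
    """G minimizes log(1 - D(G(z))): the original minimax objective."""
    return np.mean(np.log(1 - d_fake + eps))

def g_loss_nonsaturating(d_fake, eps=1e-8):
    """G maximizes log D(G(z)) instead: same fixed point, but strong
    gradients early in training when D easily rejects the fakes."""
    return -np.mean(np.log(d_fake + eps))

# When D confidently rejects the fakes (D(G(z)) near 0), the minimax loss
# sits near its floor and backpropagates tiny gradients through D's sigmoid;
# the non-saturating loss is large and steep, which is exactly the point.
d_fake_early = np.array([0.01, 0.02])
print(g_loss_minimax(d_fake_early), g_loss_nonsaturating(d_fake_early))
```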
GANs are trained by alternating updates. One step for D, one step for G, repeat. On each D step, we show it a batch of real data (label: 1) and a batch of fake data from G (label: 0). On each G step, we generate fakes and update G to make D's output on them closer to 1.
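The alternating scheme can be sketched as a skeleton loop. The `d_step`/`g_step` bodies are elided (each would run one real optimizer step); the point is the structure, including the common option of several D steps per G step:

```python
import numpy as np

rng = np.random.default_rng(0)
log = []   # records the order of updates, just to show the alternation

def d_step(real_batch, noise_batch):
    # Real batch gets label 1, fakes made from the noise get label 0;
    # D takes one gradient step on the classification loss (elided).
    log.append("D")

def g_step(noise_batch):
    # Generate fakes, push D's output on them toward 1;
    # G takes one gradient step (elided).
    log.append("G")

def train(n_iters, d_steps_per_g=1):
    for _ in range(n_iters):
        for _ in range(d_steps_per_g):        # often >1 D step per G step
            real = rng.normal(size=(32, 2))   # stand-in real data
            noise = rng.normal(size=(32, 2))
            d_step(real, noise)
        g_step(rng.normal(size=(32, 2)))

train(3)
# log is now ["D", "G", "D", "G", "D", "G"]
```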
This works — sometimes. GAN training is notoriously unstable. If D gets too good too fast, G gets no useful gradient signal. If G gets too good, D can't learn. The balance is delicate.
Simulated D loss (teal) and G loss (orange). Healthy training: both losses oscillate around a stable value. Watch for divergence (D wins) or collapse (G wins too easily).
The generator's worst failure mode: it finds one thing that fools the discriminator and repeats it forever. If the real data has many modes (faces with glasses, without glasses, smiling, frowning), the generator might only learn to produce one type of face. This is mode collapse.
Why does it happen? The generator optimizes to fool D, not to cover all modes. If one particular output consistently gets high D scores, G concentrates there. D eventually catches on, but G just jumps to a new single mode.
Teal clusters = real data modes. Orange dots = generator outputs. In healthy training, orange covers all clusters. In collapse, it concentrates on one.
| Symptom | What you see |
|---|---|
| Full collapse | All generated samples look identical |
| Partial collapse | Some categories of data are never generated |
| Mode hopping | Generator cycles between modes without covering all at once |
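On toy data, mode coverage is easy to measure directly: assign each generated sample to its nearest real mode and count how many modes are hit. A sketch in NumPy, where the "collapsed generator" is simulated as a fixed mode plus noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four real modes at the corners of a square.
modes = np.array([[2., 2.], [2., -2.], [-2., 2.], [-2., -2.]])

def covered_modes(samples, modes, radius=1.0):
    """Number of modes with at least one sample within `radius`."""
    # distances has shape (n_samples, n_modes)
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    return int(np.sum(d.min(axis=0) < radius))

# Healthy generator: samples scattered around every mode.
healthy = modes[rng.integers(0, 4, size=200)] + 0.1 * rng.normal(size=(200, 2))

# Collapsed generator: everything lands on one mode.
collapsed = modes[0] + 0.1 * rng.normal(size=(200, 2))

print(covered_modes(healthy, modes))    # 4: all modes covered
print(covered_modes(collapsed, modes))  # 1: classic mode collapse
```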
The original GAN loss has a fundamental problem: when D is perfect, the gradients for G vanish. The Wasserstein GAN (WGAN) fixes this by replacing the JS divergence with the Wasserstein distance (Earth Mover's Distance) — a metric that provides useful gradients even when distributions don't overlap.
The WGAN critic (no longer a "discriminator") outputs an unbounded score, not a probability. It must satisfy a Lipschitz constraint: its output can't change too fast as the input changes.
Two 1D distributions. Move them apart. Wasserstein provides a smooth gradient everywhere. JS divergence saturates when distributions don't overlap.
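You can verify this numerically. Shift two 1D Gaussians apart: the histogram-estimated JS divergence plateaus at log 2 once they stop overlapping, while the Wasserstein-1 distance keeps growing with the shift, so it always tells the generator which way to move. A NumPy sketch (for equal-size 1D sample sets, W1 is just the mean gap between sorted samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def js_divergence(p_samples, q_samples, bins):
    """Histogram estimate of the JS divergence between two 1-D sample sets."""
    p, _ = np.histogram(p_samples, bins=bins)
    q, _ = np.histogram(q_samples, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0            # wherever a > 0, m > 0 as well
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_distance(p_samples, q_samples):
    """Wasserstein-1 between equal-size 1-D sets: match sorted samples."""
    return np.mean(np.abs(np.sort(p_samples) - np.sort(q_samples)))

real = rng.normal(0.0, 1.0, size=5000)
bins = np.linspace(-5, 30, 200)
for shift in [0.0, 5.0, 10.0, 20.0]:
    fake = real + shift
    print(shift, js_divergence(real, fake, bins), w1_distance(real, fake))
# JS approaches log(2) ~= 0.693 as the overlap vanishes, then stays flat;
# W1 keeps growing linearly with the shift.
```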
| Technique | How | Purpose |
|---|---|---|
| Weight clipping | Clamp D weights to [-c, c] | Enforce Lipschitz (crude) |
| Gradient penalty (GP) | Penalize (||∇D|| − 1)² at interpolated samples | Better Lipschitz enforcement |
| Spectral normalization | Normalize weight matrices by spectral norm | Control Lipschitz constant per-layer |
| R1 penalty | Penalize ||∇D(x_real)||² | Stabilize D on real data |
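Of these, spectral normalization is the easiest to show in a few lines: estimate a weight matrix's largest singular value by power iteration and divide it out, capping that layer's Lipschitz constant at 1. A NumPy sketch (real implementations carry the power-iteration vector across training steps instead of re-running it):

```python
import numpy as np

def spectral_normalize(W, n_iters=50):
    """Divide W by its largest singular value, estimated by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v            # estimate of the top singular value
    return W / sigma

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.svd(W_sn, compute_uv=False)[0])   # ~1.0: spectral norm capped
```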
StyleGAN revolutionized image generation with a style-based generator: a learned style vector is injected at every layer, so different layers control coarse versus fine features and styles can be mixed between images. It also trained with progressive growing (start at low resolution and gradually increase), a technique inherited from its predecessor ProGAN.
The generator architecture uses a mapping network (8-layer MLP) to transform z into a style vector w, which is then injected into each layer via adaptive instance normalization (AdaIN). Different layers control different scales: early layers = pose, face shape; later layers = hair color, fine texture.
Two style vectors: Style A and Style B. The crossover point determines which layers use A vs B. Early layers = structure, late layers = details.
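The two mechanisms compose simply: AdaIN re-normalizes a layer's features and rescales them with statistics computed from the style vector, and style mixing just changes which style vector each layer reads. A toy NumPy sketch; the per-layer affine maps and the constant input are hypothetical stand-ins for StyleGAN's learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers, style_dim, feat_dim = 8, 4, 16
base = rng.normal(size=(1, feat_dim))   # stand-in for the learned constant input
# Hypothetical learned affine maps: style vector -> (scale, shift) per layer.
affines = [rng.normal(size=(style_dim, 2)) for _ in range(n_layers)]

def adain(x, scale, shift, eps=1e-5):
    """Adaptive instance norm: normalize the features, then apply style stats."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return scale * (x - mu) / (sigma + eps) + shift

def generate(w_per_layer):
    """Toy synthesis stack: layer i reads the style vector w_per_layer[i]."""
    x = base.copy()
    for w, affine in zip(w_per_layer, affines):
        scale, shift = w @ affine        # style -> this layer's statistics
        x = adain(np.tanh(x), scale, shift)
    return x

style_a = rng.normal(size=style_dim)
style_b = rng.normal(size=style_dim)
crossover = 4   # layers 0-3 take Style A (structure), 4-7 take Style B (details)
mixed = [style_a] * crossover + [style_b] * (n_layers - crossover)
out = generate(mixed)
```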
Vanilla GANs generate random samples. Conditional GANs generate samples conditioned on some input: a class label, a text prompt, or even another image. The condition is fed to both G and D.
pix2pix translates one image to another (sketch → photo, satellite → map). CycleGAN does unpaired translation (horses ↔ zebras) using a cycle consistency loss. The PatchGAN discriminator classifies each N×N patch as real or fake, rather than the whole image.
Instead of one real/fake score for the whole image, PatchGAN gives a score per patch. Green = real, red = fake. This preserves high-frequency details.
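The mechanics reduce to producing a score map instead of a scalar. A toy NumPy sketch: split the image into non-overlapping patches and score each with the same tiny function (a hypothetical linear stand-in for PatchGAN's small conv net):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a small conv net: one linear score per patch.
patch_w = rng.normal(size=(8 * 8,)) * 0.1

def patch_scores(img, patch=8):
    """Score every non-overlapping patch x patch region: one P(real) each."""
    h, w = img.shape
    scores = np.empty((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            p = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch].ravel()
            scores[i, j] = 1.0 / (1.0 + np.exp(-(p @ patch_w)))  # sigmoid
    return scores

img = rng.normal(size=(32, 32))   # stand-in for a 32x32 grayscale image
s = patch_scores(img)
print(s.shape)                    # (4, 4): a grid of per-patch scores
loss = -np.mean(np.log(s + 1e-8)) # D's loss on a real image averages patches
```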
| Model | Task | Key trick |
|---|---|---|
| pix2pix | Paired image translation | L1 loss + PatchGAN |
| CycleGAN | Unpaired image translation | Cycle consistency loss |
| SPADE | Semantic → photo | Spatially-adaptive normalization |
| GauGAN | Landscape painting tool | SPADE + style encoder |
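Cycle consistency itself is one line: translate to the other domain and back, then penalize the round-trip error. A NumPy sketch with trivial linear stand-ins for the two generators G: X → Y and F: Y → X:

```python
import numpy as np

def cycle_loss(x, G, F):
    """L1 penalty on the round trip x -> G(x) -> F(G(x))."""
    return np.mean(np.abs(F(G(x)) - x))

# Toy generators: if F exactly inverts G, the cycle loss is zero.
G = lambda x: 2.0 * x + 1.0
F_good = lambda y: (y - 1.0) / 2.0
F_bad = lambda y: y

x = np.linspace(-1, 1, 5)
print(cycle_loss(x, G, F_good))   # 0.0: perfect round trip
print(cycle_loss(x, G, F_bad))    # 1.0: inconsistent pair is penalized
```

In CycleGAN this term is added (in both directions) to the usual adversarial losses, which is what keeps unpaired translation from inventing arbitrary mappings.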
For pure image quality and diversity, diffusion models have overtaken GANs. Models like Stable Diffusion, DALL-E 3, and Imagen produce higher-quality, more diverse outputs with more stable training. GANs had their era — roughly 2014-2021 — as the undisputed kings of generative modeling.
But GANs aren't dead. They remain dominant where speed matters: real-time face filters, game asset generation, super-resolution, and fast inference. A GAN generates an image in one forward pass; diffusion models need 20-50 steps.
Quality vs speed. GANs are fast but less diverse. Diffusion is slow but higher quality. Hybrid approaches try to get both.
| Dimension | GANs | Diffusion |
|---|---|---|
| Speed | 1 forward pass (~50ms) | 20-50 steps (~seconds) |
| Quality | Good but mode collapse risk | Excellent diversity |
| Training | Unstable, requires tricks | Stable, simple loss |
| Control | Conditional, but limited | Text guidance, inpainting, etc. |
| Still used for | Real-time apps, super-resolution | Everything else |
You now understand the adversarial game, its dynamics, its failures, and how it was stabilized. GANs may no longer be the state of the art, but their ideas echo in every generative model.