Two neural networks locked in a game that conjures photorealistic images from pure noise. The idea that launched a thousand deepfakes.
Imagine a counterfeiter (the generator G) trying to produce fake banknotes, and a detective (the discriminator D) trying to spot them. The counterfeiter gets better by studying what the detective catches. The detective gets better by studying fakes that slip through.
Over time, both improve. Eventually, the counterfeiter's fakes are indistinguishable from real banknotes. That's the GAN idea: two networks competing until the generator produces data so realistic that no discriminator can tell the difference.
Interactive figure: the adversarial game over successive training steps. Teal = real data distribution. Orange = generator's current output. Purple line = discriminator boundary.
The generator takes random noise z (sampled from a simple distribution like a Gaussian) and transforms it into a data sample (e.g., an image). It's a neural network mapping from noise space to data space: G(z) → fake image.
The discriminator takes a data sample (real or generated) and outputs a single number: the probability that the input is real. It's a binary classifier: D(x) → [0, 1]. Architecturally, it's typically a CNN that downsamples the image through strided convolutions until a final dense layer produces the scalar output.
Interactive figure: the generator learns a function from noise (left, uniform dots) to data (right, structured distribution). Each orange dot is a noise sample mapped through G; retraining yields a different mapping.
| Component | Input | Output | Goal |
|---|---|---|---|
| Generator G | Random noise z | Fake data G(z) | Fool D into saying "real" |
| Discriminator D | Data sample x | P(real) ∈ [0,1] | Correctly classify real vs fake |
Let's trace the exact tensor shapes through a DCGAN generating 64×64 RGB images:
Generator: [B, 128] sampled from N(0, I) → Dense(128, 4×4×512) → reshape to [B, 512, 4, 4] → TransConv(512→256) → [B, 256, 8, 8] → TransConv(256→128) → [B, 128, 16, 16] → TransConv(128→64) → [B, 64, 32, 32] → TransConv(64→3) → [B, 3, 64, 64] → tanh → pixels in [−1, 1]. Each transposed convolution doubles spatial resolution: 4→8→16→32→64.

Discriminator: [B, 3, 64, 64] → Conv(3→64) → [B, 64, 32, 32] → Conv(64→128) → [B, 128, 16, 16] → Conv(128→256) → [B, 256, 8, 8] → Conv(256→512) → [B, 512, 4, 4] → flatten → Dense → [B, 1] → sigmoid → real/fake probability. The discriminator is the mirror image of the generator.

Layer recipe: each hidden layer in the generator uses TransposedConv2d (stride=2, padding=1, kernel=4) → BatchNorm → ReLU. Each hidden layer in the discriminator uses Conv2d (stride=2, padding=1, kernel=4) → BatchNorm → LeakyReLU(0.2). In the original DCGAN recipe, BatchNorm is omitted on the generator's output layer and the discriminator's input layer. The final generator layer uses tanh (output range [−1, 1]), and the final discriminator layer uses sigmoid (output range [0, 1]). Real images are normalized to [−1, 1] to match.
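The doubling and halving of resolution follows directly from the kernel=4, stride=2, padding=1 choice. A minimal sketch of the size arithmetic, using the standard output-size formulas for convolutions and transposed convolutions (the helper names `conv_out`, `tconv_out`, and `trace` are made up for this illustration):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def trace(start, n_layers, step):
    """Apply the same layer n_layers times and record every spatial size."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(step(sizes[-1]))
    return sizes

gen_sizes = trace(4, 4, tconv_out)    # generator: 4x4 up to 64x64
disc_sizes = trace(64, 4, conv_out)   # discriminator: 64x64 down to 4x4
print(gen_sizes, disc_sizes)
```

With these settings the transposed convolution gives (n − 1)·2 − 2 + 4 = 2n, which is why each stage exactly doubles the resolution.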
The GAN training objective is a minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The discriminator tries to maximize V(D,G), the generator tries to minimize it. At the Nash equilibrium, the generator produces perfect fakes and the discriminator outputs 0.5 for everything ("I can't tell").
Breaking this down: D wants to maximize log D(x) (output 1 for real data) AND maximize log(1 − D(G(z))) (output 0 for fakes). G wants to minimize log(1 − D(G(z))) (make D output 1 for fakes).
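To make the two terms concrete, here is a small sketch that estimates V(D, G) from a handful of discriminator outputs. The numbers are invented for illustration and `value_fn` is a hypothetical helper, not part of any library:

```python
import math

def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from D's outputs on real and fake batches."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# D is confident: high scores on real data, low scores on fakes.
confident_d = value_fn(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])
# D is fooled: it outputs 0.5 for everything.
fooled_d = value_fn(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
print(confident_d, fooled_d)
```

A confident discriminator pushes V toward 0 from below, while a fully fooled one sits at log(0.5) + log(0.5) = −log 4.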
D(G(z)) is the discriminator's output on fake data. See how the loss changes for each player. The teal curve is D's loss; the orange curve is G's loss.
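Early in training, D easily rejects all fakes: D(G(z)) ≈ 0. In that regime log(1 − D(G(z))) saturates: measured at D's pre-sigmoid logit, its gradient is −D(G(z)) ≈ 0, so G gets almost no learning signal. The standard fix is the non-saturating loss: instead of minimizing log(1 − D(G(z))), G minimizes −log(D(G(z))). Its gradient at the logit is −(1 − D(G(z))) ≈ −1 when D(G(z)) ≈ 0, so G gets a strong push to improve exactly when its fakes are worst. Same equilibrium, much better gradient landscape.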
The non-saturating loss is the default in virtually every modern GAN implementation. It's such a universal trick that many papers don't even mention it — it's just assumed.
For a fixed generator G, the discriminator D is trained to maximize $V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$.
Your task: Find the optimal D*(x) by taking the derivative of the integrand with respect to D(x) at each point x, setting it to zero, and solving. Show that D*(x) = pdata(x) / (pdata(x) + pg(x)).
Full derivation:
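Rewrite the second expectation over generated samples x = G(z), which have density pg, so both terms integrate over x:

$V(D,G) = \int \left[ p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx$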
For fixed G, maximize over D pointwise. At each x, we maximize f(y) = a·log(y) + b·log(1-y) where a = pdata(x), b = pg(x), y = D(x).
f'(y) = a/y − b/(1-y) = 0
a(1-y) = by → a − ay = by → a = y(a+b) → y = a/(a+b)
Therefore: D*(x) = pdata(x) / (pdata(x) + pg(x))
The key insight: The optimal discriminator is literally computing a density ratio. It outputs the probability that a sample came from the real distribution vs the mixture of real+fake. At equilibrium (pg = pdata), D* = 1/2 everywhere — it truly cannot distinguish real from fake.
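The closed form can be sanity-checked numerically. The sketch below brute-forces the maximizer of f(y) = a·log(y) + b·log(1 − y) on a grid and compares it with a/(a + b); the density values are arbitrary examples and the function names are made up for this illustration:

```python
import math

def pointwise_objective(y, a, b):
    """f(y) = a*log(y) + b*log(1 - y), the integrand of V(D, G) at one point x."""
    return a * math.log(y) + b * math.log(1.0 - y)

def argmax_on_grid(a, b, steps=9999):
    """Brute-force the maximizing y in (0, 1) on a uniform grid."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: pointwise_objective(y, a, b))

# (a, b) play the roles of p_data(x) and p_g(x) at a single point x.
for a, b in [(0.8, 0.2), (0.3, 0.9), (0.5, 0.5)]:
    print(a, b, argmax_on_grid(a, b), a / (a + b))
```

The last pair is the equilibrium case: equal densities give D* = 1/2.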
Both losses have the same Nash equilibrium (pg = pdata, D* = 1/2). But their gradient landscapes differ dramatically early in training:
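Saturating loss log(1 − D(G(z))): the derivative with respect to D's output is ∂/∂D(G(z)) [log(1 − D(G(z)))] = −1/(1 − D(G(z))) ≈ −1 when D(G(z)) ≈ 0, which is bounded. Passing back through D's sigmoid, whose slope D(1 − D) is close to 0 there, leaves a gradient of −D(G(z)) ≈ 0 at the logit. G barely learns.
Non-saturating loss −log(D(G(z))): the derivative with respect to D's output is ∂/∂D(G(z)) [−log(D(G(z)))] = −1/D(G(z)), which grows without bound as D(G(z)) → 0. That growth cancels the sigmoid's vanishing slope, leaving a gradient of −(1 − D(G(z))) ≈ −1 at the logit. G keeps a strong, usable push to improve.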
Same destination, vastly different journey. The non-saturating loss is preferred because it gives G useful gradients when it needs them most (early training when fakes are bad).
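A quick numerical comparison, assuming D ends in a sigmoid so that D(G(z)) = sigmoid(logit). The two gradient functions below are the closed-form derivatives with respect to the logit, and the function names are made up for this sketch:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the loss G minimizes in the original game."""
    return -sigmoid(logit)

def grad_non_saturating(logit):
    """d/d(logit) of -log(sigmoid(logit)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(logit))

# Very negative logits correspond to D confidently rejecting the fake.
for logit in (-8.0, -4.0, 0.0, 4.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  saturating={grad_saturating(logit):+.4f}  "
          f"non-saturating={grad_non_saturating(logit):+.4f}")
```

When D confidently rejects the fake, the saturating gradient is close to zero while the non-saturating one stays near −1. The situation reverses once G starts winning, which is when a weaker signal matters less.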
GANs are trained by alternating updates. One step for D, one step for G, repeat. On each D step, we show it a batch of real data (label: 1) and a batch of fake data from G (label: 0). On each G step, we generate fakes and update G to make D's output on them closer to 1.
This works — sometimes. GAN training is notoriously unstable. If D gets too good too fast, G gets no useful gradient signal. If G gets too good, D can't learn. The balance is delicate.
Simulated D loss (teal) and G loss (orange). Healthy training: both losses oscillate around a stable value. Watch for divergence (D wins) or collapse (G wins too easily).
One iteration looks like this: (1) sample a real batch from your dataset, (2) sample z and generate a fake batch through G, (3) train D on both (real label=1, fake label=0) with binary cross-entropy, (4) sample new z, generate fakes, and train G to make D output 1. Two separate Adam optimizers — one for D, one for G — with learning rate balance being critical (typically 2e-4 for both, β1=0.5).
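```python
# One training iteration. G, D, opt_D, opt_G, B, and data_iter = iter(dataloader)
# are defined elsewhere. D ends in a sigmoid here, so plain binary cross-entropy applies.
import torch
import torch.nn.functional as F

real_batch = next(data_iter)           # [B, 3, 64, 64]
z = torch.randn(B, 128)                # noise vector
fake_batch = G(z)                      # [B, 3, 64, 64]
ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)

# Train D: maximize log(D(real)) + log(1-D(fake))
opt_D.zero_grad()
loss_D = (F.binary_cross_entropy(D(real_batch), ones)
          + F.binary_cross_entropy(D(fake_batch.detach()), zeros))
loss_D.backward()
opt_D.step()

# Train G: maximize log(D(G(z)))
opt_G.zero_grad()
z2 = torch.randn(B, 128)
loss_G = F.binary_cross_entropy(D(G(z2)), ones)   # fool D
loss_G.backward()
opt_G.step()
```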
```python
import torch
import torch.nn as nn

def train_step(G, D, opt_G, opt_D, real_batch, z_dim=128):
    # D must return raw logits of shape [B, 1] (no final sigmoid):
    # BCEWithLogitsLoss applies the sigmoid internally.
    B = real_batch.size(0)
    device = real_batch.device
    ones = torch.ones(B, 1, device=device)
    zeros = torch.zeros(B, 1, device=device)
    criterion = nn.BCEWithLogitsLoss()

    # ---- Train D ----
    opt_D.zero_grad()
    # Real
    d_real = D(real_batch)
    loss_real = criterion(d_real, ones)
    # Fake (detach! don't update G)
    z = torch.randn(B, z_dim, device=device)
    fake = G(z).detach()
    d_fake = D(fake)
    loss_fake = criterion(d_fake, zeros)
    d_loss = loss_real + loss_fake
    d_loss.backward()
    opt_D.step()

    # ---- Train G ----
    opt_G.zero_grad()
    z2 = torch.randn(B, z_dim, device=device)
    fake2 = G(z2)  # fresh fakes, no detach
    d_fake2 = D(fake2)
    g_loss = criterion(d_fake2, ones)  # non-saturating: fool D
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
```
The deep pattern: when you can't write an explicit loss function for "good behavior," create an adversary whose job is to find flaws. The protagonist learns by fixing what the adversary exploits. Self-play follows the same minimax pattern as GANs; the difference is whether the two players share parameters (self-play) or are separate networks (GANs).
Where else do you see adversarial training? Think about adversarial examples in robustness research, discriminators in domain adaptation, and critics in actor-critic RL.
The generator's worst failure mode: it finds one thing that fools the discriminator and repeats it forever. If the real data has many modes (faces with glasses, without glasses, smiling, frowning), the generator might only learn to produce one type of face. This is mode collapse.
Why does it happen? The generator optimizes to fool D, not to cover all modes. If one particular output consistently gets high D scores, G concentrates there. D eventually catches on, but G just jumps to a new single mode — creating a cycle called mode hopping.
Teal clusters = real data modes. Orange dots = generator outputs. In healthy training, orange covers all clusters. In collapse, it concentrates on one.
Detection: Generate 1000 samples and compute pairwise distances. If the mean pairwise distance is near zero, or far below the same statistic computed on real data, you have collapse. You can also track the Fréchet Inception Distance (FID), which is sensitive to both quality and diversity: collapse usually shows up as an FID that stalls or rises even though individual samples still look sharp.
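A minimal sketch of the pairwise-distance check on toy 2D data. The cluster centers, noise levels, and sample counts are invented for illustration; for images you would more likely measure distances in a feature space than in pixel space:

```python
import math
import random

def mean_pairwise_distance(samples):
    """Average Euclidean distance over all unordered pairs of samples."""
    total, count = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            total += math.dist(samples[i], samples[j])
            count += 1
    return total / count

rng = random.Random(0)
modes = [(-2.0, 0.0), (2.0, 0.0), (0.0, 2.0)]   # hypothetical real-data cluster centers

def sample_near(center, noise):
    return (center[0] + rng.gauss(0, noise), center[1] + rng.gauss(0, noise))

healthy = [sample_near(rng.choice(modes), 0.1) for _ in range(200)]   # covers all modes
collapsed = [sample_near(modes[0], 0.001) for _ in range(200)]        # stuck on one mode

print(mean_pairwise_distance(healthy), mean_pairwise_distance(collapsed))
```

What counts as "near zero" is a judgment call. Comparing against the same statistic on a batch of real data gives a more meaningful baseline than a fixed threshold.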
Fixes: Spectral normalization constrains D's Lipschitz constant per-layer, preventing it from becoming too sharp. Progressive growing starts training at 4×4 resolution and slowly adds layers, giving G time to learn at each scale. Minibatch discrimination gives D statistics about the whole batch, so it can detect when all samples look identical.
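To illustrate what spectral normalization computes, the sketch below estimates a weight matrix's largest singular value by power iteration and divides the matrix by it. This is a pure-Python toy on a 2×2 matrix with made-up helper names. In PyTorch, `torch.nn.utils.spectral_norm` wraps a layer and maintains a similar estimate during training.

```python
import math

def matvec(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    u = normalize([1.0] * len(W))
    for _ in range(n_iters):
        v = normalize(matvec(Wt, u))
        u = normalize(matvec(W, v))
    # sigma = u^T W v once u and v have converged to the top singular vectors
    return sum(u_i * wv_i for u_i, wv_i in zip(u, matvec(W, v)))

W = [[3.0, 0.0], [0.0, 1.0]]                      # toy weight matrix
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]    # rescaled so the top singular value is 1
print(sigma, spectral_norm(W_sn))
```

After the rescaling, the linear map cannot stretch any input by more than a factor of 1, which bounds that layer's contribution to D's Lipschitz constant.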
| Symptom | What you see |
|---|---|
| Full collapse | All generated samples look identical |
| Partial collapse | Some categories of data are never generated |
| Mode hopping | Generator cycles between modes without covering all at once |
The original GAN loss, when the optimal D* is substituted back, equals 2·JSD(pdata || pg) − log 4. The JS divergence is symmetric and bounded: 0 ≤ JSD ≤ log 2.
Your task: Show that a generator covering only ONE mode of a multi-modal pdata can achieve a low JSD, and explain why this makes mode collapse a local minimum of the GAN objective.
The argument:
Consider pdata = 0.5·δ(x−a) + 0.5·δ(x−b) (two modes at a and b).
Case 1: pg = δ(x−a) (covers one mode perfectly). Then M = 0.75·δ(x−a) + 0.25·δ(x−b). KL(pdata||M) = 0.5·log(0.5/0.75) + 0.5·log(0.5/0.25) = 0.5·(−0.405 + 0.693) = 0.144. KL(pg||M) = log(1/0.75) = 0.288. JSD = (0.144 + 0.288)/2 = 0.216.
Case 2: pg = 0.5·δ(x−a) + 0.5·δ(x−b) (covers both perfectly). JSD = 0.
So the global optimum (Case 2) has JSD=0, but Case 1 (mode collapse) has JSD=0.216, which is a local minimum. The problem: to go from Case 1 to Case 2, G must move probability mass through regions of zero pdata, where the gradient signal from D is adversarial (D says "that's fake!" in empty regions). The loss landscape has a valley around each mode.
The key insight: JSD doesn't penalize mode collapse as harshly as you'd expect because it averages two KL terms. Covering one mode perfectly (half the job) gets you to JSD = 0.216 out of max 0.693 — already 69% of the way to optimal. The last 31% requires crossing adversarial territory, making mode collapse a stable attractor.
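The numbers in this argument can be reproduced with a few lines of code. The sketch treats the two modes a and b as discrete outcomes, plus a third outcome c standing for a region with no real data. The third outcome is added here only to show the maximum value, and the helper names are hypothetical:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

#              a    b    c
p_data      = [0.5, 0.5, 0.0]
collapsed   = [1.0, 0.0, 0.0]   # Case 1: all mass on mode a
covering    = [0.5, 0.5, 0.0]   # Case 2: both modes covered
off_support = [0.0, 0.0, 1.0]   # no overlap with the real data at all

for name, p_g in [("collapsed", collapsed), ("covering", covering), ("off_support", off_support)]:
    d = jsd(p_data, p_g)
    print(f"{name:12s} JSD={d:.4f}  V at optimal D={2 * d - math.log(4):.4f}")
```

The collapsed generator lands at about 0.216, well below the maximum of log 2 ≈ 0.693 that a generator with no overlap would score.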
The original GAN loss has a fundamental problem: when D is perfect, the gradients for G vanish. The Wasserstein GAN (WGAN) fixes this by replacing the JS divergence with the Wasserstein distance (Earth Mover's Distance) — a metric that provides useful gradients even when distributions don't overlap.
The WGAN critic (no longer called a "discriminator" since it doesn't classify) outputs an unbounded score, not a probability. No sigmoid on the final layer. The score difference between real and fake data estimates the Wasserstein distance. But the critic must satisfy a Lipschitz constraint: its output can't change too fast as the input changes. Without this constraint, the critic could assign arbitrarily large scores, making the optimization meaningless.
Interactive figure: two 1D distributions at an adjustable separation. Wasserstein provides a smooth gradient everywhere. JS divergence saturates when distributions don't overlap.
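The contrast is easy to reproduce on toy data. The sketch below shifts one 1D sample away from another and reports both quantities. It assumes equal-size samples, so W1 reduces to matching sorted points, and it estimates JSD from histograms; the bin count and range are arbitrary choices for this illustration:

```python
import math

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D samples: match sorted points and average the gaps."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def jsd_histogram(xs, ys, lo, hi, bins):
    """JSD between histogram estimates of the two samples on [lo, hi]."""
    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]
    p, q = hist(xs), hist(ys)
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            total += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / mi)
    return total

real = [i / 100 for i in range(100)]          # samples spread over [0, 1)
for shift in (2.0, 4.0, 8.0):
    fake = [x + shift for x in real]          # same shape, moved away from the real data
    print(shift, wasserstein_1d(real, fake), jsd_histogram(real, fake, 0.0, 10.0, 50))
```

Once the supports are disjoint, JSD is stuck at log 2 however far apart the samples are, while W1 keeps tracking the shift. That growing value is what gives the generator a direction to move in.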
| Technique | How | Purpose |
|---|---|---|
| Weight clipping | Clamp D weights to [-c, c] | Enforce Lipschitz (crude) |
| Gradient penalty (GP) | Penalize ||∇D|| ≠ 1 | Better Lipschitz enforcement |
| Spectral normalization | Normalize weight matrices by spectral norm | Control Lipschitz constant per-layer |
| R1 penalty | Penalize ||∇D(xreal)||² | Stabilize D on real data |
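Weight clipping is crude: it pushes weights to the edges of [−c, c], wasting capacity. The gradient penalty (WGAN-GP) is much better. It samples random interpolations between real and fake data, then penalizes deviations from unit gradient norm:

$$\hat{x} = \epsilon\, x_{\text{real}} + (1 - \epsilon)\, x_{\text{fake}}, \quad \epsilon \sim U[0, 1], \qquad \mathrm{GP} = \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$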
This enforces the 1-Lipschitz constraint: D's output can change by at most 1 unit per unit change in input. The interpolation trick samples points between the real and fake distributions — exactly where the gradient matters most. Typical λ = 10. The total critic loss becomes: E[D(fake)] - E[D(real)] + λ * GP.
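A minimal PyTorch sketch of the penalty and the resulting critic loss. The tiny MLP critic and the flat 8-dimensional inputs are stand-ins chosen for brevity, and `gradient_penalty` is a hypothetical helper name; a real critic would be a CNN over images:

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    """WGAN-GP term: mean of (||grad_x critic(x_hat)||_2 - 1)^2 over interpolates x_hat."""
    batch = real.size(0)
    # One epsilon per sample, broadcast over all remaining dimensions.
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True keeps the penalty differentiable w.r.t. the critic's weights.
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# Toy critic on flat 8-dim inputs, purely for illustration.
torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(8, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
real, fake = torch.randn(4, 8), torch.randn(4, 8)

lam = 10.0
gp = gradient_penalty(critic, real, fake)
critic_loss = critic(fake).mean() - critic(real).mean() + lam * gp
critic_loss.backward()
print(gp.item(), critic_loss.item())
```

In a full training loop, `fake` would be `G(z).detach()` so that the critic update does not touch the generator.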
The Wasserstein-1 distance is: $W(P, Q) = \inf_{\gamma \in \Pi(P,Q)} \mathbb{E}_{(x,y) \sim \gamma}[\lVert x - y \rVert]$, where Π(P,Q) is the set of all joint distributions with marginals P and Q.
Your task: Using the Kantorovich-Rubinstein duality, show that $W(P, Q) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]$, and explain why the supremum must be taken over 1-Lipschitz functions (not arbitrary functions).
The Kantorovich-Rubinstein duality (sketch):
The primal problem: $W(P,Q) = \min_{\gamma} \iint \lVert x - y \rVert \, d\gamma(x,y)$ subject to γ having marginals P and Q.
This is a linear program (linear objective, linear constraints on marginals). By LP duality, the dual is: $W(P,Q) = \max_{f,g} \mathbb{E}_P[f(x)] + \mathbb{E}_Q[g(y)]$ subject to $f(x) + g(y) \le \lVert x - y \rVert$ for all x, y.
The constraint $f(x) + g(y) \le \lVert x - y \rVert$ implies (setting y = x): f(x) + g(x) ≤ 0 for all x, so g ≤ −f. For a given f, the best feasible choice is $g(y) = \inf_x [\lVert x - y \rVert - f(x)]$, and one can show that the optimum is attained with g = −f. The constraint then becomes $f(x) - f(y) \le \lVert x - y \rVert$, i.e., f is 1-Lipschitz.
Substituting g = −f: $W(P,Q) = \max_{\lVert f \rVert_L \le 1} \mathbb{E}_P[f(x)] - \mathbb{E}_Q[f(y)]$
The key insight: The WGAN critic approximates this optimal f using a neural network. The 1-Lipschitz constraint is enforced via weight clipping (crude), gradient penalty (better), or spectral normalization (best). Without the constraint, the critic could assign ±∞ scores and the optimization would be meaningless. The Lipschitz bound makes the critic's "opinions" proportional to actual distributional distance.
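A tiny 1D example shows why the Lipschitz bound matters. Here Q is P shifted right by 3, so the primal distance is 3. The test function f(x) = −x has slope 1 and reaches that value in the dual, while a steeper function overshoots it. The sample points and function names are made up for this sketch:

```python
def expectation(f, samples):
    return sum(f(x) for x in samples) / len(samples)

def dual_value(f, p_samples, q_samples):
    """E_P[f] - E_Q[f], the quantity the WGAN critic maximizes."""
    return expectation(f, p_samples) - expectation(f, q_samples)

def primal_w1(p_samples, q_samples):
    """W1 for equal-size 1D samples via sorted matching."""
    pairs = zip(sorted(p_samples), sorted(q_samples))
    return sum(abs(x - y) for x, y in pairs) / len(p_samples)

p = [0.0, 1.0, 2.0]
q = [x + 3.0 for x in p]                           # Q is P shifted right by 3

w1 = primal_w1(p, q)                               # transport cost of the shift
lipschitz_1 = dual_value(lambda x: -x, p, q)       # slope 1: matches W1
lipschitz_5 = dual_value(lambda x: -5 * x, p, q)   # slope 5: overshoots W1
print(w1, lipschitz_1, lipschitz_5)
```

Scaling f by any factor c scales the dual value by c, so without the Lipschitz bound the supremum would be infinite and would say nothing about how far apart P and Q are.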
StyleGAN (Karras et al., 2019) revolutionized image generation with two key ideas: style mixing (feed style vectors derived from different latent codes into different layers to control coarse vs fine features) and progressive growing, inherited from the earlier Progressive GAN (start training at low resolution and gradually increase). The result: photorealistic 1024×1024 faces that fooled humans ~50% of the time.
The generator architecture uses a mapping network (8-layer MLP) to transform z into a style vector w, which is then injected into each layer via adaptive instance normalization (AdaIN). Different layers control different scales: early layers = pose, face shape; later layers = hair color, fine texture.
Two style vectors: Style A and Style B. The crossover point determines which layers use A vs B. Early layers = structure, late layers = details.
In StyleGAN2 (1024×1024 faces): z is 512-dim, the mapping network is 8 fully-connected layers (512→512 each with LeakyReLU), producing w (also 512-dim). The synthesis network has 18 layers (2 per resolution from 4×4 to 1024×1024). Style w is injected at every layer via weight demodulation (StyleGAN2 replaced AdaIN). Total parameters: roughly 2M for the mapping network (eight 512×512 weight matrices) and roughly 30M for the generator as a whole, so the synthesis network holds the large majority of the weights.
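The layer and parameter counts for the parts described above are simple arithmetic. A short sketch, counting only the fully connected weights and biases of the mapping network (`fc_params` is a hypothetical helper):

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Mapping network: 8 layers, 512 -> 512 each.
mapping_params = 8 * fc_params(512, 512)

# Synthesis network: two layers per resolution, doubling from 4x4 to 1024x1024.
resolutions = []
res = 4
while res <= 1024:
    resolutions.append(res)
    res *= 2
synthesis_layers = 2 * len(resolutions)

print(mapping_params, resolutions, synthesis_layers)
```

Eight 512→512 layers come to about 2.1 million parameters, and nine resolutions with two layers each give the 18 synthesis layers.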
The answer is StyleGAN2/3:
Architecture: Skip progressive growing (it causes phase artifacts). Use StyleGAN2's direct training with skip connections and residual discriminator. 8-layer mapping network (z→w), weight demodulation instead of AdaIN, path length regularization for smooth latent space.
Discriminator: Multi-scale residual CNN with R1 regularization (λ=10, applied every 16 steps for efficiency). R1 is simpler than GP and works better in practice: just penalize ||∇D(xreal)||². No spectral norm needed with R1.
Attribute control: The disentangled W space gives this for free. After training, find linear directions in W space for each attribute (age, pose, smile) using labeled probes. No conditional training needed — the mapping network naturally disentangles attributes.
Stability: Exponential moving average of G weights (for inference), lazy regularization (R1 every 16 steps, path length penalty every 4 steps), learning rate = 2.5e-3 for G, 2.5e-3 for D with equalized learning rate. Monitor FID every 5K steps, track PPL (perceptual path length) for smoothness. On 8 A100s with batch size 32: ~7 days to convergence at 25M images seen.
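Two of these stability tricks are easy to sketch without any framework: the exponential moving average of generator weights, and the lazy schedule that runs each penalty only on some steps. The decay value, the single scalar "weight", and the function names below are placeholders for illustration, not StyleGAN2's actual settings:

```python
def ema_update(ema_weights, live_weights, decay=0.999):
    """Blend the live generator weights into the slow-moving inference copy."""
    for name, value in live_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value

def regularizers_due(step, r1_interval=16, pl_interval=4):
    """Lazy regularization: report which penalty terms run on this step."""
    return {"r1": step % r1_interval == 0, "path_length": step % pl_interval == 0}

live = {"w": 0.0}          # stand-in for one generator parameter
ema = dict(live)           # inference copy starts equal to the live weights
r1_steps = 0
for step in range(1, 65):
    live["w"] = 1.0        # pretend the optimizer moved the weight
    ema_update(ema, live, decay=0.9)
    r1_steps += regularizers_due(step)["r1"]

print(ema["w"], r1_steps)
```

The EMA copy trails the live weights and smooths out step-to-step jitter, which is why it is the copy used for inference. When a penalty runs only every k steps, implementations commonly scale it up to compensate; that detail is omitted here.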
Vanilla GANs generate random samples. Conditional GANs generate samples conditioned on some input: a class label, a text prompt, or even another image. The condition is fed to both G and D.
pix2pix translates one image to another (sketch → photo, satellite → map). CycleGAN does unpaired translation (horses ↔ zebras) using a cycle consistency loss. The PatchGAN discriminator classifies each N×N patch as real or fake, rather than the whole image.
Instead of one real/fake score for the whole image, PatchGAN gives a score per patch. Green = real, red = fake. This preserves high-frequency details.
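How large is each patch, and how many scores come out? The sketch below assumes the commonly used five-layer PatchGAN stack (4×4 kernels, three stride-2 layers followed by two stride-1 layers, padding 1). That configuration is an assumption of this example and is not stated above:

```python
# (kernel, stride, padding) per conv layer; assumed 70x70 PatchGAN-style stack
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def receptive_field(layers):
    """Input patch size seen by one output score, computed back from the output."""
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

def output_size(size, layers):
    """Side length of the score map for a square input of the given size."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size

print(receptive_field(LAYERS), output_size(256, LAYERS))
```

Under these assumptions each score judges a 70×70 patch, and a 256×256 input yields a 30×30 grid of scores that are averaged into the loss.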
| Model | Task | Key trick |
|---|---|---|
| pix2pix | Paired image translation | L1 loss + PatchGAN |
| CycleGAN | Unpaired image translation | Cycle consistency loss |
| SPADE | Semantic → photo | Spatially-adaptive normalization |
| GauGAN | Landscape painting tool | SPADE + style encoder |
In a class-conditional GAN, the condition (e.g., "cat" = class 5) is embedded into a vector and concatenated with z before entering G. In D, the class embedding is concatenated with the feature vector before the final classification. For image-conditional GANs like pix2pix, the source image is literally concatenated channel-wise with the input: D receives a 6-channel input (3 from the source, 3 from the generated/real target).
The projection discriminator (Miyato & Koyama, 2018) is more elegant: instead of concatenation, it takes the inner product between the class embedding and D's feature vector. This gives a per-class real/fake score and trains more stably than concatenation, especially with many classes.
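A minimal sketch of the projection idea as a final scoring head. The feature dimension, class count, and module name `ProjectionHead` are arbitrary choices for this illustration, and `features` stands in for the pooled output of D's convolutional trunk:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Final stage of a projection discriminator: score = psi(h) + <embed(y), h>."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.unconditional = nn.Linear(feat_dim, 1)            # class-independent score
        self.class_embed = nn.Embedding(num_classes, feat_dim)

    def forward(self, features, labels):
        base = self.unconditional(features)                    # [B, 1]
        # Inner product between each sample's class embedding and its features.
        proj = (self.class_embed(labels) * features).sum(dim=1, keepdim=True)
        return base + proj                                     # [B, 1] logits

torch.manual_seed(0)
head = ProjectionHead(feat_dim=32, num_classes=10)
features = torch.randn(4, 32)            # stand-in for D's pooled feature vector
labels = torch.tensor([5, 0, 5, 9])      # e.g. class 5 = "cat"
scores = head(features, labels)
print(scores.shape)
```

Compared with concatenation, the class only enters through one inner product, so the class-dependent part of the score stays a simple linear function of the features.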
For pure image quality and diversity, diffusion models have overtaken GANs. Models like Stable Diffusion, DALL-E 3, and Imagen produce higher-quality, more diverse outputs with more stable training. GANs had their era — roughly 2014-2021 — as the undisputed kings of generative modeling.
But GANs aren't dead. They remain dominant where speed matters: real-time face filters (Snapchat, TikTok), game asset generation, super-resolution (Real-ESRGAN), video prediction, and fast inference. A GAN generates an image in one forward pass; diffusion models need 20–50 denoising steps. For applications requiring <100ms latency, GANs are still the only game in town.
Quality vs speed. GANs are fast but less diverse. Diffusion is slow but higher quality. Hybrid approaches try to get both.
| Dimension | GANs | Diffusion |
|---|---|---|
| Speed | 1 forward pass (~50ms) | 20-50 steps (~seconds) |
| Quality | Good but mode collapse risk | Excellent diversity |
| Training | Unstable, requires tricks | Stable, simple loss |
| Control | Conditional, but limited | Text guidance, inpainting, etc. |
| Still used for | Real-time apps, super-resolution | Everything else |
A StyleGAN2 generator produces a 1024×1024 image in ~50ms on a single GPU. Stable Diffusion takes ~3 seconds (50 steps at ~60ms each) for a 512×512 image. That's a 60x speed gap. For real-time applications (face filters at 30fps, game asset generation, live video), GANs remain the only viable option. Diffusion models need consistency distillation or adversarial distillation to close this gap.
GANs, diffusion models, and VAEs all start from noise and produce images. The difference is the mapping: GANs learn it in one shot (fast but unstable), diffusion iterates (slow but stable), VAEs encode/decode (fast but blurry). Modern approaches combine them: diffusion backbone with GAN-based distillation for speed, or VAE encoder with GAN discriminator for sharpness.
The adversarial loss appears in many non-GAN models (e.g., alongside a perceptual loss in super-resolution, or as the discriminator in VAE-GAN). Why does adding a discriminator to any generative model tend to sharpen outputs?
You now understand the adversarial game, its dynamics, its failures, and how it was stabilized. GANs may no longer be the state of the art, but their ideas echo in every generative model.