What Is It?
A Generative Adversarial Network pits two neural networks against each other in a minimax game. The Generator (G) takes random noise and tries to produce data that looks real. The Discriminator (D) receives both real data and G's fakes, and tries to classify each as real or fake.
Training is adversarial: G minimizes D's accuracy while D maximizes it. At the theoretical equilibrium, G's distribution matches the real data distribution and D outputs 0.5 for every sample, unable to tell the difference — though in practice training rarely settles exactly at this point.
Architecture
The canonical GAN architecture: noise z feeds into the Generator, which outputs a fake sample. Both real samples and fakes pass through the Discriminator, which outputs a real/fake score. The adversarial loop connects them.
Core Mechanisms
Min-Max Game
The value function is V(D, G) = E_x~pdata[log D(x)] + E_z~pz[log(1 − D(G(z)))]. D maximizes V; G minimizes it. For a fixed G, the optimal discriminator is D*(x) = pdata(x) / (pdata(x) + pg(x)).
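The optimal-discriminator formula can be checked numerically. A minimal sketch, assuming a 1D toy setup where the real data is N(0, 1) and the generator currently produces N(2, 1) (both densities chosen here for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data(x) / (p_data(x) + p_g(x))

# Real data ~ N(0, 1); generator currently produces N(2, 1).
p_data = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_g = lambda x: gaussian_pdf(x, 2.0, 1.0)

# Halfway between the two modes the densities are equal, so D* = 0.5.
print(optimal_discriminator(1.0, p_data, p_g))   # 0.5
# Near the real mode D* approaches 1; near the fake mode it approaches 0.
print(optimal_discriminator(-2.0, p_data, p_g))
```

When pg = pdata everywhere, D*(x) = 0.5 for every x — the equilibrium described above.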
Mode Collapse
G learns to produce only a few modes of the data distribution, ignoring diversity. The generator finds "safe" outputs that reliably fool D rather than covering the full data manifold.
Training Instability
If D becomes too strong, gradients vanish for G. If G outpaces D, the signal is noisy. Balancing the two is the central engineering challenge of GAN training.
Wasserstein Distance (WGAN)
Replace JS divergence with Earth Mover's distance. Provides meaningful gradients even when distributions don't overlap. Uses weight clipping or gradient penalty to enforce the Lipschitz constraint.
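The advantage of Earth Mover's distance is easiest to see in 1D, where it reduces to the mean absolute difference of sorted samples. A hedged sketch (the `wasserstein_1d` helper is illustrative, not from any WGAN codebase), plus the original WGAN's weight-clipping step:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical Earth Mover's distance between two equal-size 1D samples.
    In 1D it reduces to the mean absolute difference of sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)
fake_far = rng.normal(5.0, 1.0, 10_000)   # nearly disjoint support
fake_near = rng.normal(0.5, 1.0, 10_000)

# Unlike JS divergence, which saturates at log 2 once supports are disjoint,
# W1 keeps growing with the gap, so its gradient stays informative.
print(wasserstein_1d(real, fake_far))    # ~5.0
print(wasserstein_1d(real, fake_near))   # ~0.5

# Original WGAN enforces the Lipschitz constraint by clipping critic weights
# into a small box after each update (c = 0.01 in the paper):
W = rng.normal(size=(4, 4))
W_clipped = np.clip(W, -0.01, 0.01)
```

Gradient penalty (WGAN-GP) replaced clipping in later work because hard clipping biases the critic toward overly simple functions.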
Key Architectures
StyleGAN
Mapping network transforms z into style vectors w. Adaptive instance normalization injects style at each layer. Progressive growing builds resolution incrementally. Enables fine-grained control via style mixing.
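Adaptive instance normalization is simple enough to state directly: normalize each channel to zero mean and unit variance, then rescale and shift it with style-derived parameters. A minimal sketch, assuming the per-channel `style_scale` and `style_shift` come from an affine transform of w (names here are illustrative):

```python
import numpy as np

def adain(x, style_scale, style_shift, eps=1e-5):
    """Adaptive instance normalization on one feature map.
    x: (C, H, W) activations; style_scale/style_shift: (C,) vectors
    derived from the style vector w."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mu) / (sigma + eps)
    return style_scale[:, None, None] * normalized + style_shift[:, None, None]

rng = np.random.default_rng(1)
feat = rng.normal(3.0, 2.0, size=(8, 16, 16))   # arbitrary layer activations
scale = np.full(8, 0.5)
shift = np.full(8, 1.0)

out = adain(feat, scale, shift)
# Per-channel statistics now match the injected style: mean ~1.0, std ~0.5.
print(out.mean(axis=(1, 2))[:2], out.std(axis=(1, 2))[:2])
```

Style mixing works by feeding different w vectors (hence different scale/shift pairs) to different layers.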
pix2pix
Conditional GAN for paired image-to-image translation. Uses a U-Net generator with skip connections and a PatchGAN discriminator that judges overlapping patches for local texture realism.
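The generator objective in pix2pix combines the adversarial term with an L1 reconstruction term weighted by λ = 100 (the paper's value). A hedged sketch with stand-in arrays (the helper name and the stub discriminator scores are illustrative):

```python
import numpy as np

def pix2pix_generator_loss(d_scores_fake, fake, target, lam=100.0):
    """G's objective in pix2pix: fool the PatchGAN (adversarial term)
    while staying close to the paired target (L1 term).
    d_scores_fake: discriminator probabilities on fake patches, in (0, 1)."""
    adv = -np.mean(np.log(d_scores_fake + 1e-8))   # non-saturating GAN loss
    l1 = np.mean(np.abs(fake - target))            # pixel reconstruction
    return adv + lam * l1

rng = np.random.default_rng(2)
target = rng.uniform(size=(3, 32, 32))                      # paired ground truth
fake = target + rng.normal(0, 0.05, size=target.shape)      # G's current output
d_scores = np.full((1, 4, 4), 0.4)   # stub PatchGAN score map on the fake

print(pix2pix_generator_loss(d_scores, fake, target))
```

The large λ means the L1 term dominates early training, with the adversarial term mainly sharpening high-frequency detail.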
CycleGAN
No paired data needed. Two generators (G: A→B, F: B→A) with cycle consistency loss: F(G(x)) ≈ x. Enables horse→zebra, summer→winter, and photo→painting transformations.
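The cycle consistency term is an L1 penalty on the round trip, weighted by λ = 10 in the paper. A toy sketch with 1D stand-ins for the two generators (the shift functions are illustrative, not real image translators):

```python
import numpy as np

def cycle_consistency_loss(x, F_of_G_of_x, lam=10.0):
    """L1 penalty keeping F(G(x)) close to x (lam=10 as in CycleGAN)."""
    return lam * np.mean(np.abs(F_of_G_of_x - x))

# Toy generators on 1D "images": G shifts into domain B, F shifts back.
G = lambda x: x + 1.0
F = lambda y: y - 1.0

x = np.linspace(0.0, 1.0, 5)
print(cycle_consistency_loss(x, F(G(x))))   # 0.0: the cycle is exact
```

In the full model this term is applied in both directions (A→B→A and B→A→B) alongside the two adversarial losses.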
PatchGAN Discriminator
Instead of producing a single real/fake score, it classifies each N×N patch independently. This captures local high-frequency texture, uses fewer parameters, and — being fully convolutional — works on arbitrary image sizes.
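A real PatchGAN does the patch scoring implicitly: it is a fully convolutional network whose receptive field is N×N, so its output is a grid of per-patch scores. The explicit loop below is an illustrative sketch of that behavior, with a stub `score_fn` standing in for the network:

```python
import numpy as np

def patch_scores(image, patch=16, stride=16, score_fn=None):
    """Score every patch x patch window independently; return the score map."""
    if score_fn is None:
        score_fn = lambda p: float(p.mean())   # illustrative stub scorer
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = score_fn(image[i * stride:i * stride + patch,
                                       j * stride:j * stride + patch])
    return out

img = np.ones((64, 64))
scores = patch_scores(img)
print(scores.shape)        # (4, 4): one score per 16x16 patch
loss = scores.mean()       # the final loss averages over all patches
```

Because each output unit sees only a local window, the discriminator is forced to judge texture rather than global layout.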
Stabilization Tricks
Training GANs is notoriously fragile. Over the years the community has developed a toolkit of regularization and architectural tricks that make convergence far more reliable.
- Spectral Normalization Constrains the Lipschitz constant of each layer by normalizing weights by their largest singular value. Simple, cheap, and effective for D.
- R1 Gradient Penalty Penalizes the squared gradient norm of D on real data: R1 = (gamma/2) E[||grad D(x)||^2]. Prevents D from creating sharp decision boundaries.
- Progressive Growing Start training at low resolution (4x4) and progressively add layers for higher resolutions. Each phase stabilizes before adding complexity. Core to ProGAN and early StyleGAN.
- EMA of Generator Maintain an exponential moving average of G's weights for inference. Smooths out training oscillations and consistently produces higher-quality outputs than the raw training checkpoint.
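Spectral normalization from the list above is compact enough to sketch in full. In practice the largest singular value is estimated with power iteration — typically one iteration per training step, reusing the vector u across steps; here more iterations are run so the one-shot estimate converges:

```python
import numpy as np

def spectral_normalize(W, n_iters=200):
    """Divide W by a power-iteration estimate of its largest singular value,
    constraining the layer's Lipschitz constant to ~1."""
    u = np.random.default_rng(3).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v    # Rayleigh-quotient estimate of the top singular value
    return W / sigma

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 32))    # a stand-in discriminator weight matrix
W_sn = spectral_normalize(W)
# Largest singular value of the normalized matrix is ~1.
print(np.linalg.svd(W_sn, compute_uv=False)[0])
```

Frameworks ship this as a layer wrapper (e.g. PyTorch's `torch.nn.utils.spectral_norm`), so it rarely needs to be hand-rolled.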
Where GANs Still Win
Diffusion models have largely superseded GANs for general image generation, but GANs retain clear advantages in several domains:
- Real-time super-resolution — Single forward pass means sub-millisecond upscaling. ESRGAN and Real-ESRGAN remain the standard for live video and gaming.
- Video synthesis — GANs retain an edge in real-time face reenactment, neural avatars, and low-latency video style transfer.
- Interactive image editing — StyleGAN's disentangled latent space allows meaningful edits (age, expression, lighting) in real time via latent navigation.
- Data augmentation — Fast GAN variants generate training data for downstream classifiers, especially in medical imaging where real data is scarce.
Training / Inference
Training
- Alternating updates: Typically 1 D step per 1 G step (some methods use 5:1 D:G ratio).
- Learning rates: D and G often use different LRs (the two time-scale update rule, TTUR). Common defaults: 2e-4 with Adam (beta1=0.5) in DCGAN-style setups; StyleGAN2 uses beta1=0, beta2=0.99.
- Batch size: StyleGAN2 uses batch 32–64. Larger batches help discriminator stability.
- Data: FFHQ (70K faces), LSUN (millions of scenes), ImageNet. Quality over quantity.
- Duration: StyleGAN2 at 1024x1024: ~8 days on 8 V100s (~25M images shown).
- Key losses: Non-saturating GAN loss + R1 penalty + path length regularization (StyleGAN2).
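The alternation described above has a fixed shape regardless of architecture. The skeleton below is a structural sketch only — the comments mark where real network, optimizer, and data code would go — but it shows the D:G step ratio and which loss each phase optimizes:

```python
def train(num_iters, d_steps=1):
    """Run d_steps discriminator updates per generator update."""
    d_updates = g_updates = 0
    for _ in range(num_iters):
        for _ in range(d_steps):
            # 1) sample a real batch x and noise z
            # 2) forward D on x and on G(z)
            # 3) ascend  log D(x) + log(1 - D(G(z)))  wrt D's params only
            d_updates += 1
        # Non-saturating G step: descend  -log D(G(z))  wrt G's params only
        g_updates += 1
    return d_updates, g_updates

print(train(1000))               # (1000, 1000): the common 1:1 schedule
print(train(1000, d_steps=5))    # (5000, 1000): WGAN-style 5:1 ratio
```

The non-saturating G loss (-log D(G(z)) instead of log(1 − D(G(z)))) is the standard fix for vanishing G gradients when D is confident.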
Inference
- Single forward pass through G — no iterative denoising. Typical: <50ms per 1024x1024 image on a modern GPU.
- Latent interpolation: Smooth walks through z-space or w-space produce semantically meaningful transitions.
- Truncation trick: Scale w toward the mean to trade diversity for quality. Truncation psi = 0.7 is common for demos.
- Projection: Given a real image, optimize z (or w) to reconstruct it, enabling editing of real photos.
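The truncation trick and latent interpolation from the list above are both one-line operations on style vectors. A minimal sketch, assuming a 512-dimensional w-space and a precomputed mean style (`w_mean` is zeroed here purely for illustration):

```python
import numpy as np

def truncate(w, w_mean, psi=0.7):
    """Truncation trick: pull a style vector toward the mean style.
    psi=1 leaves w unchanged; psi=0 collapses every sample to w_mean."""
    return w_mean + psi * (w - w_mean)

def lerp(w0, w1, t):
    """Linear interpolation in w-space for latent walks."""
    return (1.0 - t) * w0 + t * w1

rng = np.random.default_rng(5)
w_mean = np.zeros(512)    # average style; in practice precomputed from many z
w = rng.normal(size=512)

w_trunc = truncate(w, w_mean, psi=0.7)
# Truncated styles sit closer to the mean: less diversity, fewer artifacts.
print(np.linalg.norm(w_trunc - w_mean) / np.linalg.norm(w - w_mean))   # 0.7

# A 5-step latent walk between two truncated styles:
w2 = rng.normal(size=512)
walk = [lerp(w_trunc, truncate(w2, w_mean), t) for t in np.linspace(0, 1, 5)]
```

Each point on the walk is decoded by a single forward pass through G, which is what makes real-time latent editing feasible.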
Model Zoo
| Model | Year | Key Innovation | Resolution | Use Case |
|---|---|---|---|---|
| StyleGAN3 | 2021 | Alias-free layers; continuous equivariance; no texture sticking | Up to 1024² | High-fidelity face/scene synthesis, latent editing |
| GigaGAN | 2023 | Scaled GAN to 1B+ params; text-conditioned; fast 512² generation in 0.13s | 512² – 4096² | Text-to-image, super-resolution, style mixing at scale |
| ESRGAN / Real-ESRGAN | 2018 / 2021 | RRDB architecture; perceptual + adversarial loss; blind SR with degradation model | 4x upscale | Photo/video upscaling, game asset enhancement, restoration |