What Is It?
A Generative Adversarial Network pits two neural networks against each other in a minimax game. The Generator (G) takes random noise and tries to produce data that looks real. The Discriminator (D) receives both real data and G's fakes, and tries to classify each as real or fake.
Training is adversarial: G minimizes D's accuracy while D maximizes it. At the theoretical equilibrium, G's distribution matches the real data distribution and D outputs 0.5 for every sample, unable to tell the difference — though in practice training rarely settles exactly at this point.
Architecture
The canonical GAN architecture: noise z feeds into the Generator, which outputs a fake sample. Both real samples and fakes pass through the Discriminator, which outputs a real/fake score. The adversarial loop connects them.
Core Mechanisms
Min-Max Game
The value function is V(D, G) = E_x~pdata[log D(x)] + E_z~pz[log(1 − D(G(z)))]. D maximizes V; G minimizes it. For a fixed G, the optimal discriminator is D*(x) = pdata(x) / (pdata(x) + pg(x)).
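The optimal-discriminator formula can be checked numerically. A minimal sketch, assuming a 1D toy setup where the real data is N(0, 1) and the generator currently produces N(2, 1) (both densities chosen here for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data(x) / (p_data(x) + p_g(x))

# Real data ~ N(0, 1); generator currently produces N(2, 1).
p_data = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_g = lambda x: gaussian_pdf(x, 2.0, 1.0)

# Halfway between the two modes the densities are equal, so D* = 0.5.
print(optimal_discriminator(1.0, p_data, p_g))   # 0.5
# Near the real mode D* approaches 1; near the fake mode it approaches 0.
print(optimal_discriminator(-2.0, p_data, p_g))
```

When pg = pdata everywhere, D*(x) = 0.5 for every x — the equilibrium described above.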
Mode Collapse
G learns to produce only a few modes of the data distribution, ignoring diversity. The generator finds "safe" outputs that reliably fool D rather than covering the full data manifold.
Training Instability
If D becomes too strong, gradients vanish for G. If G outpaces D, the signal is noisy. Balancing the two is the central engineering challenge of GAN training.
Wasserstein Distance (WGAN)
Replace JS divergence with Earth Mover's distance. Provides meaningful gradients even when distributions don't overlap. Uses weight clipping or gradient penalty to enforce the Lipschitz constraint.
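The advantage of Earth Mover's distance is easiest to see in 1D, where it reduces to the mean absolute difference of sorted samples. A hedged sketch (the `wasserstein_1d` helper is illustrative, not from any WGAN codebase), plus the original WGAN's weight-clipping step:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical Earth Mover's distance between two equal-size 1D samples.
    In 1D it reduces to the mean absolute difference of sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)
fake_far = rng.normal(5.0, 1.0, 10_000)   # nearly disjoint support
fake_near = rng.normal(0.5, 1.0, 10_000)

# Unlike JS divergence, which saturates at log 2 once supports are disjoint,
# W1 keeps growing with the gap, so its gradient stays informative.
print(wasserstein_1d(real, fake_far))    # ~5.0
print(wasserstein_1d(real, fake_near))   # ~0.5

# Original WGAN enforces the Lipschitz constraint by clipping critic weights
# into a small box after each update (c = 0.01 in the paper):
W = rng.normal(size=(4, 4))
W_clipped = np.clip(W, -0.01, 0.01)
```

Gradient penalty (WGAN-GP) replaced clipping in later work because hard clipping biases the critic toward overly simple functions.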
Key Architectures
StyleGAN
Mapping network transforms z into style vectors w. Adaptive instance normalization injects style at each layer. Progressive growing builds resolution incrementally. Enables fine-grained control via style mixing.
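Adaptive instance normalization is simple enough to state directly: normalize each channel to zero mean and unit variance, then rescale and shift it with style-derived parameters. A minimal sketch, assuming the per-channel `style_scale` and `style_shift` come from an affine transform of w (names here are illustrative):

```python
import numpy as np

def adain(x, style_scale, style_shift, eps=1e-5):
    """Adaptive instance normalization on one feature map.
    x: (C, H, W) activations; style_scale/style_shift: (C,) vectors
    derived from the style vector w."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mu) / (sigma + eps)
    return style_scale[:, None, None] * normalized + style_shift[:, None, None]

rng = np.random.default_rng(1)
feat = rng.normal(3.0, 2.0, size=(8, 16, 16))   # arbitrary layer activations
scale = np.full(8, 0.5)
shift = np.full(8, 1.0)

out = adain(feat, scale, shift)
# Per-channel statistics now match the injected style: mean ~1.0, std ~0.5.
print(out.mean(axis=(1, 2))[:2], out.std(axis=(1, 2))[:2])
```

Style mixing works by feeding different w vectors (hence different scale/shift pairs) to different layers.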
pix2pix
Conditional GAN for paired image-to-image translation. Uses a U-Net generator with skip connections and a PatchGAN discriminator that judges overlapping patches for local texture realism.
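The generator objective in pix2pix combines the adversarial term with an L1 reconstruction term weighted by λ = 100 (the paper's value). A hedged sketch with stand-in arrays (the helper name and the stub discriminator scores are illustrative):

```python
import numpy as np

def pix2pix_generator_loss(d_scores_fake, fake, target, lam=100.0):
    """G's objective in pix2pix: fool the PatchGAN (adversarial term)
    while staying close to the paired target (L1 term).
    d_scores_fake: discriminator probabilities on fake patches, in (0, 1)."""
    adv = -np.mean(np.log(d_scores_fake + 1e-8))   # non-saturating GAN loss
    l1 = np.mean(np.abs(fake - target))            # pixel reconstruction
    return adv + lam * l1

rng = np.random.default_rng(2)
target = rng.uniform(size=(3, 32, 32))                      # paired ground truth
fake = target + rng.normal(0, 0.05, size=target.shape)      # G's current output
d_scores = np.full((1, 4, 4), 0.4)   # stub PatchGAN score map on the fake

print(pix2pix_generator_loss(d_scores, fake, target))
```

The large λ means the L1 term dominates early training, with the adversarial term mainly sharpening high-frequency detail.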
CycleGAN
No paired data needed. Two generators (G: A→B, F: B→A) with cycle consistency loss: F(G(x)) ≈ x. Enables horse→zebra, summer→winter, and photo→painting transformations.
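The cycle consistency term is an L1 penalty on the round trip, weighted by λ = 10 in the paper. A toy sketch with 1D stand-ins for the two generators (the shift functions are illustrative, not real image translators):

```python
import numpy as np

def cycle_consistency_loss(x, F_of_G_of_x, lam=10.0):
    """L1 penalty keeping F(G(x)) close to x (lam=10 as in CycleGAN)."""
    return lam * np.mean(np.abs(F_of_G_of_x - x))

# Toy generators on 1D "images": G shifts into domain B, F shifts back.
G = lambda x: x + 1.0
F = lambda y: y - 1.0

x = np.linspace(0.0, 1.0, 5)
print(cycle_consistency_loss(x, F(G(x))))   # 0.0: the cycle is exact
```

In the full model this term is applied in both directions (A→B→A and B→A→B) alongside the two adversarial losses.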
PatchGAN Discriminator
Instead of producing a single real/fake score, it classifies each N×N patch independently. This captures local high-frequency texture, uses fewer parameters, and — being fully convolutional — works on arbitrary image sizes.
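A real PatchGAN does the patch scoring implicitly: it is a fully convolutional network whose receptive field is N×N, so its output is a grid of per-patch scores. The explicit loop below is an illustrative sketch of that behavior, with a stub `score_fn` standing in for the network:

```python
import numpy as np

def patch_scores(image, patch=16, stride=16, score_fn=None):
    """Score every patch x patch window independently; return the score map."""
    if score_fn is None:
        score_fn = lambda p: float(p.mean())   # illustrative stub scorer
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = score_fn(image[i * stride:i * stride + patch,
                                       j * stride:j * stride + patch])
    return out

img = np.ones((64, 64))
scores = patch_scores(img)
print(scores.shape)        # (4, 4): one score per 16x16 patch
loss = scores.mean()       # the final loss averages over all patches
```

Because each output unit sees only a local window, the discriminator is forced to judge texture rather than global layout.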
Stabilization Tricks
Training GANs is notoriously fragile. Over the years the community has developed a toolkit of regularization and architectural tricks that make convergence far more reliable.
- Spectral Normalization Constrains the Lipschitz constant of each layer by normalizing weights by their largest singular value. Simple, cheap, and effective for D.
- R1 Gradient Penalty Penalizes the squared gradient norm of D on real data: R1 = (gamma/2) E[||grad D(x)||^2]. Prevents D from creating sharp decision boundaries.
- Progressive Growing Start training at low resolution (4x4) and progressively add layers for higher resolutions. Each phase stabilizes before adding complexity. Core to ProGAN and early StyleGAN.
- EMA of Generator Maintain an exponential moving average of G's weights for inference. Smooths out training oscillations and consistently produces higher-quality outputs than the raw training checkpoint.
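Spectral normalization from the list above is compact enough to sketch in full. In practice the largest singular value is estimated with power iteration — typically one iteration per training step, reusing the vector u across steps; here more iterations are run so the one-shot estimate converges:

```python
import numpy as np

def spectral_normalize(W, n_iters=200):
    """Divide W by a power-iteration estimate of its largest singular value,
    constraining the layer's Lipschitz constant to ~1."""
    u = np.random.default_rng(3).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v    # Rayleigh-quotient estimate of the top singular value
    return W / sigma

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 32))    # a stand-in discriminator weight matrix
W_sn = spectral_normalize(W)
# Largest singular value of the normalized matrix is ~1.
print(np.linalg.svd(W_sn, compute_uv=False)[0])
```

Frameworks ship this as a layer wrapper (e.g. PyTorch's `torch.nn.utils.spectral_norm`), so it rarely needs to be hand-rolled.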
Where GANs Still Win
Diffusion models have largely superseded GANs for general image generation, but GANs retain clear advantages in several domains:
- Real-time super-resolution — Single forward pass means sub-millisecond upscaling. ESRGAN and Real-ESRGAN remain the standard for live video and gaming.
- Video synthesis — GANs retain an edge in real-time face reenactment, neural avatars, and low-latency video style transfer.
- Interactive image editing — StyleGAN's disentangled latent space allows meaningful edits (age, expression, lighting) in real time via latent navigation.
- Data augmentation — Fast GAN variants generate training data for downstream classifiers, especially in medical imaging where real data is scarce.
Training / Inference
Training
- Alternating updates: Typically 1 D step per 1 G step (some methods use 5:1 D:G ratio).
- Learning rates: D and G often use different LRs (the two time-scale update rule, TTUR). Common defaults: 2e-4 with Adam (beta1=0.5) in DCGAN-style setups; StyleGAN2 uses beta1=0, beta2=0.99.
- Batch size: StyleGAN2 uses batch 32–64. Larger batches help discriminator stability.
- Data: FFHQ (70K faces), LSUN (millions of scenes), ImageNet. Quality over quantity.
- Duration: StyleGAN2 at 1024x1024: ~8 days on 8 V100s (~25M images shown).
- Key losses: Non-saturating GAN loss + R1 penalty + path length regularization (StyleGAN2).
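The alternation described above has a fixed shape regardless of architecture. The skeleton below is a structural sketch only — the comments mark where real network, optimizer, and data code would go — but it shows the D:G step ratio and which loss each phase optimizes:

```python
def train(num_iters, d_steps=1):
    """Run d_steps discriminator updates per generator update."""
    d_updates = g_updates = 0
    for _ in range(num_iters):
        for _ in range(d_steps):
            # 1) sample a real batch x and noise z
            # 2) forward D on x and on G(z)
            # 3) ascend  log D(x) + log(1 - D(G(z)))  wrt D's params only
            d_updates += 1
        # Non-saturating G step: descend  -log D(G(z))  wrt G's params only
        g_updates += 1
    return d_updates, g_updates

print(train(1000))               # (1000, 1000): the common 1:1 schedule
print(train(1000, d_steps=5))    # (5000, 1000): WGAN-style 5:1 ratio
```

The non-saturating G loss (-log D(G(z)) instead of log(1 − D(G(z)))) is the standard fix for vanishing G gradients when D is confident.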
Inference
- Single forward pass through G — no iterative denoising. Typical: <50ms per 1024x1024 image on a modern GPU.
- Latent interpolation: Smooth walks through z-space or w-space produce semantically meaningful transitions.
- Truncation trick: Scale w toward the mean to trade diversity for quality. Truncation psi = 0.7 is common for demos.
- Projection: Given a real image, optimize z (or w) to reconstruct it, enabling editing of real photos.
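The truncation trick and latent interpolation from the list above are both one-line operations on style vectors. A minimal sketch, assuming a 512-dimensional w-space and a precomputed mean style (`w_mean` is zeroed here purely for illustration):

```python
import numpy as np

def truncate(w, w_mean, psi=0.7):
    """Truncation trick: pull a style vector toward the mean style.
    psi=1 leaves w unchanged; psi=0 collapses every sample to w_mean."""
    return w_mean + psi * (w - w_mean)

def lerp(w0, w1, t):
    """Linear interpolation in w-space for latent walks."""
    return (1.0 - t) * w0 + t * w1

rng = np.random.default_rng(5)
w_mean = np.zeros(512)    # average style; in practice precomputed from many z
w = rng.normal(size=512)

w_trunc = truncate(w, w_mean, psi=0.7)
# Truncated styles sit closer to the mean: less diversity, fewer artifacts.
print(np.linalg.norm(w_trunc - w_mean) / np.linalg.norm(w - w_mean))   # 0.7

# A 5-step latent walk between two truncated styles:
w2 = rng.normal(size=512)
walk = [lerp(w_trunc, truncate(w2, w_mean), t) for t in np.linspace(0, 1, 5)]
```

Each point on the walk is decoded by a single forward pass through G, which is what makes real-time latent editing feasible.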
Model Zoo
| Model | Year | Key Innovation | Resolution | Use Case |
|---|---|---|---|---|
| StyleGAN3 | 2021 | Alias-free layers; continuous equivariance; no texture sticking | Up to 1024² | High-fidelity face/scene synthesis, latent editing |
| GigaGAN | 2023 | Scaled GAN to 1B+ params; text-conditioned; fast 512² generation in 0.13s | 512² – 4096² | Text-to-image, super-resolution, style mixing at scale |
| ESRGAN / Real-ESRGAN | 2018 / 2021 | RRDB architecture; perceptual + adversarial loss; blind SR with degradation model | 4x upscale | Photo/video upscaling, game asset enhancement, restoration |