Lottery Ticket Hypothesis (Frankle 2019)

Chapter 0: Why Overparameterized?

You're building a neural network to classify images. A small network with 10,000 parameters can memorize the training data perfectly, so logically, 10,000 parameters should be enough. But in practice, you use 10 million parameters — 1000x more than theoretically necessary. Why?

Because large networks train better. A small network and a large network might both be capable of representing the same function, but the large network will find it during training while the small one gets stuck in bad local minima. It's like searching for a needle in a haystack: a wider search net catches more needles.

This creates a paradox:

The overparameterization paradox: We need large networks to train well. But after training, most of those parameters are redundant — you can prune 90% of the weights without losing accuracy. So the vast majority of parameters were only needed during training, not at inference. This is incredibly wasteful. Can we find the "important" subnetwork before training?

Standard pruning attacks this waste: train a large network, then remove (zero out) the smallest-magnitude weights. A network with 10M parameters can often be pruned to 1M parameters (90% sparsity) with minimal accuracy loss. This is useful for deployment — smaller models are faster and cheaper.

But pruning doesn't solve the training cost problem. You still need to train the full 10M-parameter network first, then prune it. The question Frankle and Carlin asked was more radical: can we identify which weights matter before (or early in) training, and only train those?

The Overparameterization Paradox

This network has many connections (weights), but most can be removed after training. Drag the sparsity slider to prune weights — the network's accuracy barely changes until extreme sparsity. The red connections are pruned; the green ones survive.

Sparsity 0%

Why do we train neural networks with far more parameters than theoretically necessary?

Because large networks train better — the extra parameters create a smoother loss landscape with more paths to good solutions, helping optimization avoid bad local minima. After training, most parameters turn out to be redundant Because we don't know how many parameters are needed in advance Because more parameters always give better accuracy

Chapter 1: The Hypothesis

The Lottery Ticket Hypothesis states:

The Lottery Ticket Hypothesis: A randomly-initialized, dense neural network contains a subnetwork (a "winning ticket") that — when trained in isolation from the same initialization — can reach test accuracy comparable to the original network in a comparable number of iterations.

Let's unpack this carefully. Imagine a lottery:

1. A dense network is a bag of lottery tickets — millions of possible subnetworks, each defined by a different subset of weights.

2. Most subnetworks are "losing tickets" — they can't train to full accuracy from their initialization.

3. A few subnetworks are "winning tickets" — they happened to get lucky initializations that let them train to full accuracy.

4. The dense network succeeds because it contains enough tickets that at least one wins.

The crucial detail: winning tickets aren't just any sparse subnetwork. They must be combined with their original random initialization. A winning ticket reinitialised with new random weights is just a random sparse network — and random sparse networks train poorly.

Formal statement

Given a dense network f(x; θ₀) with initial parameters θ₀ ~ D (some initialization distribution):

f(x; m ⊙ θ₀) can be trained to test accuracy ≥ f(x; θ₀)

Where m ∈ {0, 1}^|θ| is a binary mask (1 = keep, 0 = prune), and ⊙ is element-wise multiplication. The mask m produces a subnetwork with ||m||₀ / |θ| parameters (the sparsity level). The hypothesis says that masks exist where this ratio is very small (90-95% sparsity) yet accuracy is maintained.

python
import torch

# The Lottery Ticket framework
class LotteryTicket:
    def __init__(self, model):
        # Step 1: Save the original initialization
        self.theta_0 = {n: p.clone() for n, p in model.named_parameters()}
        # Step 2: Create binary mask (all ones initially)
        self.mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    def prune(self, model, percent=20):
        """Prune the smallest-magnitude weights."""
        for name, param in model.named_parameters():
            alive = self.mask[name].bool()
            # Among surviving weights, find the smallest
            values = param.abs()[alive]
            threshold = torch.quantile(values, percent / 100)
            # Zero out weights below threshold
            new_mask = (param.abs() >= threshold).float()
            self.mask[name] *= new_mask

    def reset_to_init(self, model):
        """Reset surviving weights to their original initialization."""
        for name, param in model.named_parameters():
            param.data = self.theta_0[name] * self.mask[name]

    def apply_mask(self, model):
        """Zero out pruned weights (call after each optimizer step)."""
        for name, param in model.named_parameters():
            param.data *= self.mask[name]

Winning vs Losing Tickets

Each small network is a potential "ticket." Click "Draw Ticket" to sample a random subnetwork and see if it can train to target accuracy. Winning tickets (green) have lucky initializations; losing tickets (red) get stuck. Notice that winning tickets are rare — you need many chances (a dense network) to find one.

0 drawn, 0 winners

What makes a "winning ticket" different from any random sparse subnetwork?

Winning tickets have more parameters Winning tickets must be combined with their ORIGINAL random initialization — the same sparse structure with new random weights performs much worse. The initialization is the "lucky" part Winning tickets use a different activation function

Chapter 2: Iterative Pruning

How do you find a winning ticket? You can't just pick a random sparse subnetwork — most are losing tickets. Frankle and Carlin's method is iterative magnitude pruning (IMP):

Step 1: Initialize

Randomly initialize the dense network. Save these initial weights θ₀.

↓

Step 2: Train

Train the network to convergence, reaching weights θ_T.

↓

Step 3: Prune

Remove the p% of weights with smallest magnitude in θ_T. This creates the mask m.

↓

Step 4: Reset

Reset surviving weights to their values from θ₀. Now you have f(x; m ⊙ θ₀).

↻ Repeat steps 2-4 with the pruned network

Each round prunes p% of the remaining weights (not the original total). After n rounds of p% pruning, the surviving fraction is (1-p/100)ⁿ.

Remaining after n rounds = (1 - p/100)ⁿ

python
# Iterative Magnitude Pruning
def find_winning_ticket(model_fn, train_fn, prune_pct=20, rounds=15):
    model = model_fn()                    # random init
    lt = LotteryTicket(model)              # save θ₀

    for r in range(rounds):
        train_fn(model)                    # train to convergence
        lt.prune(model, prune_pct)          # prune smallest 20%
        lt.reset_to_init(model)             # reset to θ₀

        sparsity = 1 - (1 - prune_pct/100)**(r+1)
        print(f"Round {r+1}: {sparsity*100:.1f}% pruned")
        # Round 1:  20.0% pruned
        # Round 5:  67.2% pruned
        # Round 10: 89.3% pruned
        # Round 15: 96.5% pruned

    return model, lt.mask  # the winning ticket

Why magnitude pruning?

The pruning criterion is simple: remove weights with the smallest absolute value after training. The intuition is that weights close to zero contribute little to the output — they've been driven toward zero by the training process itself, suggesting the network "wants" to remove them.

This is a retrospective method. You need to train the full network first to know which weights end up small. The lottery ticket isn't identified in advance — it's discovered after the fact. Finding winning tickets prospectively (before training) remains an open problem.

Why iterative (not one-shot)? Pruning 90% of weights at once is too aggressive — it removes some weights that were important but happened to have small magnitude. Iterative pruning is gentler: prune 20%, retrain, prune 20% of what remains, retrain... Each round gives the network a chance to redistribute importance among surviving weights, making the pruning decisions progressively more accurate.

Iterative Pruning Simulator

Watch iterative magnitude pruning in action. Each round trains the network (blue training curve), then prunes the smallest weights (red). Click "Next Round" to advance. The sparsity increases each round, but accuracy is maintained until extreme pruning.

Round 0 — 0% sparse

Why does iterative pruning (prune 20% each round over 15 rounds) find better winning tickets than one-shot pruning (prune 90% at once)?

Each round gives the network a chance to retrain and redistribute importance among surviving weights. Gradual pruning makes progressively better decisions about which weights to remove, avoiding the catastrophic loss of important weights that one-shot pruning causes Iterative pruning uses more compute and more compute is always better Iterative pruning changes the network architecture

Chapter 3: Why Initialization Matters

The most surprising finding in the paper is that the initialization is everything. The same sparse structure performs dramatically differently depending on which weights fill it.

Frankle and Carlin tested three conditions:

Condition	Structure	Weights	Performance
Winning ticket	Pruned mask	Original init θ₀	Matches dense
Random reinit	Same pruned mask	New random weights	Much worse
Random structure	Random mask (same sparsity)	Random weights	Even worse

The initialization is the "ticket." The pruned structure tells you WHERE the network's computation should flow. But the specific initial values tell the optimizer HOW to navigate the loss landscape. The original initialization places the weights in a "basin of attraction" that leads to a good minimum. New random values may place them in a basin that leads to a bad minimum — even though the structure is identical.

What makes a good initialization?

Why do some initializations "win"? The leading theory: winning initializations happen to place the subnetwork's weights near a favorable region of the loss landscape. Specifically:

1. Sign alignment: The initial signs of weights in winning tickets tend to be aligned with their final trained values. If a weight will end up positive, the winning ticket initializes it positive. This gives training a "head start."

2. Magnitude balance: Winning tickets have initial weights with magnitudes that are neither too large (causing instability) nor too small (causing vanishing signals) for the specific subnetwork structure.

3. Layer-wise connectivity: The pruned weights preserve enough connections in each layer to maintain information flow through the network. Random pruning might cut all connections in one layer, creating a bottleneck.

python
# Testing the importance of initialization
def ablation_study(model_fn, train_fn, mask):
    # Condition 1: Winning ticket (original init + mask)
    model1 = model_fn(seed=42)
    apply_mask(model1, mask)
    acc1 = train_fn(model1)
    print(f"Winning ticket: {acc1:.2f}%")  # ~95%

    # Condition 2: Same structure, new random weights
    model2 = model_fn(seed=999)  # different init
    apply_mask(model2, mask)       # same mask
    acc2 = train_fn(model2)
    print(f"Random reinit:  {acc2:.2f}%")  # ~85%

    # Condition 3: Random structure, random weights
    model3 = model_fn(seed=999)
    random_mask = random_mask_like(mask)  # same sparsity, random structure
    apply_mask(model3, random_mask)
    acc3 = train_fn(model3)
    print(f"Random struct:  {acc3:.2f}%")  # ~78%

Initialization Ablation

Watch three training curves: the winning ticket (original init + mask), the same mask with random reinit, and a random mask. All three have the same sparsity level (90%). Only the winning ticket matches the dense network's accuracy. Click "Train" to animate.

Click Train to start

If you take a winning ticket's sparse structure but reinitialize it with new random weights, what happens?

Performance drops significantly — the specific initial weight values are critical because they place the subnetwork in a favorable region of the loss landscape. The "winning" part is the combination of structure AND initialization, not structure alone Performance stays the same because the structure is what matters Performance improves because fresh random weights avoid overfitting

Chapter 4: Experimental Evidence

Frankle and Carlin validated the Lottery Ticket Hypothesis across multiple architectures and datasets. The evidence is remarkably consistent.

MNIST + small networks

Network	Full Params	Winning Ticket Sparsity	Full Acc	Ticket Acc
LeNet-300-100 (FC)	266K	96.5% (3.5% remaining)	98.3%	98.3%
Conv-2 (CNN)	37K	95.2%	99.1%	99.2%
Conv-4 (CNN)	63K	94.5%	99.2%	99.2%
Conv-6 (CNN)	95K	93.5%	99.2%	99.2%

On MNIST, winning tickets at 93-96% sparsity match or slightly exceed the full network's accuracy. A LeNet with only 3.5% of its parameters remaining (9,300 out of 266,000) achieves the same 98.3% accuracy as the full network.

CIFAR-10 + moderate networks

Network	Full Params	Max Sparsity (no loss)	Ticket Accuracy
ResNet-18	11.2M	~89%	93.5% (matches full)
VGG-16	14.7M	~93%	93.8% (matches full)

Key findings beyond accuracy

1. Winning tickets train faster. Not just to the same accuracy — they often reach the target accuracy in fewer iterations than the full network. The winning ticket's initialization is so favorable that it converges faster despite having fewer parameters.

2. Winning tickets generalize better. At moderate sparsity levels (60-80%), winning tickets achieve slightly higher test accuracy than the full network. The pruning acts as an implicit regularizer, preventing overfitting.

3. Winning tickets are not random subsets. Random subnetworks at the same sparsity level perform much worse. The structure and initialization together are crucial.

The "winning ticket" is really about finding the right subspace. The dense network explores a vast parameter space during training. The winning ticket identifies a small subspace where the loss landscape is particularly favorable — a subspace where gradient descent converges quickly to a good minimum. The original initialization places you at the entrance to this subspace.

Sparsity vs Accuracy Curves

Drag the slider to change sparsity level. The chart shows accuracy of winning tickets (green) vs random pruning (red) vs the full network (dashed). Winning tickets maintain accuracy far beyond what random pruning can handle.

Sparsity 50%

Convergence speed comparison

python
# Empirical result: winning tickets converge faster
# LeNet on MNIST at 90% sparsity

# Full network (266K params):
# Epoch  5: 96.2% → Epoch 10: 97.8% → Epoch 20: 98.3%

# Winning ticket (26.6K params, 90% sparse):
# Epoch  5: 97.1% → Epoch 10: 98.1% → Epoch 20: 98.3%
# ↑ Higher accuracy at epoch 5 despite 10x fewer params!

# Random reinit (26.6K params, 90% sparse):
# Epoch  5: 91.3% → Epoch 10: 93.5% → Epoch 20: 95.1%
# ↑ Never catches up — stuck in a worse basin

At 90% sparsity on MNIST, the winning ticket (26.6K params) reaches 97.1% accuracy by epoch 5, while the full network (266K params) only reaches 96.2%. Why?

The winning ticket's initialization is in such a favorable region of the loss landscape that gradient descent converges faster — the sparse network has a shorter path to a good minimum because the pruning has removed confounding parameters that slow training Smaller networks always train faster because they have fewer operations per step The pruning somehow gives the network more parameters

Chapter 5: Scaling & Limitations

The Lottery Ticket Hypothesis, as originally stated, has important limitations that subsequent work revealed.

The scaling problem

On larger networks (ResNets, Transformers) and harder datasets (ImageNet), the original LTH procedure fails. Winning tickets with the exact original initialization don't always match the full network's accuracy. Why?

Instability at scale: Large networks are sensitive to initialization perturbations. In a 100-layer ResNet, even small changes in initialization can shift the training trajectory dramatically. The "winning" initialization becomes a needle in a haystack that's almost impossible to find through magnitude pruning alone.

Frankle et al. (2020) proposed a fix: Late Rewinding. Instead of resetting to the initial weights θ₀, reset to the weights at some early training step k (e.g., k = 0.1% of total training):

f(x; m ⊙ θ_k) instead of f(x; m ⊙ θ₀)

This small change fixes the scaling problem. Rewinding to step k gives the network enough "warm-up" to stabilize its training trajectory before the winning ticket is identified.

Method	Reset Point	Works on Small CNNs?	Works on ResNets/ImageNet?
Original LTH	θ₀ (init)	Yes	No
Late Rewinding	θ_k (early step)	Yes	Yes
Fine-tuning	θ_T (final)	Yes	Yes (but not a ticket)

Computational cost

The elephant in the room: finding winning tickets is more expensive than just training the dense network. With 15 rounds of iterative pruning, you need 15 full training runs. The computational cost is:

Cost(IMP) = rounds × Cost(full training) = 15 × C

You spend 15x the compute to find a subnetwork that trains as well as the full network. The savings come at inference time (smaller, faster model), not at training time.

Modern applications

Despite the training cost, the LTH has had lasting impact:

1. Understanding neural networks: LTH provided evidence that good solutions live in surprisingly low-dimensional subspaces of the parameter space. This changed how researchers think about what networks "learn."

2. Structured pruning: Extensions to LTH prune entire neurons, channels, or attention heads (not individual weights), producing models that are actually faster on hardware (unstructured sparsity is hard to accelerate).

3. Transfer learning: Winning tickets found on one task often transfer to related tasks. The structure and initialization capture something general about the data domain.

Late Rewinding vs Original LTH

Compare the original LTH (reset to θ₀) with late rewinding (reset to θ_k). On small networks, both work. On large networks, only late rewinding maintains accuracy. Drag the slider to change the rewind point.

Rewind point (% of training) 0% (original init)

Why does the original Lottery Ticket Hypothesis fail on large models like ResNets on ImageNet?

Large networks are sensitive to initialization perturbations — resetting to the exact initial weights θ₀ creates instability because the training trajectory in deep networks is chaotic early on. Late rewinding (resetting to θ_k, a few steps into training) fixes this by using weights that have stabilized Large models have too many parameters to prune ImageNet has too many classes for sparse networks

Chapter 6: Lottery Ticket Explorer

Put the Lottery Ticket Hypothesis to the test. This explorer lets you run the full iterative magnitude pruning pipeline: initialize, train, prune, reset, and repeat.

Iterative Magnitude Pruning Simulator

Watch the full IMP pipeline. Each round trains the network, prunes the smallest weights, and resets to the original initialization. The network visualization shows surviving connections (green) and pruned connections (fading out). The accuracy chart shows the winning ticket's performance at each sparsity level.

Prune % 20%

Ticket vs Reinit vs Random

At the current sparsity level, compare three conditions: the winning ticket (original init + pruned mask), the same mask with random reinit, and a random mask. Click "Compare" to see training curves for all three conditions side by side.

Click Compare to start

What you should notice: (1) The winning ticket maintains accuracy even at 90%+ sparsity. (2) Random reinit degrades gradually. (3) The gap between winning ticket and random reinit proves that initialization — not just structure — is the key ingredient. (4) At extreme sparsity (>97%), even winning tickets fail.

After 15 rounds of 20% iterative pruning, what fraction of original weights remain?

20%, because 100% - 15 × 20% / something 70%, because each round only removes a few weights About 3.5%, because (1 - 0.20)^15 = 0.8^15 ≈ 0.035 — each round removes 20% of the REMAINING weights, compounding multiplicatively

Chapter 7: Connections

The Lottery Ticket Hypothesis connects to a broad web of ideas about what neural networks learn, why they work, and how to make them efficient.

What came before

Work	Contribution	Relationship to LTH
Optimal Brain Damage (LeCun 1990)	Prune using Hessian information	Early pruning work; LTH adds the initialization component
Deep Compression (Han 2016)	Prune + quantize + Huffman encode	Practical pruning pipeline; LTH explains WHY pruning works
Rethinking Generalization (Zhang 2017)	Networks can memorize random labels	Shows overparameterization enables memorization — LTH shows it also enables finding good subnetworks

What came after

Work	How It Extended LTH
Stabilizing LTH (Frankle 2020)	Late rewinding fixes the scaling problem
SNIP (Lee 2019)	Find tickets BEFORE training using gradient sensitivity
Edge-Popup (Ramanujan 2020)	Random weights are enough — just find the right MASK via optimization
LoRA (Hu 2022)	Instead of finding sparse subnetworks, train in a low-rank subspace
SparseGPT (Frantar 2023)	One-shot pruning for LLMs without retraining

The deeper insight

The Lottery Ticket Hypothesis tells us something fundamental about neural networks: they are massively redundant. The actual computation needed to solve a task lives in a tiny fraction of the parameter space. The rest of the parameters serve as "scaffolding" — they help optimization find the right subspace during training, but they're not needed for the final computation.

Pruning approach (LTH):

Train dense, identify important weights.

Result: sparse subnetwork.

Good for inference efficiency.

PEFT approach (LoRA, Adapters):

Freeze dense, train small additions.

Result: low-rank updates.

Good for training efficiency.

Both approaches exploit the same insight: the effective dimensionality of the solution is much lower than the parameter count suggests.

The LTH legacy: The Lottery Ticket Hypothesis showed that dense networks are not just "good at learning" — they are "good at containing learnable substructures." This reframed the field's understanding of overparameterization from "a waste" to "a strategy for finding good solutions." Every pruning method, knowledge distillation technique, and efficient architecture search builds on this foundation.

"In theory, there is no difference between theory and practice. In practice, there is." — attributed to Yogi Berra (relevant to LTH's scaling challenges)

What is the fundamental insight about neural networks that the Lottery Ticket Hypothesis reveals?

Dense networks are massively redundant — the actual computation needed to solve a task lives in a tiny fraction of parameters. The rest serve as "scaffolding" that helps optimization find the right subspace during training Neural networks should always be pruned after training Smaller networks are always better than larger ones