Finding Sparse, Trainable Neural Networks — dense networks contain sparse subnetworks ("winning tickets") that can train to full accuracy when reset to their original initialization.
You're building a neural network to classify images. A small network with 10,000 parameters can memorize the training data perfectly, so logically, 10,000 parameters should be enough. But in practice, you use 10 million parameters — 1000x more than theoretically necessary. Why?
Because large networks train better. A small network and a large network might both be capable of representing the same function, but the large network will find it during training while the small one gets stuck in bad local minima. It's like searching for a needle in a haystack: a wider search net catches more needles.
This creates a paradox:
Standard pruning attacks this waste: train a large network, then remove (zero out) the smallest-magnitude weights. A network with 10M parameters can often be pruned to 1M parameters (90% sparsity) with minimal accuracy loss. This is useful for deployment — smaller models are faster and cheaper.
But pruning doesn't solve the training cost problem. You still need to train the full 10M-parameter network first, then prune it. The question Frankle and Carlin asked was more radical: can we identify which weights matter before (or early in) training, and only train those?
This network has many connections (weights), but most can be removed after training. Drag the sparsity slider to prune weights — the network's accuracy barely changes until extreme sparsity. The red connections are pruned; the green ones survive.
The Lottery Ticket Hypothesis states:
Let's unpack this carefully. Imagine a lottery:
1. A dense network is a bag of lottery tickets — millions of possible subnetworks, each defined by a different subset of weights.
2. Most subnetworks are "losing tickets" — they can't train to full accuracy from their initialization.
3. A few subnetworks are "winning tickets" — they happened to get lucky initializations that let them train to full accuracy.
4. The dense network succeeds because it contains enough tickets that at least one wins.
The crucial detail: winning tickets aren't just any sparse subnetwork. They must be combined with their original random initialization. A winning ticket reinitialised with new random weights is just a random sparse network — and random sparse networks train poorly.
Given a dense network f(x; θ0) with initial parameters θ0 ~ D (some initialization distribution):
Where m ∈ {0, 1}|θ| is a binary mask (1 = keep, 0 = prune), and ⊙ is element-wise multiplication. The mask m produces a subnetwork with ||m||0 / |θ| parameters (the sparsity level). The hypothesis says that masks exist where this ratio is very small (90-95% sparsity) yet accuracy is maintained.
python import torch # The Lottery Ticket framework class LotteryTicket: def __init__(self, model): # Step 1: Save the original initialization self.theta_0 = {n: p.clone() for n, p in model.named_parameters()} # Step 2: Create binary mask (all ones initially) self.mask = {n: torch.ones_like(p) for n, p in model.named_parameters()} def prune(self, model, percent=20): """Prune the smallest-magnitude weights.""" for name, param in model.named_parameters(): alive = self.mask[name].bool() # Among surviving weights, find the smallest values = param.abs()[alive] threshold = torch.quantile(values, percent / 100) # Zero out weights below threshold new_mask = (param.abs() >= threshold).float() self.mask[name] *= new_mask def reset_to_init(self, model): """Reset surviving weights to their original initialization.""" for name, param in model.named_parameters(): param.data = self.theta_0[name] * self.mask[name] def apply_mask(self, model): """Zero out pruned weights (call after each optimizer step).""" for name, param in model.named_parameters(): param.data *= self.mask[name]
Each small network is a potential "ticket." Click "Draw Ticket" to sample a random subnetwork and see if it can train to target accuracy. Winning tickets (green) have lucky initializations; losing tickets (red) get stuck. Notice that winning tickets are rare — you need many chances (a dense network) to find one.
How do you find a winning ticket? You can't just pick a random sparse subnetwork — most are losing tickets. Frankle and Carlin's method is iterative magnitude pruning (IMP):
Each round prunes p% of the remaining weights (not the original total). After n rounds of p% pruning, the surviving fraction is (1-p/100)n.
python # Iterative Magnitude Pruning def find_winning_ticket(model_fn, train_fn, prune_pct=20, rounds=15): model = model_fn() # random init lt = LotteryTicket(model) # save θ₀ for r in range(rounds): train_fn(model) # train to convergence lt.prune(model, prune_pct) # prune smallest 20% lt.reset_to_init(model) # reset to θ₀ sparsity = 1 - (1 - prune_pct/100)**(r+1) print(f"Round {r+1}: {sparsity*100:.1f}% pruned") # Round 1: 20.0% pruned # Round 5: 67.2% pruned # Round 10: 89.3% pruned # Round 15: 96.5% pruned return model, lt.mask # the winning ticket
The pruning criterion is simple: remove weights with the smallest absolute value after training. The intuition is that weights close to zero contribute little to the output — they've been driven toward zero by the training process itself, suggesting the network "wants" to remove them.
This is a retrospective method. You need to train the full network first to know which weights end up small. The lottery ticket isn't identified in advance — it's discovered after the fact. Finding winning tickets prospectively (before training) remains an open problem.
Watch iterative magnitude pruning in action. Each round trains the network (blue training curve), then prunes the smallest weights (red). Click "Next Round" to advance. The sparsity increases each round, but accuracy is maintained until extreme pruning.
The most surprising finding in the paper is that the initialization is everything. The same sparse structure performs dramatically differently depending on which weights fill it.
Frankle and Carlin tested three conditions:
| Condition | Structure | Weights | Performance |
|---|---|---|---|
| Winning ticket | Pruned mask | Original init θ0 | Matches dense |
| Random reinit | Same pruned mask | New random weights | Much worse |
| Random structure | Random mask (same sparsity) | Random weights | Even worse |
Why do some initializations "win"? The leading theory: winning initializations happen to place the subnetwork's weights near a favorable region of the loss landscape. Specifically:
1. Sign alignment: The initial signs of weights in winning tickets tend to be aligned with their final trained values. If a weight will end up positive, the winning ticket initializes it positive. This gives training a "head start."
2. Magnitude balance: Winning tickets have initial weights with magnitudes that are neither too large (causing instability) nor too small (causing vanishing signals) for the specific subnetwork structure.
3. Layer-wise connectivity: The pruned weights preserve enough connections in each layer to maintain information flow through the network. Random pruning might cut all connections in one layer, creating a bottleneck.
python # Testing the importance of initialization def ablation_study(model_fn, train_fn, mask): # Condition 1: Winning ticket (original init + mask) model1 = model_fn(seed=42) apply_mask(model1, mask) acc1 = train_fn(model1) print(f"Winning ticket: {acc1:.2f}%") # ~95% # Condition 2: Same structure, new random weights model2 = model_fn(seed=999) # different init apply_mask(model2, mask) # same mask acc2 = train_fn(model2) print(f"Random reinit: {acc2:.2f}%") # ~85% # Condition 3: Random structure, random weights model3 = model_fn(seed=999) random_mask = random_mask_like(mask) # same sparsity, random structure apply_mask(model3, random_mask) acc3 = train_fn(model3) print(f"Random struct: {acc3:.2f}%") # ~78%
Watch three training curves: the winning ticket (original init + mask), the same mask with random reinit, and a random mask. All three have the same sparsity level (90%). Only the winning ticket matches the dense network's accuracy. Click "Train" to animate.
Frankle and Carlin validated the Lottery Ticket Hypothesis across multiple architectures and datasets. The evidence is remarkably consistent.
| Network | Full Params | Winning Ticket Sparsity | Full Acc | Ticket Acc |
|---|---|---|---|---|
| LeNet-300-100 (FC) | 266K | 96.5% (3.5% remaining) | 98.3% | 98.3% |
| Conv-2 (CNN) | 37K | 95.2% | 99.1% | 99.2% |
| Conv-4 (CNN) | 63K | 94.5% | 99.2% | 99.2% |
| Conv-6 (CNN) | 95K | 93.5% | 99.2% | 99.2% |
On MNIST, winning tickets at 93-96% sparsity match or slightly exceed the full network's accuracy. A LeNet with only 3.5% of its parameters remaining (9,300 out of 266,000) achieves the same 98.3% accuracy as the full network.
| Network | Full Params | Max Sparsity (no loss) | Ticket Accuracy |
|---|---|---|---|
| ResNet-18 | 11.2M | ~89% | 93.5% (matches full) |
| VGG-16 | 14.7M | ~93% | 93.8% (matches full) |
1. Winning tickets train faster. Not just to the same accuracy — they often reach the target accuracy in fewer iterations than the full network. The winning ticket's initialization is so favorable that it converges faster despite having fewer parameters.
2. Winning tickets generalize better. At moderate sparsity levels (60-80%), winning tickets achieve slightly higher test accuracy than the full network. The pruning acts as an implicit regularizer, preventing overfitting.
3. Winning tickets are not random subsets. Random subnetworks at the same sparsity level perform much worse. The structure and initialization together are crucial.
Drag the slider to change sparsity level. The chart shows accuracy of winning tickets (green) vs random pruning (red) vs the full network (dashed). Winning tickets maintain accuracy far beyond what random pruning can handle.
python # Empirical result: winning tickets converge faster # LeNet on MNIST at 90% sparsity # Full network (266K params): # Epoch 5: 96.2% → Epoch 10: 97.8% → Epoch 20: 98.3% # Winning ticket (26.6K params, 90% sparse): # Epoch 5: 97.1% → Epoch 10: 98.1% → Epoch 20: 98.3% # ↑ Higher accuracy at epoch 5 despite 10x fewer params! # Random reinit (26.6K params, 90% sparse): # Epoch 5: 91.3% → Epoch 10: 93.5% → Epoch 20: 95.1% # ↑ Never catches up — stuck in a worse basin
The Lottery Ticket Hypothesis, as originally stated, has important limitations that subsequent work revealed.
On larger networks (ResNets, Transformers) and harder datasets (ImageNet), the original LTH procedure fails. Winning tickets with the exact original initialization don't always match the full network's accuracy. Why?
Frankle et al. (2020) proposed a fix: Late Rewinding. Instead of resetting to the initial weights θ0, reset to the weights at some early training step k (e.g., k = 0.1% of total training):
This small change fixes the scaling problem. Rewinding to step k gives the network enough "warm-up" to stabilize its training trajectory before the winning ticket is identified.
| Method | Reset Point | Works on Small CNNs? | Works on ResNets/ImageNet? |
|---|---|---|---|
| Original LTH | θ0 (init) | Yes | No |
| Late Rewinding | θk (early step) | Yes | Yes |
| Fine-tuning | θT (final) | Yes | Yes (but not a ticket) |
The elephant in the room: finding winning tickets is more expensive than just training the dense network. With 15 rounds of iterative pruning, you need 15 full training runs. The computational cost is:
You spend 15x the compute to find a subnetwork that trains as well as the full network. The savings come at inference time (smaller, faster model), not at training time.
Despite the training cost, the LTH has had lasting impact:
1. Understanding neural networks: LTH provided evidence that good solutions live in surprisingly low-dimensional subspaces of the parameter space. This changed how researchers think about what networks "learn."
2. Structured pruning: Extensions to LTH prune entire neurons, channels, or attention heads (not individual weights), producing models that are actually faster on hardware (unstructured sparsity is hard to accelerate).
3. Transfer learning: Winning tickets found on one task often transfer to related tasks. The structure and initialization capture something general about the data domain.
Compare the original LTH (reset to θ₀) with late rewinding (reset to θ_k). On small networks, both work. On large networks, only late rewinding maintains accuracy. Drag the slider to change the rewind point.
Put the Lottery Ticket Hypothesis to the test. This explorer lets you run the full iterative magnitude pruning pipeline: initialize, train, prune, reset, and repeat.
Watch the full IMP pipeline. Each round trains the network, prunes the smallest weights, and resets to the original initialization. The network visualization shows surviving connections (green) and pruned connections (fading out). The accuracy chart shows the winning ticket's performance at each sparsity level.
At the current sparsity level, compare three conditions: the winning ticket (original init + pruned mask), the same mask with random reinit, and a random mask. Click "Compare" to see training curves for all three conditions side by side.
The Lottery Ticket Hypothesis connects to a broad web of ideas about what neural networks learn, why they work, and how to make them efficient.
| Work | Contribution | Relationship to LTH |
|---|---|---|
| Optimal Brain Damage (LeCun 1990) | Prune using Hessian information | Early pruning work; LTH adds the initialization component |
| Deep Compression (Han 2016) | Prune + quantize + Huffman encode | Practical pruning pipeline; LTH explains WHY pruning works |
| Rethinking Generalization (Zhang 2017) | Networks can memorize random labels | Shows overparameterization enables memorization — LTH shows it also enables finding good subnetworks |
| Work | How It Extended LTH |
|---|---|
| Stabilizing LTH (Frankle 2020) | Late rewinding fixes the scaling problem |
| SNIP (Lee 2019) | Find tickets BEFORE training using gradient sensitivity |
| Edge-Popup (Ramanujan 2020) | Random weights are enough — just find the right MASK via optimization |
| LoRA (Hu 2022) | Instead of finding sparse subnetworks, train in a low-rank subspace |
| SparseGPT (Frantar 2023) | One-shot pruning for LLMs without retraining |
The Lottery Ticket Hypothesis tells us something fundamental about neural networks: they are massively redundant. The actual computation needed to solve a task lives in a tiny fraction of the parameter space. The rest of the parameters serve as "scaffolding" — they help optimization find the right subspace during training, but they're not needed for the final computation.
Pruning approach (LTH):
Train dense, identify important weights.
Result: sparse subnetwork.
Good for inference efficiency.
PEFT approach (LoRA, Adapters):
Freeze dense, train small additions.
Result: low-rank updates.
Good for training efficiency.
Both approaches exploit the same insight: the effective dimensionality of the solution is much lower than the parameter count suggests.
"In theory, there is no difference between theory and practice. In practice, there is." — attributed to Yogi Berra (relevant to LTH's scaling challenges)