Why showing a network its easy examples first — the same way we teach children — can mean the difference between a model that generalizes and one that gives up.
Imagine you are teaching a child arithmetic. You would not open with “evaluate the integral of tan(x).” You start with 2 + 2, then carrying, then multiplication, and only much later, calculus. Each skill rests on the one before it. Scramble that order — throw integrals at a five-year-old between addition problems — and learning collapses.
Now look at how we train neural networks. The standard recipe shuffles the dataset and draws random mini-batches. Every batch is a uniform random mix of trivial examples and brutally hard ones. The five-year-old gets addition and integrals in the same breakfast. We never ask: does the order in which a model sees its data change what it learns?
That question is the whole subject of this lesson. The answer, discovered and re-discovered across two decades, is: sometimes, dramatically — and sometimes not at all. Knowing which situation you are in is the real skill. Curriculum learning is the family of techniques that order training examples from easy to hard instead of uniformly at random.
Early in training, the network's weights are near-random. Its predictions are garbage on every example. But a hard example — one near the true decision boundary, or with a mislabeled or ambiguous answer — produces a large, confidently-wrong gradient that yanks the weights in a direction that may be actively misleading. The model has no stable foundation yet, so it cannot tell a genuinely informative hard case from noise. It just gets shoved around.
Easy examples, by contrast, produce gradients that point in a clean, consistent direction — toward the broad structure of the task. Master those first and you build a rough-but-correct foundation. Then the hard examples become useful: now the model has enough structure to interpret them as fine corrections rather than chaos.
This is not just a metaphor. It connects to optimization: a non-convex loss surface has many bad local minima. Easy examples define a smoother, more convex-looking version of the loss landscape. Starting there is like continuation (or homotopy) methods in optimization — solve an easy smoothed problem first, then gradually deform it into the hard real one, carrying your solution along.
The simulation below trains a tiny classifier on the classic “two moons” dataset — two interleaving crescents. Points deep inside each crescent are easy (far from the other class). Points near where the crescents interleave are hard (ambiguous, close to the boundary). We will train the same model two ways and race them:
Press Train and watch the two decision boundaries form. The curriculum model commits to a clean separating curve early; the random model thrashes as conflicting gradients fight, often settling on a more jagged, lower-accuracy boundary.
Both panels train an identical 2-layer network on two moons. Left uses random batches; right unlocks examples easy→hard. Press Train, then watch the boundaries and the accuracy bars. Hit Reset and Train again — the curriculum is more consistent run to run.
Tip: nudge label noise up to 20% and re-train. The curriculum's lead widens — a clue we will cash out in Chapter 8.
Curriculum learning begins with a deceptively simple requirement: to order examples from easy to hard, you need a number for each one — a difficulty score, a function that maps a training example to a scalar. Everything downstream depends on this function. Get it wrong and your “curriculum” just reshuffles examples in a meaningless order. So: how do we measure difficulty?
There are three broad families, and they trade off cleanly against each other.
The cheapest option: use a property a human can define without running any model. For language modeling, sentence length — short sentences are easier. For object detection, number of objects or object size. For arithmetic, number of digits. These are free to compute and need no trained model, but they encode a human's guess about difficulty, which may not match the model's experience at all.
Let the model tell you. The most common signal is the per-example loss: run the example through the current model and read off how much loss it incurs. High loss = hard. A close cousin is the confidence (the probability the model assigns the correct class) and the margin (gap between the correct class score and the next-highest). These adapt to the actual model but are circular early on — a random model finds everything “hard.”
Borrow difficulty from a reference model. Train a small or earlier model, record each example's loss under it, and use that as a fixed, non-circular difficulty score for training the real model. This is the idea behind transfer teacher curricula and underlies industrial recipes like DoReMi (Chapter 7), where a small proxy model's losses guide the big model's data diet.
Let us make this concrete with the margin score, because it has a clean geometric meaning: how close is the example to the decision boundary? Suppose a binary classifier outputs logits (raw scores) for two classes, and the example's true class is class 0. Define the margin as m = ztrue − zother. Large positive margin = confidently correct = easy. Near zero = right on the boundary = hard. Negative = currently misclassified.
Take four examples with these (true, other) logits:
| Example | ztrue | zother | margin m | verdict |
|---|---|---|---|---|
| A | 4.0 | 0.5 | 4.0 − 0.5 = +3.5 | very easy |
| B | 2.0 | 1.0 | 2.0 − 1.0 = +1.0 | medium |
| C | 1.2 | 1.1 | 1.2 − 1.1 = +0.1 | hard (on boundary) |
| D | 0.3 | 1.4 | 0.3 − 1.4 = −1.1 | wrong / hardest |
Sort by margin descending and you get the easy→hard ordering A, B, C, D. A curriculum would feed A and B first, hold back C and D until the model has found roughly the right boundary, then introduce them as fine corrections. That ordering — nothing more — is the raw material of every method in this lesson.
python import torch import torch.nn.functional as F def difficulty_scores(model, X, y): # Per-example difficulty = per-example cross-entropy loss. # X: (N, d) inputs, y: (N,) integer labels. Returns: (N,) scores. model.eval() with torch.no_grad(): logits = model(X) # (N, C) # reduction='none' keeps ONE loss per example, not the mean loss = F.cross_entropy(logits, y, reduction='none') # (N,) return loss # high = hard def margin_scores(model, X, y): # Alternative: signed margin. High = easy, negative = misclassified. with torch.no_grad(): logits = model(X) # (N, C) true = logits.gather(1, y.unsqueeze(1)).squeeze(1) # (N,) logits_masked = logits.clone() logits_masked.scatter_(1, y.unsqueeze(1), -1e9) # hide true class other = logits_masked.max(dim=1).values # best wrong class return true - other # (N,) margin
Notice the crucial detail: reduction='none'. The default cross_entropy returns the mean over the batch — one number for backprop. For a curriculum we need the loss for each example separately, so we keep the full (N,) vector. That single argument is the difference between “train normally” and “measure difficulty.”
The widget below shows a 2D dataset with a linear decision boundary you can rotate and slide. Each point is colored by its margin-difficulty under that boundary: deep colors are easy (far from the line, confidently correct), pale gray points sit near the boundary (hard), and points on the wrong side glow red (misclassified, hardest). Drag the slider to move the boundary and watch difficulty recompute live. This is what a difficulty score “sees.”
Move the boundary. Points far from it on the correct side are easy (bold); points hugging it are hard (pale); points on the wrong side are misclassified (red ring). The histogram on the right is the difficulty distribution.
We have a difficulty score (Chapter 1). The next question is what to do with it. The first clean formalization came from Yoshua Bengio and colleagues in 2009, in a paper titled simply “Curriculum Learning.” Their framing is worth understanding exactly, because every later method is a variation on it.
Bengio's idea: instead of training on the fixed data distribution P(z) (where z is an example), train on a sequence of distributions Qλ(z) that start easy and gradually morph into P. A single knob λ slides from 0 (easiest distribution) to 1 (the real distribution).
Read this carefully. P(z) is the original probability of example z. Wλ(z) is a weight between 0 and 1 that says how much we let z into training at curriculum stage λ. We multiply them and renormalize (that is the “∝”) so the weighted thing is still a probability distribution. Bengio imposes two conditions that make this a genuine easy-to-hard schedule:
The simplest realization — the one most people actually use — is a hard threshold gate. Sort examples by difficulty. Pick a threshold; everything easier than the threshold has weight 1, everything harder has weight 0. Raise the threshold over training. This is the “baby step” or “one-pass” curriculum.
Suppose 6 examples with difficulty scores (0 = easiest, 1 = hardest) and equal original probability P(z) = 1/6 each:
| z | difficulty d | P(z) |
|---|---|---|
| z₁ | 0.10 | 1/6 |
| z₂ | 0.25 | 1/6 |
| z₃ | 0.40 | 1/6 |
| z₄ | 0.60 | 1/6 |
| z₅ | 0.80 | 1/6 |
| z₆ | 0.95 | 1/6 |
Use a threshold curriculum with threshold λ (an example is “in” if d ≤ λ). At λ = 0.5, the weights are W = [1, 1, 1, 0, 0, 0] — only z₁, z₂, z₃ are admitted. The unnormalized distribution is [1/6, 1/6, 1/6, 0, 0, 0]. Renormalizing (divide by the total, 3/6 = 0.5):
So at λ = 0.5 we sample uniformly among the three easiest examples, each with probability 1/3. Now advance to λ = 0.7: weights become [1,1,1,1,0,0], four examples in, each gets 1/4. At λ = 1.0: all six in, each back to 1/6 — we have recovered the original P exactly. That last equality is the “curriculum ends at the true objective” condition, made arithmetic.
python import numpy as np def curriculum_batches(difficulty, n_steps, batch=32, lam0=0.2): # difficulty: (N,) array, already normalized to [0,1] (0=easy) order = np.argsort(difficulty) # easy → hard indices N = len(difficulty) for step in range(n_steps): # competence: fraction of data unlocked, grows lam0 → 1 lam = min(1.0, lam0 + (1-lam0) * step / n_steps) n_avail = max(batch, int(lam * N)) # how many easiest examples are in pool = order[:n_avail] # the admitted subset idx = np.random.choice(pool, batch) # sample uniformly from it yield idx # indices for this training batch
That is the entire classical curriculum in nine lines. The model never sees a hard example until lam has risen enough to admit it into pool. Notice the data flow: difficulty scores (N,) → sorted order (N,) → a growing prefix pool → sampled indices (batch,) → the actual mini-batch fed to the model. The only thing that changes over training is how long that prefix is.
The widget shows the sampling probability over six difficulty bins as you slide λ from 0 to 1. At low λ all the probability mass sits on the easy bins (left). As λ rises, mass spreads rightward until, at λ = 1, every bin is equally likely — you have arrived at the true distribution P. The entropy readout below makes the “increasing entropy” condition literal.
Bars = sampling probability per difficulty bin (left = easy). Slide λ. Watch mass flow rightward and the entropy climb toward its maximum at λ = 1.
The threshold curriculum in Chapter 2 had a knob λ that we slid from its start value to 1. But how should it slide? Linearly? Quickly at first, then slow? The schedule that maps training step to “fraction of data unlocked” is called the pacing function (or competence function), written c(t): the fraction of the dataset, easiest-first, that the model is allowed to train on at step t.
This choice matters more than people expect. Open the gate too fast and the curriculum barely differs from random — the hard examples flood in before the foundation is built. Open it too slowly and you waste most of training on trivial examples the model already mastered, never reaching the hard cases with enough steps left to learn them.
Let t̃ = t/T be normalized progress (0 at the start, 1 at the end) and c0 the initial competence (the fraction unlocked at step 0, e.g. 0.1). The common families:
The linear one is obvious. The root pacing function, from Platanios et al. (2019), is the clever one, and worth understanding why it has that square-root shape. Here is the reasoning.
Take c0 = 0.2 (start with the easiest 20%) and evaluate the root function at three moments. Recall c(t) = √(c02 + (1 − c02)·t̃), and c02 = 0.04, so (1 − 0.04) = 0.96.
| progress t̃ | inside the root | c(t) = √(·) | % data unlocked |
|---|---|---|---|
| 0.00 (start) | 0.04 + 0.96×0 = 0.040 | √0.040 = 0.200 | 20% |
| 0.25 | 0.04 + 0.96×0.25 = 0.280 | √0.280 = 0.529 | 53% |
| 0.50 | 0.04 + 0.96×0.50 = 0.520 | √0.520 = 0.721 | 72% |
| 1.00 (end) | 0.04 + 0.96×1 = 1.000 | √1.000 = 1.000 | 100% |
Look at the shape: by the halfway point (t̃ = 0.5) we have already unlocked 72% of the data. Compare the linear schedule at t̃ = 0.5: c = 0.2 + 0.8×0.5 = 0.6, only 60%. The root function front-loads — it gets through the easy material faster and leaves the back half of training mostly on the full, hard distribution, which is where the real learning of difficult cases happens.
python import numpy as np def competence(t, T, c0=0.1, kind='root'): tn = t / T if kind == 'linear': c = c0 + (1 - c0) * tn elif kind == 'root': c = np.sqrt(c0**2 + (1 - c0**2) * tn) # Platanios 2019 elif kind == 'geom': c = 2**(np.log2(c0) * (1 - tn)) return min(1.0, c) # cdf[i] = cumulative difficulty rank of example i in [0,1], easiest first def sample_batch(cdf_rank, t, T, batch=32): c = competence(t, T, kind='root') # an example is available iff its difficulty rank ≤ current competence available = np.where(cdf_rank <= c)[0] return np.random.choice(available, batch)
The pattern: convert difficulty to a rank in [0,1] (so the median example has rank 0.5), then admit any example whose rank is below the current competence c(t). At c(t) = 0.72, the easiest 72% of examples are eligible. The model's data diet literally widens over time according to the curve.
The plot below puts normalized training progress on the x-axis and difficulty rank (easy at bottom, hard at top) on the y-axis. The curve is c(t); the shaded region beneath it is the data the model is allowed to train on at each moment. Switch between pacing functions and drag c0. Watch how root bulges upward early (front-loading the easy material) while linear is a straight ramp and geometric dawdles before rushing at the end.
Shaded = unlocked (eligible to train on). x = progress, y = difficulty rank. Toggle the pacing function and slide c0. The vertical line marks the current step; the readout shows what fraction is unlocked there.
Every method so far needs a difficulty score we define before training, by hand or from a reference model. But what is hard for a human — a long sentence, a cluttered image — may not be what is hard for this model right now. The cleanest fix: stop guessing difficulty and let the model rank its own examples by their current loss, and decide for itself which to study. This is self-paced learning (SPL), introduced by Kumar, Packer, and Koller in 2010.
The trick is to make example selection part of the optimization. Introduce a binary selection variable vi ∈ {0,1} for each example: 1 means “include it in training right now,” 0 means “skip it for now.” Then jointly minimize over both the model weights w and the selection v:
Stare at this. The first term is the usual loss, but only summed over selected examples. The second term, −λ∑vi, is a reward for selecting examples (it lowers the objective when vi = 1). So there is a tug of war: including an example adds its loss Li (bad) but earns a bonus λ (good). The model includes example i only if doing so is worth it.
Here is the beautiful part. Fix the weights w, and ask: for each example, set vi to 0 or 1 to minimize the objective? The contribution of example i is vi(Li − λ). If Li < λ, that term is negative when vi = 1, so we include it (it lowers the objective). If Li > λ, including it would raise the objective, so we set vi = 0. The optimal selection is simply a threshold on the loss:
So λ is a loss threshold: train only on examples the model already handles better than λ. And here is the curriculum: start λ small (only the very easiest, lowest-loss examples are admitted) and grow it over training. As λ rises, examples that used to be too hard cross the threshold and join. Kumar et al. call λ the “age” of the model — an older, wiser model admits harder material.
Five examples with current losses, and the age parameter set to λ = 0.4:
| example | loss Li | Li − λ | vi* (include?) |
|---|---|---|---|
| e₁ | 0.10 | 0.10 − 0.40 = −0.30 | 1 — include (negative = good) |
| e₂ | 0.30 | 0.30 − 0.40 = −0.10 | 1 — include |
| e₃ | 0.50 | 0.50 − 0.40 = +0.10 | 0 — skip (too hard for now) |
| e₄ | 2.00 | 2.00 − 0.40 = +1.60 | 0 — skip |
| e₅ | 0.35 | 0.35 − 0.40 = −0.05 | 1 — include |
Selection v* = [1, 1, 0, 0, 1] — three examples in. Now run a few SGD steps on those three. Their losses drop; suppose e₃'s loss falls to 0.38 as a side effect (it was near the ones we trained on). Raise the age to λ = 0.55. Re-threshold: now e₃ (0.38) and possibly more cross in. The pool grows organically. e₄ (loss 2.00) — perhaps an outlier or mislabeled point — stays excluded until very late, which is exactly the robustness we want.
python import torch, torch.nn.functional as F def spl_epoch(model, opt, X, y, lam): # STEP 1 — fix w, choose v: per-example loss, threshold at lam model.eval() with torch.no_grad(): per = F.cross_entropy(model(X), y, reduction='none') # (N,) v = (per < lam).float() # closed-form v* = 1[L_i < lam] # STEP 2 — fix v, update w: train only on selected examples model.train() logits = model(X) per = F.cross_entropy(logits, y, reduction='none') loss = (v * per).sum() / (v.sum() + 1e-8) # masked mean opt.zero_grad(); loss.backward(); opt.step() return v.sum().item() # how many were active # Outer loop: grow the age lam each epoch (the curriculum) lam = 0.2 for epoch in range(50): n_active = spl_epoch(model, opt, X, y, lam) lam *= 1.1 # age the model: admit harder examples over time
The data flow is a two-step alternation per epoch: (1) freeze the weights, compute per-example losses, threshold to get the mask v; (2) freeze v, take a gradient step on the masked loss. The v * per product zeroes out the gradient from any example the model has deemed too hard right now — those examples contribute literally nothing to the update.
Twelve examples, each a bar showing its current loss. The horizontal line is the age threshold λ. Bars below the line are selected (teal); bars above are skipped (gray). Press Train step: selected examples' losses drop (the model is learning them), occasionally pulling a neighbor below the line. Raise λ to age the model and admit harder examples. Notice the stubborn tall bar — an outlier that resists until λ climbs high, modelling SPL's built-in robustness to bad data.
Bars = per-example loss. Line = age λ. Below the line = selected (teal), trained this step; above = skipped (gray). Train repeatedly, then raise λ.
Self-paced learning still uses one fixed rule: “train on whatever has low loss.” But maybe the best thing to train on is not the easiest, nor the hardest, but the example at the edge of your ability — hard enough to teach you something, easy enough that you can actually learn it. Psychologists call this the zone of proximal development. Can a system discover that zone automatically, and move it as the model improves? Yes — with a teacher.
In automated curriculum learning (Graves et al., 2017), we have a student (the model being trained) and a teacher (a policy that, each step, picks which kind of example or task to feed the student). The teacher's goal is to maximize the student's learning progress — train on whatever the student is currently improving on the fastest.
Frame it as a multi-armed bandit. Each “arm” is a difficulty bucket or task category (say: 1-digit addition, 2-digit, 3-digit, …). Pulling an arm = training the student on a batch from that bucket. The reward is not accuracy — it is how much the student improved from that batch:
This reward signal is the whole insight. Think about what it does to the three regimes:
The teacher keeps a weight wk per arm and samples arm k with probability proportional to exp(wk) (a softmax / exponential-weights scheme that handles the explore-exploit tradeoff). After observing reward rk, it bumps that arm's weight:
where r̃k is the (normalized) learning-progress reward and η is the teacher's learning rate. Arms that yield progress get sampled more; arms that stall fade. Because progress is transient — it dries up once mastered — the probability mass travels across the arms over training, easy to hard, all on its own.
Three arms with weights w = [0, 0, 0], so initial probabilities are uniform: p = [⅓, ⅓, ⅓]. The teacher samples arm 2, trains the student there, and measures loss before = 1.40, after = 1.05. Reward r₂ = |1.40 − 1.05| = 0.35. With teacher rate η = 1.0, update w₂ ← 0 + 1.0×0.35 = 0.35. New weights w = [0, 0.35, 0]. Recompute probabilities:
Arm 2's probability jumped from 0.333 to 0.415 because it produced progress. If a later pull of arm 2 yields less progress (the student is mastering it), its weight grows more slowly than a freshly-learnable arm, and the balance tips toward whichever arm is now in the proximal zone.
python import numpy as np class TeacherEXP3: def __init__(self, n_arms, eta=2.0): self.w = np.zeros(n_arms) # log-weights, one per difficulty bucket self.eta = eta self.prev_loss = None def probs(self): z = self.w - self.w.max() # softmax over arms (numerically safe) e = np.exp(z); return e / e.sum() def choose(self): return np.random.choice(len(self.w), p=self.probs()) def update(self, arm, loss_before, loss_after): # reward = learning progress = how much loss dropped on this batch reward = abs(loss_before - loss_after) self.w *= 0.96 # gentle decay → stays responsive as frontier moves self.w[arm] += self.eta * reward # bump the arm that produced progress # training loop: teacher picks the bucket, student trains, teacher learns teacher = TeacherEXP3(n_arms=5) for step in range(n_steps): k = teacher.choose() # which difficulty bucket? Lb = eval_loss(student, bucket[k]) # loss before train_on(student, bucket[k]) # one student SGD step on that bucket La = eval_loss(student, bucket[k]) # loss after teacher.update(k, Lb, La) # reward = |Lb - La|
The data flow is a tight loop between two agents: the teacher emits an arm index k (a scalar), the student trains on bucket k and reports two loss scalars, the teacher converts their difference into a reward and updates one weight. No difficulty was ever labeled by a human — the ordering is a side effect of always chasing progress. The decay factor 0.96 is the subtle ingredient: it lets a once-favored arm fade as its progress dries up, so the probability mass keeps migrating to the new frontier.
Five difficulty buckets (left = easiest, right = hardest). Each bucket can only be learned once the one before it is mostly mastered — a chain of prerequisites. The filled portion of each bar is the student's mastery; the bright cap is the teacher's current probability of picking that bucket. Press Run teacher and watch the bright attention band sweep left-to-right: the teacher always sits on the learnable frontier, never told the order, discovering it from learning-progress alone.
Tall fill = student mastery of that bucket. Bright cap = teacher's pick probability. Run it: attention rides from easy to hard as each bucket is mastered. Reset and run again — same emergent easy→hard order, never hard-coded.
Everything so far — difficulty scores, threshold gates, pacing functions, self-pacing, teachers — comes together here. This is a real (tiny) neural network training live in your browser on the two-moons problem, and you choose its data strategy. A faint random-order baseline trains alongside every run so you can always see whether the curriculum actually helped.
Your job in this chapter is to break things on purpose and build intuition for when curricula win and when they don't:
Left: the data and the forming decision boundary. Bright points are currently unlocked for training; dim points are held back. Right: test accuracy over time — your strategy (teal) vs the random baseline (gray). Pick a strategy, press Train, and meddle.
Pick a strategy and press Train.
No quiz this chapter — the simulation is the test. If you can predict, before pressing Train, whether the curriculum line will beat the baseline for a given noise level, you understand this lesson.
At the scale of a frontier language model, “curriculum” rarely means ordering individual sentences. The data is split into domains — web text, code, books, Wikipedia, arXiv, math — and the real lever is the mixture weights: what fraction of each training batch comes from each domain. Choosing those weights well is one of the highest-leverage decisions in pretraining, and it is, in spirit, a curriculum over where the model's attention goes.
Call the mixture weights α = (α1, …, αK), one per domain, summing to 1. Each training batch samples domain k with probability αk. The naive choice is to set αk proportional to how much data each domain has — but the biggest domain (usually scraped web text) is not necessarily the most valuable per token.
DoReMi (Xie et al., 2023, “Domain Reweighting with Minimax Optimization”) finds good mixture weights without training the big model repeatedly. The recipe has three steps, and the data-flow is the point:
The heart is step 2. Define the excess loss of domain k as how much worse the proxy is than the reference on that domain:
A large positive excess means “the proxy is doing much worse than a reference model could here — there is room to improve, so spend more budget on this domain.” DoReMi raises α on exactly those domains via an exponentiated-gradient update:
This is a minimax game: the weights α maximize excess loss (seeking the worst-served domain), while the proxy's parameters minimize the α-weighted loss (fixing whatever the weights expose). The equilibrium is a mixture that protects the worst-case domain — no domain is left badly under-served relative to what a model could achieve.
Four domains, current weights α = [0.25, 0.25, 0.25, 0.25], excess losses just measured, teacher rate η = 1.0:
| domain | excessk | α·eη·excess |
|---|---|---|
| Web | 0.05 | 0.25 × e0.05 = 0.25 × 1.051 = 0.2628 |
| Code | 0.40 | 0.25 × e0.40 = 0.25 × 1.492 = 0.3730 |
| Books | 0.10 | 0.25 × e0.10 = 0.25 × 1.105 = 0.2763 |
| Wiki | 0.02 | 0.25 × e0.02 = 0.25 × 1.020 = 0.2551 |
Sum of the unnormalized column = 0.2628 + 0.3730 + 0.2763 + 0.2551 = 1.1672. Divide each by that total to renormalize:
Code — the domain with the biggest excess loss — jumped from 0.25 to 0.32, while Wiki (almost no excess) drifted down to 0.219. One step nudged the budget toward the domain with the most unrealized improvement. Iterate this and α settles on a mixture that no domain can complain about. Notice the data flow: per-domain losses (K,) → excess (K,) → multiplicative weight update (K,) → renormalize → new sampling distribution for the next batches.
Four domains. Drag the weight sliders to set the mixture by hand and watch the worst-case excess loss (the number DoReMi tries to push down). Then hit DoReMi step and let the exponentiated-gradient update do it for you — it automatically pours weight into the worst-served domain until the bars even out. Try to beat it by hand; it is harder than it looks.
Each domain shows its mixture weight αk (bar height) and its current excess loss (red marker). Worst-case excess is what matters. Adjust by hand, or let DoReMi auto-balance.
Here is the honest chapter, the one most tutorials skip. Curriculum learning is not free magic. On large, clean, well-curated datasets trained with a good optimizer and proper shuffling, carefully-built curricula often produce no measurable gain — and sometimes a small loss from the extra complexity and the time spent over-training easy examples. Worse, a famous body of work argues for the opposite strategy: train on the hardest examples first or most. Both camps have strong empirical support. How can both be right?
Hard-example mining — deliberately over-sampling the examples the model gets wrong — is everywhere in production:
These work because, for a model that is already reasonably trained, easy examples are redundant: they produce tiny gradients and teach nothing new. The information is concentrated in the hard examples. So the field holds two seemingly-contradictory beliefs at once: easy-first and hard-first. The resolution is that they apply at different times and under different conditions.
The single best predictor of whether easy-first helps is label noise. Mislabeled examples have high loss — they look exactly like “hard” examples to any loss-based difficulty score. A hard-first strategy therefore amplifies the noise: it pours training onto the corrupted labels. An easy-first strategy quarantines them: the noisy examples stay locked behind the difficulty gate until late, by which point the model is robust enough to discount them (or training has ended). This is why self-paced learning was originally pitched as a robustness method.
| Situation | Best strategy | Why |
|---|---|---|
| Noisy / weakly-labeled data | Easy-first | quarantines high-loss mislabeled examples |
| Small dataset | Easy-first | fragile model needs a stable foundation |
| Sparse-reward RL / hard exploration | Auto-curriculum | most tasks give zero signal; teacher finds the frontier |
| Very hard target task | Easy-then-hard | bootstrap simple skills, then specialize |
| Large, clean, balanced data | Often none | shuffling + good optimizer already suffices |
| Late-stage fine-tuning, clean labels | Hard-first (mining) | easy examples are redundant; signal is in the hard ones |
Imagine 1000 examples, 200 of them mislabeled. A loss-based difficulty score ranks all 200 mislabeled ones among the “hardest” (their labels contradict the pattern, so loss stays high). Easy-first with competence reaching, say, 0.8 by mid-training trains almost exclusively on the 800 clean examples first — the model learns the true pattern from clean data, then meets the noise late and largely ignores it. Hard-first does the reverse: it front-loads the 200 corrupted examples, teaching the model the wrong pattern before it ever sees the clean majority. Same data, opposite outcomes, decided entirely by ordering.
The plot sweeps label noise on the x-axis and shows final test accuracy for easy-first (teal) vs random (gray). At zero noise the lines nearly coincide — curriculum buys you almost nothing. As noise rises, the random line plunges (it trains on corrupted labels indiscriminately) while easy-first degrades gracefully. The vertical gap between them is the value of the curriculum. Drag the dataset-size slider: smaller datasets widen the gap at every noise level.
Teal = easy-first, gray = random. The gap is the curriculum's payoff — near zero on clean data, large under noise. Drag dataset size and the noise marker.
You now have the full toolbox: how to score difficulty, how to gate it, how fast to open the gate, how to let the model pace itself, how to make a teacher discover the order, how the same idea scales to LLM domain mixtures, and when to not bother at all. This closing chapter is your reference card.
| Method | Who picks the order | Difficulty signal | Best for |
|---|---|---|---|
| Classical (Bengio ’09) | You, in advance | fixed score + threshold λ | tasks with a clear human notion of difficulty |
| Competence-based | You, via a pacing function | fixed score + c(t) schedule | when you want smooth, principled pacing (root) |
| Self-paced (SPL) | The model | current per-example loss < λ | noisy data, no good a-priori difficulty |
| Teacher-student | A learned policy | learning progress |ΔL| | RL, multi-task, sparse rewards |
| DoReMi (domain) | Minimax proxy | excess loss vs reference | LLM pretraining data mixtures |
Curriculum learning is a layer that sits on top of the rest of the training stack you have been building in this series:
“Education is not the filling of a pail, but the lighting of a fire” — and you light it one easy spark at a time.