Training Foundations

Curriculum Learning

Why showing a network its easy examples first — the same way we teach children — can mean the difference between a model that generalizes and one that gives up.

Prerequisites: What a training loss is + SGD updates weights to lower loss. That’s it.
10
Chapters
9+
Simulations
0
Assumed Knowledge

Chapter 0: Does the Order Matter?

Imagine you are teaching a child arithmetic. You would not open with “evaluate the integral of tan(x).” You start with 2 + 2, then carrying, then multiplication, and only much later, calculus. Each skill rests on the one before it. Scramble that order — throw integrals at a five-year-old between addition problems — and learning collapses.

Now look at how we train neural networks. The standard recipe shuffles the dataset and draws random mini-batches. Every batch is a uniform random mix of trivial examples and brutally hard ones. The five-year-old gets addition and integrals in the same breakfast. We never ask: does the order in which a model sees its data change what it learns?

That question is the whole subject of this lesson. The answer, discovered and re-discovered across two decades, is: sometimes, dramatically — and sometimes not at all. Knowing which situation you are in is the real skill. Curriculum learning is the family of techniques that order training examples from easy to hard instead of uniformly at random.

The one-sentence version. Curriculum learning organizes the training data so the model masters easy, unambiguous examples first and only later faces the hard, noisy, or rare ones — the same scaffolding a good teacher gives a student.

Why Random Order Can Hurt

Early in training, the network's weights are near-random. Its predictions are garbage on every example. But a hard example — one near the true decision boundary, or with a mislabeled or ambiguous answer — produces a large, confidently-wrong gradient that yanks the weights in a direction that may be actively misleading. The model has no stable foundation yet, so it cannot tell a genuinely informative hard case from noise. It just gets shoved around.

Easy examples, by contrast, produce gradients that point in a clean, consistent direction — toward the broad structure of the task. Master those first and you build a rough-but-correct foundation. Then the hard examples become useful: now the model has enough structure to interpret them as fine corrections rather than chaos.

This is not just a metaphor. It connects to optimization: a non-convex loss surface has many bad local minima. Easy examples define a smoother, more convex-looking version of the loss landscape. Starting there is like continuation (or homotopy) methods in optimization — solve an easy smoothed problem first, then gradually deform it into the hard real one, carrying your solution along.

The Two-Moons Demonstration

The simulation below trains a tiny classifier on the classic “two moons” dataset — two interleaving crescents. Points deep inside each crescent are easy (far from the other class). Points near where the crescents interleave are hard (ambiguous, close to the boundary). We will train the same model two ways and race them:

Press Train and watch the two decision boundaries form. The curriculum model commits to a clean separating curve early; the random model thrashes as conflicting gradients fight, often settling on a more jagged, lower-accuracy boundary.

Race: Random Order vs. Easy-First Curriculum

Both panels train an identical 2-layer network on two moons. Left uses random batches; right unlocks examples easy→hard. Press Train, then watch the boundaries and the accuracy bars. Hit Reset and Train again — the curriculum is more consistent run to run.

Label noise 0%

Tip: nudge label noise up to 20% and re-train. The curriculum's lead widens — a clue we will cash out in Chapter 8.

Common misconception. “Order can't matter — SGD sees every example many times over all the epochs, so it averages out.” It does not average out, because training is path-dependent. Where the weights are when a gradient arrives changes what that gradient does. A hard example at step 10 (random weights) and the same example at step 10,000 (trained weights) push the model in completely different directions. Order shapes the path, and the path decides which minimum you fall into.
Why might a hard, near-boundary example hurt a randomly-initialized model more than it helps?

Chapter 1: What Makes an Example “Hard”?

Curriculum learning begins with a deceptively simple requirement: to order examples from easy to hard, you need a number for each one — a difficulty score, a function that maps a training example to a scalar. Everything downstream depends on this function. Get it wrong and your “curriculum” just reshuffles examples in a meaningless order. So: how do we measure difficulty?

There are three broad families, and they trade off cleanly against each other.

1. Human / heuristic scores

The cheapest option: use a property a human can define without running any model. For language modeling, sentence length — short sentences are easier. For object detection, number of objects or object size. For arithmetic, number of digits. These are free to compute and need no trained model, but they encode a human's guess about difficulty, which may not match the model's experience at all.

2. Model-based scores

Let the model tell you. The most common signal is the per-example loss: run the example through the current model and read off how much loss it incurs. High loss = hard. A close cousin is the confidence (the probability the model assigns the correct class) and the margin (gap between the correct class score and the next-highest). These adapt to the actual model but are circular early on — a random model finds everything “hard.”

3. Transfer scores

Borrow difficulty from a reference model. Train a small or earlier model, record each example's loss under it, and use that as a fixed, non-circular difficulty score for training the real model. This is the idea behind transfer teacher curricula and underlies industrial recipes like DoReMi (Chapter 7), where a small proxy model's losses guide the big model's data diet.

The key insight. Difficulty is not a property of the example alone — it is a property of the example and the model together. The same sentence is “hard” for a fresh network and “trivial” for a trained one. Every difficulty score is really answering “hard for whom?

The margin as difficulty — worked by hand

Let us make this concrete with the margin score, because it has a clean geometric meaning: how close is the example to the decision boundary? Suppose a binary classifier outputs logits (raw scores) for two classes, and the example's true class is class 0. Define the margin as m = ztrue − zother. Large positive margin = confidently correct = easy. Near zero = right on the boundary = hard. Negative = currently misclassified.

Take four examples with these (true, other) logits:

Exampleztruezothermargin mverdict
A4.00.54.0 − 0.5 = +3.5very easy
B2.01.02.0 − 1.0 = +1.0medium
C1.21.11.2 − 1.1 = +0.1hard (on boundary)
D0.31.40.3 − 1.4 = −1.1wrong / hardest

Sort by margin descending and you get the easy→hard ordering A, B, C, D. A curriculum would feed A and B first, hold back C and D until the model has found roughly the right boundary, then introduce them as fine corrections. That ordering — nothing more — is the raw material of every method in this lesson.

From scratch: a difficulty scorer

python
import torch
import torch.nn.functional as F

def difficulty_scores(model, X, y):
    # Per-example difficulty = per-example cross-entropy loss.
    # X: (N, d) inputs, y: (N,) integer labels. Returns: (N,) scores.
    model.eval()
    with torch.no_grad():
        logits = model(X)                       # (N, C)
        # reduction='none' keeps ONE loss per example, not the mean
        loss = F.cross_entropy(logits, y, reduction='none')  # (N,)
    return loss                                # high = hard

def margin_scores(model, X, y):
    # Alternative: signed margin. High = easy, negative = misclassified.
    with torch.no_grad():
        logits = model(X)                       # (N, C)
        true = logits.gather(1, y.unsqueeze(1)).squeeze(1)  # (N,)
        logits_masked = logits.clone()
        logits_masked.scatter_(1, y.unsqueeze(1), -1e9)  # hide true class
        other = logits_masked.max(dim=1).values        # best wrong class
    return true - other                        # (N,) margin

Notice the crucial detail: reduction='none'. The default cross_entropy returns the mean over the batch — one number for backprop. For a curriculum we need the loss for each example separately, so we keep the full (N,) vector. That single argument is the difference between “train normally” and “measure difficulty.”

See it: difficulty is distance to the boundary

The widget below shows a 2D dataset with a linear decision boundary you can rotate and slide. Each point is colored by its margin-difficulty under that boundary: deep colors are easy (far from the line, confidently correct), pale gray points sit near the boundary (hard), and points on the wrong side glow red (misclassified, hardest). Drag the slider to move the boundary and watch difficulty recompute live. This is what a difficulty score “sees.”

Difficulty = Distance to the Boundary

Move the boundary. Points far from it on the correct side are easy (bold); points hugging it are hard (pale); points on the wrong side are misclassified (red ring). The histogram on the right is the difficulty distribution.

Boundary angle 20°
Boundary offset 0.00
Common misconception. “Hard means the label is wrong.” No. The hardest correctly-labeled examples are the ones near the true boundary — the genuinely ambiguous cases that carry the most information about where the boundary goes. Mislabeled examples are a separate problem (Chapter 8): they also score as “hard,” which is exactly why feeding hard examples first to a fragile model is dangerous — you can't yet tell an informative boundary case from a corrupted label.
You compute difficulty as per-example loss under the current model at step 0 (random weights). What's the problem?

Chapter 2: The Classical Recipe — Bengio 2009

We have a difficulty score (Chapter 1). The next question is what to do with it. The first clean formalization came from Yoshua Bengio and colleagues in 2009, in a paper titled simply “Curriculum Learning.” Their framing is worth understanding exactly, because every later method is a variation on it.

Bengio's idea: instead of training on the fixed data distribution P(z) (where z is an example), train on a sequence of distributions Qλ(z) that start easy and gradually morph into P. A single knob λ slides from 0 (easiest distribution) to 1 (the real distribution).

Qλ(z) ∝ Wλ(z) · P(z),   0 ≤ λ ≤ 1

Read this carefully. P(z) is the original probability of example z. Wλ(z) is a weight between 0 and 1 that says how much we let z into training at curriculum stage λ. We multiply them and renormalize (that is the “∝”) so the weighted thing is still a probability distribution. Bengio imposes two conditions that make this a genuine easy-to-hard schedule:

The simplest realization — the one most people actually use — is a hard threshold gate. Sort examples by difficulty. Pick a threshold; everything easier than the threshold has weight 1, everything harder has weight 0. Raise the threshold over training. This is the “baby step” or “one-pass” curriculum.

Why “increasing entropy” is the real definition. A curriculum is not just “train on easy stuff.” It is a continuous deformation from a narrow, easy distribution to the true one. The entropy condition guarantees you always end at the correct objective P — the curriculum is a path to the real problem, never a replacement for it. Forget this and you get a model that's brilliant at easy cases and never learns the hard ones.

Worked example: a reweighting by hand

Suppose 6 examples with difficulty scores (0 = easiest, 1 = hardest) and equal original probability P(z) = 1/6 each:

zdifficulty dP(z)
z₁0.101/6
z₂0.251/6
z₃0.401/6
z₄0.601/6
z₅0.801/6
z₆0.951/6

Use a threshold curriculum with threshold λ (an example is “in” if d ≤ λ). At λ = 0.5, the weights are W = [1, 1, 1, 0, 0, 0] — only z₁, z₂, z₃ are admitted. The unnormalized distribution is [1/6, 1/6, 1/6, 0, 0, 0]. Renormalizing (divide by the total, 3/6 = 0.5):

Q0.5 = [⅓, ⅓, ⅓, 0, 0, 0]

So at λ = 0.5 we sample uniformly among the three easiest examples, each with probability 1/3. Now advance to λ = 0.7: weights become [1,1,1,1,0,0], four examples in, each gets 1/4. At λ = 1.0: all six in, each back to 1/6 — we have recovered the original P exactly. That last equality is the “curriculum ends at the true objective” condition, made arithmetic.

From scratch: a threshold curriculum sampler

python
import numpy as np

def curriculum_batches(difficulty, n_steps, batch=32, lam0=0.2):
    # difficulty: (N,) array, already normalized to [0,1] (0=easy)
    order = np.argsort(difficulty)        # easy → hard indices
    N = len(difficulty)
    for step in range(n_steps):
        # competence: fraction of data unlocked, grows lam0 → 1
        lam = min(1.0, lam0 + (1-lam0) * step / n_steps)
        n_avail = max(batch, int(lam * N))   # how many easiest examples are in
        pool = order[:n_avail]                # the admitted subset
        idx = np.random.choice(pool, batch)   # sample uniformly from it
        yield idx                             # indices for this training batch

That is the entire classical curriculum in nine lines. The model never sees a hard example until lam has risen enough to admit it into pool. Notice the data flow: difficulty scores (N,) → sorted order (N,) → a growing prefix pool → sampled indices (batch,) → the actual mini-batch fed to the model. The only thing that changes over training is how long that prefix is.

See it: the distribution morphing from easy to full

The widget shows the sampling probability over six difficulty bins as you slide λ from 0 to 1. At low λ all the probability mass sits on the easy bins (left). As λ rises, mass spreads rightward until, at λ = 1, every bin is equally likely — you have arrived at the true distribution P. The entropy readout below makes the “increasing entropy” condition literal.

Qλ: The Training Distribution Morphing Easy → Full

Bars = sampling probability per difficulty bin (left = easy). Slide λ. Watch mass flow rightward and the entropy climb toward its maximum at λ = 1.

Curriculum λ 0.30
Gate softness 0.05
Common misconception. “Curriculum means I drop the hard examples to make training easier.” The opposite — the schedule must finish on the full distribution including every hard example. If λ never reaches 1, you have trained a model that systematically ignores the hardest, often most important, cases. The curriculum changes the order and timing, never the final target.
In Bengio's formulation, what must be true of Qλ when λ = 1?

Chapter 3: Pacing — How Fast to Open the Gate

The threshold curriculum in Chapter 2 had a knob λ that we slid from its start value to 1. But how should it slide? Linearly? Quickly at first, then slow? The schedule that maps training step to “fraction of data unlocked” is called the pacing function (or competence function), written c(t): the fraction of the dataset, easiest-first, that the model is allowed to train on at step t.

This choice matters more than people expect. Open the gate too fast and the curriculum barely differs from random — the hard examples flood in before the foundation is built. Open it too slowly and you waste most of training on trivial examples the model already mastered, never reaching the hard cases with enough steps left to learn them.

Three pacing functions

Let t̃ = t/T be normalized progress (0 at the start, 1 at the end) and c0 the initial competence (the fraction unlocked at step 0, e.g. 0.1). The common families:

linear:   c(t) = min(1,  c0 + (1 − c0) · t̃)
root:      c(t) = min(1,  √(c02 + (1 − c02) · t̃))
geometric:   c(t) = min(1,  2(log₂c0)(1−t̃))

The linear one is obvious. The root pacing function, from Platanios et al. (2019), is the clever one, and worth understanding why it has that square-root shape. Here is the reasoning.

Why the square root? Platanios designed c(t) so that every example gets the same expected number of training appearances. Easy examples are unlocked early, so without correction they'd be sampled far more often than hard ones. If you want each example, once unlocked, to contribute equally over the remaining training, the competence must grow like √t̃ — the area under c(t) up to the moment an example unlocks must be balanced. The root shape spends more of the early budget on the small easy set (because that's all there is), then accelerates once the easy material is saturated.

Worked example: root pacing by hand

Take c0 = 0.2 (start with the easiest 20%) and evaluate the root function at three moments. Recall c(t) = √(c02 + (1 − c02)·t̃), and c02 = 0.04, so (1 − 0.04) = 0.96.

progress t̃inside the rootc(t) = √(·)% data unlocked
0.00 (start)0.04 + 0.96×0 = 0.040√0.040 = 0.20020%
0.250.04 + 0.96×0.25 = 0.280√0.280 = 0.52953%
0.500.04 + 0.96×0.50 = 0.520√0.520 = 0.72172%
1.00 (end)0.04 + 0.96×1 = 1.000√1.000 = 1.000100%

Look at the shape: by the halfway point (t̃ = 0.5) we have already unlocked 72% of the data. Compare the linear schedule at t̃ = 0.5: c = 0.2 + 0.8×0.5 = 0.6, only 60%. The root function front-loads — it gets through the easy material faster and leaves the back half of training mostly on the full, hard distribution, which is where the real learning of difficult cases happens.

From scratch: competence-based sampling

python
import numpy as np

def competence(t, T, c0=0.1, kind='root'):
    tn = t / T
    if kind == 'linear':
        c = c0 + (1 - c0) * tn
    elif kind == 'root':
        c = np.sqrt(c0**2 + (1 - c0**2) * tn)   # Platanios 2019
    elif kind == 'geom':
        c = 2**(np.log2(c0) * (1 - tn))
    return min(1.0, c)

# cdf[i] = cumulative difficulty rank of example i in [0,1], easiest first
def sample_batch(cdf_rank, t, T, batch=32):
    c = competence(t, T, kind='root')
    # an example is available iff its difficulty rank ≤ current competence
    available = np.where(cdf_rank <= c)[0]
    return np.random.choice(available, batch)

The pattern: convert difficulty to a rank in [0,1] (so the median example has rank 0.5), then admit any example whose rank is below the current competence c(t). At c(t) = 0.72, the easiest 72% of examples are eligible. The model's data diet literally widens over time according to the curve.

See it: the unlocking front sweeping over training

The plot below puts normalized training progress on the x-axis and difficulty rank (easy at bottom, hard at top) on the y-axis. The curve is c(t); the shaded region beneath it is the data the model is allowed to train on at each moment. Switch between pacing functions and drag c0. Watch how root bulges upward early (front-loading the easy material) while linear is a straight ramp and geometric dawdles before rushing at the end.

The Unlocking Front: Difficulty Available vs Training Progress

Shaded = unlocked (eligible to train on). x = progress, y = difficulty rank. Toggle the pacing function and slide c0. The vertical line marks the current step; the readout shows what fraction is unlocked there.

Initial competence c0 0.10
Current step t̃ 0.40
Common misconception. “Linear pacing is the safe default.” Linear is usually the worst of the three. It spends too much of the early budget on a small easy set (over-training it) and reaches the full distribution only at the very end, leaving few steps to actually learn the hard examples. Root pacing — saturate the easy set quickly, then live on the full distribution — tends to win in practice. Geometric is the gentlest start, useful when even the “easy” examples are hard for a fresh model.
At t̃ = 0.5 with c0 = 0.2, root pacing unlocks ~72% of data while linear unlocks 60%. What does this tell you about root's design?

Chapter 4: Self-Paced Learning — Let the Model Choose

Every method so far needs a difficulty score we define before training, by hand or from a reference model. But what is hard for a human — a long sentence, a cluttered image — may not be what is hard for this model right now. The cleanest fix: stop guessing difficulty and let the model rank its own examples by their current loss, and decide for itself which to study. This is self-paced learning (SPL), introduced by Kumar, Packer, and Koller in 2010.

The trick is to make example selection part of the optimization. Introduce a binary selection variable vi ∈ {0,1} for each example: 1 means “include it in training right now,” 0 means “skip it for now.” Then jointly minimize over both the model weights w and the selection v:

minw,v   ∑i vi Li(w)  −  λ ∑i vi

Stare at this. The first term is the usual loss, but only summed over selected examples. The second term, −λ∑vi, is a reward for selecting examples (it lowers the objective when vi = 1). So there is a tug of war: including an example adds its loss Li (bad) but earns a bonus λ (good). The model includes example i only if doing so is worth it.

The closed-form selection rule

Here is the beautiful part. Fix the weights w, and ask: for each example, set vi to 0 or 1 to minimize the objective? The contribution of example i is vi(Li − λ). If Li < λ, that term is negative when vi = 1, so we include it (it lowers the objective). If Li > λ, including it would raise the objective, so we set vi = 0. The optimal selection is simply a threshold on the loss:

vi* = 1  if  Li < λ,   else 0

So λ is a loss threshold: train only on examples the model already handles better than λ. And here is the curriculum: start λ small (only the very easiest, lowest-loss examples are admitted) and grow it over training. As λ rises, examples that used to be too hard cross the threshold and join. Kumar et al. call λ the “age” of the model — an older, wiser model admits harder material.

The self-correcting loop. Notice it is circular in the best way: the model trains on easy (low-loss) examples → that training lowers the loss on nearby examples → those examples now fall under λ and get admitted → training on them lowers loss on the next ring out. The curriculum emerges from the data and the model's own progress — nobody hand-labeled difficulty. The model paces itself, hence the name.

Worked example: SPL selection by hand

Five examples with current losses, and the age parameter set to λ = 0.4:

exampleloss LiLi − λvi* (include?)
e₁0.100.10 − 0.40 = −0.301 — include (negative = good)
e₂0.300.30 − 0.40 = −0.101 — include
e₃0.500.50 − 0.40 = +0.100 — skip (too hard for now)
e₄2.002.00 − 0.40 = +1.600 — skip
e₅0.350.35 − 0.40 = −0.051 — include

Selection v* = [1, 1, 0, 0, 1] — three examples in. Now run a few SGD steps on those three. Their losses drop; suppose e₃'s loss falls to 0.38 as a side effect (it was near the ones we trained on). Raise the age to λ = 0.55. Re-threshold: now e₃ (0.38) and possibly more cross in. The pool grows organically. e₄ (loss 2.00) — perhaps an outlier or mislabeled point — stays excluded until very late, which is exactly the robustness we want.

From scratch: the SPL alternating loop

python
import torch, torch.nn.functional as F

def spl_epoch(model, opt, X, y, lam):
    # STEP 1 — fix w, choose v: per-example loss, threshold at lam
    model.eval()
    with torch.no_grad():
        per = F.cross_entropy(model(X), y, reduction='none')  # (N,)
    v = (per < lam).float()                  # closed-form v* = 1[L_i < lam]

    # STEP 2 — fix v, update w: train only on selected examples
    model.train()
    logits = model(X)
    per = F.cross_entropy(logits, y, reduction='none')
    loss = (v * per).sum() / (v.sum() + 1e-8)   # masked mean
    opt.zero_grad(); loss.backward(); opt.step()
    return v.sum().item()                    # how many were active

# Outer loop: grow the age lam each epoch (the curriculum)
lam = 0.2
for epoch in range(50):
    n_active = spl_epoch(model, opt, X, y, lam)
    lam *= 1.1          # age the model: admit harder examples over time

The data flow is a two-step alternation per epoch: (1) freeze the weights, compute per-example losses, threshold to get the mask v; (2) freeze v, take a gradient step on the masked loss. The v * per product zeroes out the gradient from any example the model has deemed too hard right now — those examples contribute literally nothing to the update.

See it: the self-pacing loop in action

Twelve examples, each a bar showing its current loss. The horizontal line is the age threshold λ. Bars below the line are selected (teal); bars above are skipped (gray). Press Train step: selected examples' losses drop (the model is learning them), occasionally pulling a neighbor below the line. Raise λ to age the model and admit harder examples. Notice the stubborn tall bar — an outlier that resists until λ climbs high, modelling SPL's built-in robustness to bad data.

Self-Paced Learning: Threshold the Loss, Then Train

Bars = per-example loss. Line = age λ. Below the line = selected (teal), trained this step; above = skipped (gray). Train repeatedly, then raise λ.

Age λ (threshold) 0.45
Common misconception. “SPL just discards hard examples, so it must underfit them.” No — if λ grows large enough by the end, every example is eventually admitted (same “ends on full distribution” rule as Chapter 2). The point is timing: a hard or mislabeled example contributes its disruptive gradient only after the model is mature enough to interpret it. SPL's danger is the opposite — if you grow λ too slowly, a genuinely hard but correct example might never get its fair share of training.
In the SPL objective minw,v ∑ viLi − λ∑vi, why does the optimal vi become 1 exactly when Li < λ?

Chapter 5: Automatic Curriculum — Learning the Order

Self-paced learning still uses one fixed rule: “train on whatever has low loss.” But maybe the best thing to train on is not the easiest, nor the hardest, but the example at the edge of your ability — hard enough to teach you something, easy enough that you can actually learn it. Psychologists call this the zone of proximal development. Can a system discover that zone automatically, and move it as the model improves? Yes — with a teacher.

In automated curriculum learning (Graves et al., 2017), we have a student (the model being trained) and a teacher (a policy that, each step, picks which kind of example or task to feed the student). The teacher's goal is to maximize the student's learning progress — train on whatever the student is currently improving on the fastest.

The bandit formulation

Frame it as a multi-armed bandit. Each “arm” is a difficulty bucket or task category (say: 1-digit addition, 2-digit, 3-digit, …). Pulling an arm = training the student on a batch from that bucket. The reward is not accuracy — it is how much the student improved from that batch:

rewardk = | Lbefore − Lafter |   (the learning progress on arm k)

This reward signal is the whole insight. Think about what it does to the three regimes:

Why “progress,” not “loss”? If the teacher chased high loss, it would obsess over impossible examples (high loss forever) and ignore the learnable middle. If it chased low loss, it would re-drill mastered material. Chasing the change in loss automatically tracks the moving frontier of learnability — and as the student masters one bucket, that bucket's progress dries up and the teacher's attention slides to the next harder bucket. The curriculum emerges with nobody specifying the order.

The teacher's update rule (EXP3-style)

The teacher keeps a weight wk per arm and samples arm k with probability proportional to exp(wk) (a softmax / exponential-weights scheme that handles the explore-exploit tradeoff). After observing reward rk, it bumps that arm's weight:

wk ← wk + η · r̃k,   pk = exp(wk) / ∑j exp(wj)

where r̃k is the (normalized) learning-progress reward and η is the teacher's learning rate. Arms that yield progress get sampled more; arms that stall fade. Because progress is transient — it dries up once mastered — the probability mass travels across the arms over training, easy to hard, all on its own.

Worked example: one teacher update

Three arms with weights w = [0, 0, 0], so initial probabilities are uniform: p = [⅓, ⅓, ⅓]. The teacher samples arm 2, trains the student there, and measures loss before = 1.40, after = 1.05. Reward r₂ = |1.40 − 1.05| = 0.35. With teacher rate η = 1.0, update w₂ ← 0 + 1.0×0.35 = 0.35. New weights w = [0, 0.35, 0]. Recompute probabilities:

p = [e0, e0.35, e0] / (e0+e0.35+e0) = [1, 1.419, 1] / 3.419 = [0.293, 0.415, 0.293]

Arm 2's probability jumped from 0.333 to 0.415 because it produced progress. If a later pull of arm 2 yields less progress (the student is mastering it), its weight grows more slowly than a freshly-learnable arm, and the balance tips toward whichever arm is now in the proximal zone.

From scratch: an EXP3 teacher

python
import numpy as np

class TeacherEXP3:
    def __init__(self, n_arms, eta=2.0):
        self.w = np.zeros(n_arms)      # log-weights, one per difficulty bucket
        self.eta = eta
        self.prev_loss = None

    def probs(self):
        z = self.w - self.w.max()     # softmax over arms (numerically safe)
        e = np.exp(z); return e / e.sum()

    def choose(self):
        return np.random.choice(len(self.w), p=self.probs())

    def update(self, arm, loss_before, loss_after):
        # reward = learning progress = how much loss dropped on this batch
        reward = abs(loss_before - loss_after)
        self.w *= 0.96                  # gentle decay → stays responsive as frontier moves
        self.w[arm] += self.eta * reward   # bump the arm that produced progress

# training loop: teacher picks the bucket, student trains, teacher learns
teacher = TeacherEXP3(n_arms=5)
for step in range(n_steps):
    k = teacher.choose()                 # which difficulty bucket?
    Lb = eval_loss(student, bucket[k])  # loss before
    train_on(student, bucket[k])        # one student SGD step on that bucket
    La = eval_loss(student, bucket[k])  # loss after
    teacher.update(k, Lb, La)            # reward = |Lb - La|

The data flow is a tight loop between two agents: the teacher emits an arm index k (a scalar), the student trains on bucket k and reports two loss scalars, the teacher converts their difference into a reward and updates one weight. No difficulty was ever labeled by a human — the ordering is a side effect of always chasing progress. The decay factor 0.96 is the subtle ingredient: it lets a once-favored arm fade as its progress dries up, so the probability mass keeps migrating to the new frontier.

See it: the teacher discovering the curriculum

Five difficulty buckets (left = easiest, right = hardest). Each bucket can only be learned once the one before it is mostly mastered — a chain of prerequisites. The filled portion of each bar is the student's mastery; the bright cap is the teacher's current probability of picking that bucket. Press Run teacher and watch the bright attention band sweep left-to-right: the teacher always sits on the learnable frontier, never told the order, discovering it from learning-progress alone.

Teacher-Student: Attention Riding the Learnable Frontier

Tall fill = student mastery of that bucket. Bright cap = teacher's pick probability. Run it: attention rides from easy to hard as each bucket is mastered. Reset and run again — same emergent easy→hard order, never hard-coded.

Teacher rate η 2.5
Common misconception. “The teacher should just always train on the hardest unmastered bucket.” That is exactly what fails: the hardest unmastered bucket usually has no learnable signal yet (its prerequisites aren't in place), so progress is zero and training there is wasted. The teacher must find the frontier — the hardest bucket that is currently learnable — which is why it rewards progress, not difficulty. This is also why automatic curricula are essential in reinforcement learning with sparse rewards, where most tasks give zero signal.
Why does the teacher use |Δloss| (learning progress) as reward instead of the loss value itself?

Chapter 6: The Curriculum Trainer — Break It Yourself

Everything so far — difficulty scores, threshold gates, pacing functions, self-pacing, teachers — comes together here. This is a real (tiny) neural network training live in your browser on the two-moons problem, and you choose its data strategy. A faint random-order baseline trains alongside every run so you can always see whether the curriculum actually helped.

Your job in this chapter is to break things on purpose and build intuition for when curricula win and when they don't:

Live Curriculum Trainer (vs. random baseline)

Left: the data and the forming decision boundary. Bright points are currently unlocked for training; dim points are held back. Right: test accuracy over time — your strategy (teal) vs the random baseline (gray). Pick a strategy, press Train, and meddle.

Label noise 15%
Pacing (for easy/hard) root
Speed

Pick a strategy and press Train.

What to take away. On noisy or small data, the curriculum line ends clearly above the random line and gets there with far less thrashing. On clean, abundant, well-separated data, the two lines converge — the curriculum is neither helping nor hurting much. That gap, and how it depends on noise, is the practical theory of curriculum learning. The next chapter scales this idea up to billion-token language models; the chapter after that confronts the cases where curricula quietly do nothing.
Common misconception. “If easy-first is good, hard-first must be bad, always.” Try clean data with hard-first: it often matches random, and in some real settings (well-initialized models, hard-negative mining) it even wins. The trainer is here precisely to dissolve absolute rules. Curriculum is a tool whose value depends on label noise, data size, and how fragile the model is early on — not a universal law.

No quiz this chapter — the simulation is the test. If you can predict, before pressing Train, whether the curriculum line will beat the baseline for a given noise level, you understand this lesson.

Chapter 7: Curriculum at Scale — Data Mixing & DoReMi

At the scale of a frontier language model, “curriculum” rarely means ordering individual sentences. The data is split into domains — web text, code, books, Wikipedia, arXiv, math — and the real lever is the mixture weights: what fraction of each training batch comes from each domain. Choosing those weights well is one of the highest-leverage decisions in pretraining, and it is, in spirit, a curriculum over where the model's attention goes.

Call the mixture weights α = (α1, …, αK), one per domain, summing to 1. Each training batch samples domain k with probability αk. The naive choice is to set αk proportional to how much data each domain has — but the biggest domain (usually scraped web text) is not necessarily the most valuable per token.

Common misconception. “Train on more of the biggest pile.” Web text dwarfs every other domain, so proportional sampling drowns the model in it. But a small, dense domain like code or math can carry far more reasoning signal per token. The whole point of learned mixtures is to up-weight the domains that punch above their token count and stop over-feeding the model the abundant-but-redundant ones.

The DoReMi idea: optimize the mixture with a tiny proxy

DoReMi (Xie et al., 2023, “Domain Reweighting with Minimax Optimization”) finds good mixture weights without training the big model repeatedly. The recipe has three steps, and the data-flow is the point:

1. Reference
Train a small reference model on the naive (e.g. token-proportional) mixture. Record its per-domain loss refk.
2. Proxy + DRO
Train another small proxy model, but adjust α on the fly using Group DRO to maximize the worst-case excess loss over domains.
3. Transfer
Take the average α the proxy converged to and use it to train the big model. One expensive run, good weights.

The heart is step 2. Define the excess loss of domain k as how much worse the proxy is than the reference on that domain:

excessk = Lkproxy − Lkref

A large positive excess means “the proxy is doing much worse than a reference model could here — there is room to improve, so spend more budget on this domain.” DoReMi raises α on exactly those domains via an exponentiated-gradient update:

αk ← αk · exp(η · excessk),   then renormalize so ∑k αk = 1

This is a minimax game: the weights α maximize excess loss (seeking the worst-served domain), while the proxy's parameters minimize the α-weighted loss (fixing whatever the weights expose). The equilibrium is a mixture that protects the worst-case domain — no domain is left badly under-served relative to what a model could achieve.

Worked example: one DoReMi reweighting

Four domains, current weights α = [0.25, 0.25, 0.25, 0.25], excess losses just measured, teacher rate η = 1.0:

domainexcesskα·eη·excess
Web0.050.25 × e0.05 = 0.25 × 1.051 = 0.2628
Code0.400.25 × e0.40 = 0.25 × 1.492 = 0.3730
Books0.100.25 × e0.10 = 0.25 × 1.105 = 0.2763
Wiki0.020.25 × e0.02 = 0.25 × 1.020 = 0.2551

Sum of the unnormalized column = 0.2628 + 0.3730 + 0.2763 + 0.2551 = 1.1672. Divide each by that total to renormalize:

α' = [0.225, 0.320, 0.237, 0.219]

Code — the domain with the biggest excess loss — jumped from 0.25 to 0.32, while Wiki (almost no excess) drifted down to 0.219. One step nudged the budget toward the domain with the most unrealized improvement. Iterate this and α settles on a mixture that no domain can complain about. Notice the data flow: per-domain losses (K,) → excess (K,) → multiplicative weight update (K,) → renormalize → new sampling distribution for the next batches.

See it: the domain mixer — you vs DoReMi

Four domains. Drag the weight sliders to set the mixture by hand and watch the worst-case excess loss (the number DoReMi tries to push down). Then hit DoReMi step and let the exponentiated-gradient update do it for you — it automatically pours weight into the worst-served domain until the bars even out. Try to beat it by hand; it is harder than it looks.

Domain Mixer: Minimize the Worst-Case Excess Loss

Each domain shows its mixture weight αk (bar height) and its current excess loss (red marker). Worst-case excess is what matters. Adjust by hand, or let DoReMi auto-balance.

Web α0.55
Code α0.15
Books α0.18
Wiki α0.12
The connection back to Chapter 5. DoReMi's exponentiated-gradient update on α is the same machinery as the teacher's EXP3 update on arm weights — multiply by exp(η·signal), renormalize. The teacher chased learning progress; DoReMi chases excess loss. Both are automatic curricula; one orders examples, the other balances domains. Once you see the exponential-weights pattern, you see it everywhere in this field.
DoReMi raises a domain's mixture weight when its excess loss (proxy loss minus reference loss) is large. Why is excess loss the right signal, rather than raw loss?

Chapter 8: When It Doesn’t Help — and Anti-Curriculum

Here is the honest chapter, the one most tutorials skip. Curriculum learning is not free magic. On large, clean, well-curated datasets trained with a good optimizer and proper shuffling, carefully-built curricula often produce no measurable gain — and sometimes a small loss from the extra complexity and the time spent over-training easy examples. Worse, a famous body of work argues for the opposite strategy: train on the hardest examples first or most. Both camps have strong empirical support. How can both be right?

The case for hard-first (anti-curriculum)

Hard-example mining — deliberately over-sampling the examples the model gets wrong — is everywhere in production:

These work because, for a model that is already reasonably trained, easy examples are redundant: they produce tiny gradients and teach nothing new. The information is concentrated in the hard examples. So the field holds two seemingly-contradictory beliefs at once: easy-first and hard-first. The resolution is that they apply at different times and under different conditions.

The reconciliation. Difficulty's value flips across training and across data quality. Early, with a fragile model and possibly-noisy labels, easy examples build a stable foundation and hard examples mislead — easy-first wins. Late, with a competent model and clean labels, easy examples are redundant and hard examples carry all the remaining signal — hard-first (mining) wins. The sophisticated recipe is not “easy” or “hard” but easy-then-hard: a curriculum to bootstrap, followed by hard-example mining to polish.

The decisive variable: label noise

The single best predictor of whether easy-first helps is label noise. Mislabeled examples have high loss — they look exactly like “hard” examples to any loss-based difficulty score. A hard-first strategy therefore amplifies the noise: it pours training onto the corrupted labels. An easy-first strategy quarantines them: the noisy examples stay locked behind the difficulty gate until late, by which point the model is robust enough to discount them (or training has ended). This is why self-paced learning was originally pitched as a robustness method.

SituationBest strategyWhy
Noisy / weakly-labeled dataEasy-firstquarantines high-loss mislabeled examples
Small datasetEasy-firstfragile model needs a stable foundation
Sparse-reward RL / hard explorationAuto-curriculummost tasks give zero signal; teacher finds the frontier
Very hard target taskEasy-then-hardbootstrap simple skills, then specialize
Large, clean, balanced dataOften noneshuffling + good optimizer already suffices
Late-stage fine-tuning, clean labelsHard-first (mining)easy examples are redundant; signal is in the hard ones

Worked intuition: why noise flips the sign

Imagine 1000 examples, 200 of them mislabeled. A loss-based difficulty score ranks all 200 mislabeled ones among the “hardest” (their labels contradict the pattern, so loss stays high). Easy-first with competence reaching, say, 0.8 by mid-training trains almost exclusively on the 800 clean examples first — the model learns the true pattern from clean data, then meets the noise late and largely ignores it. Hard-first does the reverse: it front-loads the 200 corrupted examples, teaching the model the wrong pattern before it ever sees the clean majority. Same data, opposite outcomes, decided entirely by ordering.

See it: the advantage gap vs. label noise

The plot sweeps label noise on the x-axis and shows final test accuracy for easy-first (teal) vs random (gray). At zero noise the lines nearly coincide — curriculum buys you almost nothing. As noise rises, the random line plunges (it trains on corrupted labels indiscriminately) while easy-first degrades gracefully. The vertical gap between them is the value of the curriculum. Drag the dataset-size slider: smaller datasets widen the gap at every noise level.

When Does Easy-First Win? Accuracy vs. Label Noise

Teal = easy-first, gray = random. The gap is the curriculum's payoff — near zero on clean data, large under noise. Drag dataset size and the noise marker.

Dataset size small
Label noise marker 20%

Common misconception. “A published curriculum result means it'll help my task.” Curriculum gains are notoriously dataset-dependent and often shrink or vanish under careful baselines (well-tuned shuffling, longer training). Before investing, ask: is my data noisy or small? Is my task a hard exploration / sparse-reward problem? If your data is large and clean and your optimizer is solid, spend your effort elsewhere — the curriculum is likely to be a wash.
OHEM, focal loss, and hard-negative mining all emphasize hard examples — the opposite of easy-first curriculum. How do both ideas coexist?

Chapter 9: Connections & Cheat Sheet

You now have the full toolbox: how to score difficulty, how to gate it, how fast to open the gate, how to let the model pace itself, how to make a teacher discover the order, how the same idea scales to LLM domain mixtures, and when to not bother at all. This closing chapter is your reference card.

The five strategies at a glance

MethodWho picks the orderDifficulty signalBest for
Classical (Bengio ’09)You, in advancefixed score + threshold λtasks with a clear human notion of difficulty
Competence-basedYou, via a pacing functionfixed score + c(t) schedulewhen you want smooth, principled pacing (root)
Self-paced (SPL)The modelcurrent per-example loss < λnoisy data, no good a-priori difficulty
Teacher-studentA learned policylearning progress |ΔL|RL, multi-task, sparse rewards
DoReMi (domain)Minimax proxyexcess loss vs referenceLLM pretraining data mixtures

The cheat sheet — every formula in one place

Difficulty (margin):  m = ztrue − zother  (high = easy)
Difficulty (loss):  di = Li(w)  (high = hard)
Classical reweighting:  Qλ(z) ∝ Wλ(z)·P(z),  Q1 = P
Root pacing:  c(t) = √(c02 + (1−c02)·t/T)
Self-paced selection:  vi* = 1[Li < λ],  λ grows over time
Teacher reward:  rk = |Lbefore − Lafter|,  pk = softmax(w)k
DoReMi update:  αk ← αk·exp(η·excessk), renormalize

A decision guide

Is your data noisy or small?
Yes → easy-first / self-paced. The gate quarantines bad labels.
Sparse-reward RL or many tasks?
Yes → automatic teacher-student curriculum on learning progress.
Pretraining an LLM over domains?
Yes → learn the mixture (DoReMi), don’t just go proportional.
Large, clean, balanced data?
Probably skip it — or use hard-example mining late to polish.

Where this connects

Curriculum learning is a layer that sits on top of the rest of the training stack you have been building in this series:

The one thing to remember. Order is a hyperparameter. Most of the time we leave it on “shuffle” and never think about it — and for large clean datasets that's fine. But when data is noisy, scarce, or the task is a hard exploration problem, the order you show examples in can matter as much as the architecture or the learning rate. Curriculum learning is the discipline of treating that order as something you design, not something you forget.
You're pretraining a language model and want to set the web/code/books/wiki mixture. Which approach fits?

“Education is not the filling of a pail, but the lighting of a fire” — and you light it one easy spark at a time.