Robust Estimation — RANSAC, Huber & M-Estimators

Chapter 0: One Bad Point Wrecks Everything

You are calibrating a robot's wheel encoder against a ground-truth track. You collect five clean readings that fall almost perfectly on a straight line. Then a sixth reading comes in — the wheel slipped on a wet tile, and the sensor reports a value wildly off. You feed all six points into ordinary least squares, the workhorse line-fitter. What happens?

The line lurches. Not a little — a lot. That single corrupted point grabs the fit and drags it toward itself, ruining the estimate for all five good points. This is the central scandal of least squares: it has no defense against outliers. A measurement infinitely far away has infinite pull.

Drag the bad point in the simulation below and watch the orange least-squares line chase it, while the five honest points sit there ignored.

Least squares chasing a single outlier

Five points lie on the true line y = x (teal). Move the slider to drag the sixth point up. The orange least-squares fit tilts to chase it.

outlier height2.0

Why is the pull so violent? Because least squares minimizes the sum of squared residuals. A point that is 10 units off contributes 100 to the cost. The optimizer would rather tilt the whole line — nudging five good residuals up to 2 or 3 each — than leave that one 100-unit penalty sitting there. Squaring makes far points scream, and the fit caves to the loudest voice in the room.

By hand: watch the slope move

Take the five clean points (0,0), (1,1), (2,2), (3,3), (4,4) — perfectly on y = x. The least-squares slope is exactly 1, intercept 0. Now add a sixth point at (2, 10): same x as the middle, but y ten units too high.

The slope of the best-fit line is the covariance of x and y divided by the variance of x. Let's compute it with the outlier included. The means become x̄ = (0+1+2+3+4+2)/6 = 12/6 = 2.0, and ȳ = (0+1+2+3+4+10)/6 = 20/6 ≈ 3.33.

slope a = Σ(x_i−x̄)(y_i−ȳ) / Σ(x_i−x̄)²

The denominator (variance of x, unnormalized) is: (−2)² + (−1)² + 0² + 1² + 2² + 0² = 4+1+0+1+4+0 = 10. The numerator, summing (x_i−2)(y_i−3.33): for the five clean points it is (−2)(−3.33)+(−1)(−2.33)+(0)(−1.33)+(1)(0.67)+(2)(1.67) = 6.67+2.33+0+0.67+3.33 = 13.0; the outlier (2,10) adds (0)(6.67) = 0. So numerator = 13.0.

Slope a = 13.0 / 10 = 1.30, and intercept b = ȳ − a·x̄ = 3.33 − 1.30×2 = 0.73. The line went from y = x to y = 1.30x + 0.73. One point moved the slope by 30% and lifted the whole line off the floor — even though it sat at the center x, where it has the least leverage. Put the same outlier at x = 4 and the damage is far worse.

Misconception: “More data will dilute the bad point.” It won't, if the outlier is gross. With 100 good points and one measurement at y = 10,000, the squared term 10,000² = 100 million still dominates the entire cost. Least squares has a breakdown point of zero: a single sufficiently extreme outlier can carry the estimate arbitrarily far. We need estimators that refuse to be bullied.

There are two grand strategies, and this lesson teaches both. Down-weight the outliers so they lose their grip (M-estimators, the Huber loss, IRLS). Or vote them out: fit on tiny random samples, keep the fit that the most points agree with (RANSAC). By the end you'll know exactly when to reach for each, and how to combine them.

Why does one far-away outlier have such enormous influence on an ordinary least-squares fit?

Because least squares ignores the x-coordinate of points Because the cost is the sum of squared residuals, so a far point's penalty grows quadratically and dominates Because least squares always passes through the last data point

Chapter 1: The Influence Function

To fix least squares we first need to name its disease precisely. The right diagnostic tool is the influence function: how much does one data point pull on the estimate, as a function of its residual r (how far off it is)?

Here's the key fact. When you minimize a sum of per-point loss terms ρ(r), the optimum is where the derivative is zero — where the sum of ψ(r) = ρ′(r) balances out. That derivative ψ is the influence: it's the “force” each point exerts on the solution. So the shape of ψ tells you everything about robustness.

For ordinary least squares, ρ(r) = ½r², so the influence is ψ(r) = r. It grows without bound. Double the residual, double the pull. A point at r = 1000 pulls a thousand times harder than a point at r = 1. That linear-and-unbounded influence is exactly why a single far outlier can dominate. Toggle the loss in the widget and watch ψ for L2 shoot off the chart.

Loss ρ(r) and its influence ψ(r) = ρ′(r)

Top curve: the loss. Bottom curve: the influence (force on the fit). For L2, influence is the diagonal line — unbounded. For L1 (absolute error), influence saturates at ±1 — a far point pulls no harder than a near one.

Contrast this with the L1 loss, ρ(r) = |r|. Its influence is ψ(r) = sign(r): just +1 or −1. A point that is 10 units off and a point that is 10,000 units off pull with exactly the same force. L1 doesn't care how far the outlier is, only which side it's on. That bounded influence is the first taste of robustness.

By hand: the tug-of-war at the optimum

Suppose we're estimating a single number μ (the “location” of a cluster of points) — the simplest possible fit. The optimum sets Σ ψ(x_i − μ) = 0. Under L2, ψ is identity, so Σ(x_i − μ) = 0, which solves to μ = mean. The L2 estimate of location is the mean — and we know the mean is wrecked by outliers.

Under L1, ψ = sign, so we need Σ sign(x_i − μ) = 0 — equal numbers of points above and below. The L1 estimate of location is the median. And the median shrugs off outliers: drag one point to infinity and the median doesn't move at all. Same data, different loss, wildly different robustness — entirely explained by the shape of ψ.

Concretely, take values {1, 2, 3, 4, 100}. The mean is 110/5 = 20 — nowhere near the bulk of the data. The median is 3 — dead center of the honest points. The lone outlier dragged the mean by 17 units and the median by zero. That gap is robustness, and ψ predicted it.

Key insight: Robustness is a property of the influence function. If ψ(r) stays bounded as r → ∞, no single point can dominate. If ψ(r) actually returns to zero for huge r (we'll meet those — “redescending” estimators), then extreme outliers are not just capped but actively ignored.

Misconception: “So just always use L1, it's robust and simple.” L1 is robust but has two costs: it is non-differentiable at zero (awkward to optimize) and statistically inefficient for clean Gaussian noise — on good data it throws away information the mean would use. The art is a loss that behaves like L2 in the middle (efficient) and like L1 (or better) in the tails (robust). That compromise is the M-estimator, and its most famous member is Huber.

The influence function ψ(r) for the L1 loss is sign(r). What does that imply for an outlier 10,000 units away?

It pulls 10,000× harder than a point 1 unit away It pulls with zero force and is deleted It pulls with the same bounded force (±1) as any other point on that side

Chapter 2: M-Estimators — Swap the Loss

An M-estimator (the M is for “maximum-likelihood-type”) is breathtakingly simple in idea: keep the least-squares machinery, but replace the squared loss with a function ρ(r) that grows more slowly in the tails. Instead of minimizing Σ r_i², you minimize Σ ρ(r_i) for a smarter ρ.

Why is this principled and not just a hack? Because choosing ρ is the same as choosing a noise model. Minimizing Σ r² is maximum likelihood under Gaussian noise — thin tails, so a big residual is “impossible” and the fit strains to avoid it. If instead you believe your noise has heavy tails — that occasional large errors are normal — then −log of that heavy-tailed density gives you a ρ that flattens out, and large residuals stop being treated as catastrophes. Robust losses are just the negative-log-likelihoods of realistic, outlier-tolerant noise distributions.

Here is the family, plotted as loss ρ (how much a residual costs) and influence ψ (how hard it pulls). Click through them and watch the tails.

The M-estimator family

Top: loss ρ(r). Bottom: influence ψ(r). Watch how the robust losses flatten the cost of large residuals, and how Tukey's influence returns to zero.

Read the bottom (influence) curve like a personality test for each estimator:

Loss	Influence ψ in the tails	Behavior
L2	grows forever	not robust — the baseline
L1	flat at ±1	bounded; far points pull a constant amount
Huber	flat at ±δ	L2 in the middle, L1 in the tails — monotone
Cauchy	decays toward 0	far points pull less and less — redescending
Tukey	exactly 0 past a cutoff	extreme points pull nothing — hard redescending

Two camps emerge. Monotone influence (Huber): the pull never decreases — it caps, but a far outlier still tugs by the full amount δ. This makes the optimization convex, so there's a unique solution and you can't get stuck. Redescending influence (Cauchy, Tukey, Geman-McClure): the pull peaks and then falls back toward zero, so a sufficiently wild point is essentially erased. This is more robust but non-convex — you can land in a bad local minimum if your starting guess is poor.

By hand: Cauchy tames a screaming residual

The Cauchy (a.k.a. Lorentzian) loss is ρ(r) = (c²/2)·ln(1 + (r/c)²), with scale c. Its influence is ψ(r) = r / (1 + (r/c)²). Take c = 1. A residual r = 1 (a typical inlier) gives influence 1/(1+1) = 0.5. A residual r = 10 (an outlier) gives 10/(1+100) = 10/101 ≈ 0.099. A residual r = 100 gives 100/(1+10000) ≈ 0.01.

Read that progression: as the point goes from off to way-off to absurd, its pull goes 0.5 → 0.099 → 0.01 — shrinking. The further out it lies, the more the estimator decides it must be garbage and the less it listens. Compare L2, where those same residuals pull 1, 10, 100. Cauchy doesn't just cap the outlier — it tunes it out.

Think of it this way: An M-estimator is a trust policy. L2 trusts every point completely (and pays for it). Huber trusts you fully until you cross the line δ, then trusts you a fixed amount no matter how wild you get. Tukey trusts you up to a cutoff and then writes you off entirely. The scale parameter (δ or c) is where you draw the line between “noisy inlier” and “outlier.”

Misconception: “Redescending is strictly better — it deletes outliers, so use Tukey always.” The danger is the non-convexity: with a bad initialization, Tukey can converge to a fit that explains a cluster of outliers and redescends away the true inliers. Redescending estimators need a good starting point (often: run RANSAC or Huber first, then refine with Tukey). Convex Huber is the safe default when you can't guarantee a good initialization.

What distinguishes a “redescending” M-estimator (Cauchy, Tukey) from a monotone one (Huber)?

Redescending losses are always convex and easier to optimize The influence of far points decreases back toward zero, so extreme outliers are actively ignored — but the problem becomes non-convex They ignore the scale parameter entirely

Chapter 3: The Huber Loss in Detail

Huber's loss is the most-used robust loss in all of engineering — it's the default “robust kernel” in Ceres, g2o, and GTSAM. It is the elegant compromise we asked for at the end of Chapter 1: be least squares where the data is good, be absolute error where the data is bad, and stitch the two together smoothly.

It is defined piecewise around a threshold δ:

ρ_δ(r) = ½r² if |r| ≤ δ
ρ_δ(r) = δ(|r| − ½δ) if |r| > δ

Inside the band |r| ≤ δ, it's the familiar parabola ½r² — full least-squares efficiency on the clean data. Outside, it switches to a straight line of slope δ — the cost grows only linearly, so a far outlier can't blow up the total. Move the slider and watch the quadratic bowl hand off to two straight arms exactly at ±δ.

Huber loss: quadratic core, linear tails

The orange curve is Huber; the faint dashed parabola is pure L2. Inside ±δ they coincide; outside, Huber peels off into straight lines (slope δ). Vertical guides mark ±δ.

threshold δ1.5

Why those exact constants? Smoothness.

The piecewise definition isn't arbitrary — the constants are chosen to make the loss continuous and continuously differentiable (C¹) at the junction r = δ. Optimizers hate kinks, so this matters.

Check the value at r = δ. From the left (quadratic): ½δ². From the right (linear): δ(δ − ½δ) = δ·½δ = ½δ². They match — no jump. Now check the derivative (the influence ψ). From the left: ψ = r, which at r = δ equals δ. From the right: ψ = δ (constant). They match too — no kink in the slope. That's why the formula carries the −½δ offset: it's the exact bookkeeping that glues the two pieces together smoothly.

So the influence of Huber is ψ(r) = r clamped to [−δ, +δ]: it rises linearly like L2 until r hits δ, then flattens like L1. That single clamp is the whole robustness mechanism.

By hand: the cost of an outlier, L2 vs Huber

Let δ = 1.5 and consider an outlier with residual r = 8. Under L2, its cost is ½(8)² = 32. Under Huber, since 8 > 1.5, the cost is δ(|r| − ½δ) = 1.5×(8 − 0.75) = 1.5×7.25 = 10.875. And the force it exerts? Under L2, ψ = 8. Under Huber, ψ = δ = 1.5. The outlier's pull is slashed from 8 to 1.5 — it still nudges the fit (Huber is monotone, not redescending) but it can no longer dominate the five honest points each pulling around 1.

Choosing δ: the 1.345σ rule

δ is the boundary between “this is just noise” and “this is an outlier,” so it must be in the units of your noise. The standard choice is δ = 1.345·σ, where σ is the standard deviation of the inlier noise. That magic constant 1.345 is tuned so that on purely Gaussian data Huber retains 95% of the efficiency of least squares — you pay only a 5% statistical tax for the insurance against outliers, while points beyond ~1.3σ get gently capped.

In practice σ is unknown, so you estimate it robustly (you can't use the ordinary standard deviation — it's wrecked by the very outliers you're hunting). The go-to is the MAD, the median absolute deviation: σ̂ ≈ 1.4826·median(|r_i − median(r)|). The median-of-medians shrugs off outliers, giving a clean scale to set δ against.

Why 1.4826? For Gaussian data the MAD is about 0.6745×σ (the 75th percentile of a standard normal). Dividing by 0.6745 — i.e. multiplying by 1.4826 — rescales the MAD into an honest, outlier-proof estimate of σ. It's the robust replacement for the standard deviation.

Misconception: “Set δ tiny to be maximally robust.” Too small and Huber treats genuine noise as outliers — you down-weight good data, lose efficiency, and the fit gets noisy and uncertain. Too large and Huber degenerates back into L2, offering no protection. δ must track the real inlier noise scale; that's why it's pinned to σ, not a fixed number.

Why is the Huber loss for |r| > δ written as δ(|r| − ½δ) rather than just δ|r|?

To make the loss larger and punish outliers harder The −½δ offset makes the loss continuous and smooth (C¹) at r = δ, matching both the value ½δ² and the slope δ It has no effect; the two forms are identical

Chapter 4: IRLS — How You Actually Solve It

We have a robust loss. But minimizing Σ ρ(r_i) isn't a plain linear least-squares problem any more — ρ is nonlinear and (for Huber) piecewise. How do we actually find the fit? The beautiful answer is Iteratively Reweighted Least Squares (IRLS): turn the hard robust problem into a sequence of easy weighted least-squares problems.

The trick comes straight from the influence function. At the optimum Σ ψ(r_i)·(∂r_i/∂θ) = 0. Rewrite ψ(r) = w(r)·r by defining a weight w(r) = ψ(r)/r. Then the optimality condition becomes Σ w(r_i)·r_i·(∂r_i/∂θ) = 0 — which is exactly the normal equation for weighted least squares with weights w_i. The only snag: the weights depend on the residuals, which depend on the fit. So we iterate.

1. Fit

Solve weighted least squares with current weights (start all = 1)

↓

2. Residuals

Compute r_i = y_i − fit(x_i)

↓

3. Reweight

w_i = ψ(r_i)/r_i — outliers get tiny weights

↻ repeat until weights stop changing

For Huber, the weight is gorgeously simple: w(r) = 1 if |r| ≤ δ (inliers keep full weight), and w(r) = δ/|r| if |r| > δ (outliers get weight shrinking as 1/|r|). A point twice as far past the threshold gets half the weight. The fit literally stops listening to wild points.

IRLS converging on the corrupted data

Same five inliers + one outlier. Step IRLS and watch: dot size = current weight, the orange fit pulls back toward the true line, and the outlier's weight collapses. The bars show each point's weight.

By hand: one IRLS iteration

Start from the wrecked L2 fit of Chapter 0: y = 1.30x + 0.73, with the outlier at (2, 10) and δ = 1.5. Compute residuals r_i = y_i − (1.30x_i + 0.73):

point	prediction	residual r	\|r\| ≤ δ?	weight w = min(1, δ/\|r\|)
(0,0)	0.73	−0.73	yes	1.00
(1,1)	2.03	−1.03	yes	1.00
(2,2)	3.33	−1.33	yes	1.00
(3,3)	4.63	−1.63	no	1.5/1.63 = 0.92
(4,4)	5.93	−1.93	no	1.5/1.93 = 0.78
(2,10)	3.33	+6.67	no	1.5/6.67 = 0.22

The outlier's weight has already crashed to 0.22 — it now counts for about a fifth of a normal point. Re-solve weighted least squares with these weights and the line swings back toward y = x; the new residuals make the outlier's weight shrink further (toward ~0.05), and within a handful of iterations the fit locks onto the true line while the outlier is all but deleted. That's IRLS: each round, the bad point digs its own grave.

From scratch

python
import numpy as np

def irls_line(x, y, delta=1.5, iters=10):
    A = np.vstack([x, np.ones_like(x)]).T   # design matrix [x, 1]
    w = np.ones_like(y)                      # start: trust everyone (= plain L2)
    for _ in range(iters):
        W = np.sqrt(w)                       # fold weights into rows
        theta, *_ = np.linalg.lstsq(A * W[:,None], y * W, rcond=None)
        r = y - A @ theta                    # residuals under current fit
        absr = np.abs(r)
        w = np.where(absr <= delta, 1.0, delta / np.maximum(absr, 1e-9))  # Huber weights
    return theta, w                          # slope, intercept, and final weights

# slope/intercept ≈ (1.0, 0.0); the outlier ends with weight ≈ 0.05

That's the entire robust line-fitter — a lstsq call wrapped in a loop that recomputes weights. Swap the weight formula and you get Cauchy (w = 1/(1+(r/c)²)) or Tukey (w = (1−(r/c)²)² inside the cutoff, 0 outside). The skeleton never changes.

Key insight: IRLS is why M-estimators are practical. You never have to optimize the weird nonlinear robust loss directly — you reuse the same battle-tested linear solver you already have, just feeding it weights that say “ignore the points that don't fit.” For convex losses (Huber) it converges to the global optimum; for redescending ones it finds a local optimum, so initialization matters.

Misconception: “A zero residual means weight = ψ(r)/r blows up (0/0).” In practice you guard the division (the np.maximum(absr, 1e-9) above), and for Huber the inlier branch just sets w = 1 directly, so the singularity never bites. The weight is well-defined in the limit r→0 (it tends to 1 for Huber).

In IRLS for the Huber loss, what happens to an outlier's weight across iterations?

It grows, because the outlier is far from the line It stays fixed at 1 like every other point It shrinks (δ/|r| gets smaller as the fit ignores it), so the point loses influence each round

Chapter 5: RANSAC — Vote the Outliers Out

M-estimators down-weight outliers, but they still start from a fit over all the data — and if the outliers are numerous or clustered, even Huber's initial fit can be hopeless, and redescending losses can lock onto the wrong thing. When outliers are a large fraction of the data, we need a completely different idea. That idea is RANSAC — RANdom SAmple Consensus (Fischler & Bolles, 1981) — and it is the single most-used robust algorithm in computer vision.

The insight is almost cheeky. Instead of fitting all the data and hoping, fit a tiny random subset — the minimum needed to define the model — and then ask the rest of the data to vote: how many points agree with this fit? Repeat many times, and keep the fit with the most votes. Outliers, being random, rarely agree with each other, so a fit born from outliers gets few votes. A fit born from two true inliers gets a landslide.

1. Sample

Pick the minimal set at random (2 points for a line)

↓

2. Fit

Fit the model exactly to that minimal sample

↓

3. Score

Count inliers: points within threshold t of the model

↓

4. Keep best

If this beats the record, save it

↻ repeat N times, then refit on the winning consensus set

Step the simulation: each click draws a fresh random pair (highlighted), the candidate line through them, and the band of width ±t. Points inside the band are inliers and counted. The best-so-far line is kept. Watch how a pair of true inliers suddenly racks up a huge count, while a pair involving an outlier scores poorly.

RANSAC, one hypothesis at a time

Each step samples 2 random points (orange ring), fits a candidate line, and counts inliers inside the dashed band. The teal line is the best consensus found so far.

How many samples do you need? The core RANSAC equation

RANSAC is a probabilistic algorithm: it only succeeds if at least one of its random samples is all-inliers. So the central question is: how many iterations N guarantee that, with high probability, we drew at least one clean sample?

Let w = the fraction of points that are inliers, and s = the sample size (s = 2 for a line). The probability that a single random sample is all inliers is w^s. The probability it is not all-inliers is 1 − w^s. The probability that all N samples fail is (1 − w^s)^N. We want that failure probability below 1 − p, where p is our desired success confidence. Solving:

N = ln(1 − p) / ln(1 − w^s)

By hand: a worked count

Say half the data are outliers, so inlier fraction w = 0.5, we're fitting a line (s = 2), and we want p = 99% confidence. The chance a random pair is both inliers is w² = 0.25. So:

N = ln(1 − 0.99) / ln(1 − 0.25) = ln(0.01) / ln(0.75) = (−4.605) / (−0.2877) ≈ 16.0. Just 17 samples (round up) gives 99% confidence of hitting one clean pair — astonishingly cheap. That's the magic: you don't search all pairs, you just need one good one, and randomness finds it fast.

But watch the equation bite as outliers grow. If w = 0.2 (80% outliers!) and we fit a homography (s = 4), then w^s = 0.2⁴ = 0.0016, and N = ln(0.01)/ln(0.9984) ≈ 2,877. The sample size s is in the exponent, so the cost explodes with both the outlier ratio and model complexity. This is why minimal solvers (smallest possible s) are prized.

Key insight: RANSAC trades a hard continuous optimization (find the best line over all data) for an easy combinatorial search (find one good minimal sample). It needs no initialization, no gradients, and tolerates outlier fractions that would annihilate any M-estimator — up to 50%+ outliers is routine. The price is randomness (it's not deterministic) and the iteration cost rising with s.

Misconception: “RANSAC's output line is the fit through the 2 sampled points.” No — those 2 points only generate a hypothesis. The real estimate comes from the final step: take the largest consensus set (all inliers of the winning hypothesis) and do an ordinary (or Huber) least-squares fit over all of them. The 2-point line is just a lure to find the inlier set; the polish happens on the whole consensus.

Holding confidence p fixed, why does the required number of RANSAC iterations N rise so steeply when the outlier fraction increases?

Because each iteration gets slower to compute Because the chance a whole sample is all-inliers is w^s, which shrinks fast as the inlier fraction w drops — especially with larger sample size s Because the inlier threshold t must grow

Chapter 6: RANSAC in Practice

Vanilla RANSAC has three knobs that decide whether it works on real data: the inlier threshold, the iteration budget, and how you score hypotheses. Get them wrong and RANSAC silently returns garbage. Here's the practitioner's playbook, plus the modern variants that fix vanilla's weaknesses.

The inlier threshold t

t says how close a point must be to count as agreeing with the model. Too tight, and genuine inliers (with normal noise) get rejected, starving the consensus and making good hypotheses look bad. Too loose, and outliers sneak in, so a wrong model can win. The principled choice ties t to the inlier noise: if measurement noise is Gaussian with std σ, a common rule is t ≈ 2–3σ (for a point-to-line distance, t² follows a chi-squared law, so t = √(3.84)·σ ≈ 1.96σ captures 95% of inliers).

The threshold trade-off

Slide the inlier threshold and watch the consensus band widen. Too narrow rejects true inliers (band misses noisy points); too wide swallows outliers. The count shows how many points fall inside.

threshold t0.60

Adaptive iteration count

You usually don't know the inlier fraction w in advance — so you can't precompute N. The fix: estimate w on the fly. Start with a pessimistic w, and every time a hypothesis finds a bigger consensus set, update w = (best inlier count)/(total) and recompute N from the core equation. As you discover the data is cleaner than feared, the required N drops and you stop early. This adaptive stopping is in every real implementation.

Worked: you budgeted for w = 0.3 (N ≈ 49 for a line at p = 0.99). On iteration 5 a hypothesis captures 70% of the points, so you revise w = 0.7, recompute N = ln(0.01)/ln(1 − 0.49) ≈ ln(0.01)/ln(0.51) ≈ 7. You're already past 7 — stop now. Adaptive RANSAC just turned a 49-iteration budget into 7.

The variants worth knowing

Variant	The fix it adds
MSAC	Instead of counting inliers (0/1), score each by its truncated residual — inliers that fit better score better. Same cost, strictly better models.
LO-RANSAC	When a new best is found, run a quick local optimization (an inner least-squares / IRLS on the consensus) before continuing. Dramatically tightens the final fit and reduces needed iterations.
PROSAC	Sample guided by a quality prior (e.g. feature-match score) instead of uniformly — try the most promising correspondences first. Finds a good sample far sooner.
MAGSAC++	Eliminates the hard threshold t entirely by marginalizing over a range of noise scales — robust when you can't pin down σ. The current default in OpenCV.
USAC / GC-RANSAC	A unified framework / graph-cut spatial coherence — modern, fast, production-grade pipelines.

The degeneracy trap

RANSAC assumes a minimal sample defines the model uniquely. Sometimes it doesn't: pick 2 coincident points and the line is undefined; for a fundamental matrix, sample points all lying on one plane and the solution is degenerate (this is what DEGENSAC fixes). Real pipelines add a quick degeneracy check to discard samples that don't define a valid model before scoring them.

Misconception: “Crank N way up and RANSAC always wins.” More iterations help only up to the consensus your threshold allows. If t is mis-set (too tight), the true model never gets a high count no matter how many samples you draw — you'll confidently return a wrong fit. Threshold and scoring matter as much as iteration count; that's exactly what MSAC and MAGSAC address.

What does adaptive RANSAC do that vanilla RANSAC does not?

It removes the inlier threshold entirely It re-estimates the inlier fraction w from the best consensus found, recomputes the required N, and stops early when the data turns out cleaner than assumed It samples 3 points instead of 2 for a line

Chapter 7: Showcase — Four Estimators, One Mess

Now put it all together. Below is a dataset you control: a cloud of inliers around a true line, plus a tunable fraction of outliers and a tunable noise level. Four fits are computed live on the same data — ordinary least squares, L1, robust Huber (IRLS), and RANSAC — so you can watch exactly when each one breaks.

Robust fitting playground

Crank the outlier fraction and watch the blue L2 line peel away from the truth while Huber and RANSAC hold. Add noise to see RANSAC's threshold get tested. Regenerate for a fresh draw.

outlier fraction0.25

inlier noise0.30

L2 L1 Huber RANSAC

Things to try, and what you'll see:

Push outliers from 0 to 0.3. At zero outliers all four lines stack on the truth — L2 is even slightly best (it's most efficient on clean Gaussian data). As outliers climb, the blue L2 line is the first to bend away; the error readout for L2 shoots up while Huber and RANSAC stay flat.

Push outliers past 0.4. Now even L1 and Huber start to wobble — monotone M-estimators have a breakdown point below 50% when outliers gang up. RANSAC, which never trusted the bulk fit in the first place, keeps nailing the true line. This is the regime where RANSAC is irreplaceable.

Crank the noise with moderate outliers. Watch RANSAC's fixed threshold get stressed: when inlier noise approaches the threshold, RANSAC starts misclassifying noisy inliers as outliers and its fit gets jittery — the exact failure mode MAGSAC was built to cure. Huber, which uses a soft down-weight rather than a hard in/out cut, degrades more gracefully here.

The punchline: there is no single best estimator — it depends on the outlier fraction. Low outliers, clean noise → least squares (or Huber for safety). Moderate outliers → Huber / IRLS. Heavy or clustered outliers → RANSAC to find the inliers, then Huber least-squares to polish the fit on them. That last combo — RANSAC then robust refinement — is what real systems actually run.

No quiz here — the simulation is the test. If you can predict which line breaks first as you drag the sliders, you understand robust estimation.

Chapter 8: Breakdown Point & Real Systems

We keep saying “breaks down” — let's make it precise. The breakdown point of an estimator is the largest fraction of arbitrarily-bad outliers it can tolerate before the estimate can be carried off to infinity. It's the single most important robustness number.

Estimator	Breakdown point	Meaning
Mean / least squares	0%	One bad point can ruin it
Huber M-estimator	0% (asymptotically)*	Bounded influence but a flood of outliers on one side still drifts it
Median / L1 location	50%	Needs a majority of bad points to break
RANSAC	~50%+	Works as long as a clean minimal sample is findable
Least Median of Squares	50%	The theoretical max for a regression estimator

*Huber's influence is bounded (great against scattered outliers), but its formal breakdown point for regression is low because many outliers piled on one side bias it. Redescending estimators do better in practice but lose convexity. RANSAC and LMedS reach the 50% ceiling — you cannot beat 50%, because at half-and-half there's no way to tell inliers from outliers.

Breakdown in action: estimating a location

A cluster of points sits at the true value. Slide the outlier fraction up (outliers fly off to the right). Watch the mean get dragged immediately, the median hold firm until ~50%, then jump.

outlier fraction0.10

The deep connection: Black–Rangarajan duality

There's a beautiful theorem tying the two halves of this lesson together. Black & Rangarajan (1996) showed that every robust M-estimator is equivalent to least squares with an extra outlier process — a hidden per-point variable that decides “is this point an inlier or an outlier?” The IRLS weight w_i is that outlier process, softened into a continuous value between 0 and 1. So down-weighting (M-estimators) and labeling inliers/outliers (RANSAC) are two views of the same underlying problem: jointly estimate the model and which data to trust.

Where this lives in robotics & vision

Robust estimation isn't a side-topic — it is load-bearing infrastructure:

SLAM & bundle adjustment back-ends. The factor-graph optimizers (Ceres, g2o, GTSAM) wrap every measurement factor in a robust kernel — Huber or Cauchy — so a single bad data association or wrong loop closure can't corrupt the whole map. This is literally the IRLS weight applied to each factor's residual.

Switchable Constraints & Dynamic Covariance Scaling. For loop closures specifically, these methods (Sünderhauf, Agarwal) add a per-constraint switch variable the optimizer can turn off — the outlier process made explicit, letting the back-end reject false loop closures during optimization.

Graduated Non-Convexity (GNC). The modern way to use redescending estimators safely (Yang, Carlone, in TEASER++ for point-cloud registration): start with a convex (near-L2) surrogate, solve, then gradually morph the loss toward the non-convex redescending shape — getting Tukey-level robustness without needing a good initial guess. It's RANSAC-free robustness with optimality guarantees.

Feature matching & geometry. Every time you estimate a homography, fundamental matrix, or essential matrix from noisy feature matches (think image stitching, visual odometry front-ends), RANSAC is what separates the true correspondences from the mismatches before the geometry is computed.

The unifying view: robust estimation = jointly solving “what's the model?” and “which data do I believe?” RANSAC answers the second question by voting, M-estimators by soft weighting, switchable constraints by explicit on/off switches, GNC by annealing the loss shape. All four are fighting the same fight: don't let a few liars rewrite the truth.

Why can no regression estimator have a breakdown point above 50%?

Because computers can't handle more than 50% of data Because RANSAC always samples exactly half the points Because once outliers are the majority, they are the consensus — there's no signal left to tell which half is the truth

Chapter 9: Cheat Sheet & Connections

Everything in one place. First, the loss / influence / weight zoo — the three columns you actually need when coding an M-estimator (the weight column is what goes into your IRLS loop):

Name	Loss ρ(r)	Influence ψ(r)	IRLS weight w(r)=ψ/r	Type
L2	½r²	r	1	not robust
L1	\|r\|	sign(r)	1/\|r\|	robust, non-smooth
Huber	½r² or δ(\|r\|−½δ)	clamp(r,±δ)	min(1, δ/\|r\|)	convex, monotone
Cauchy	(c²/2)ln(1+(r/c)²)	r/(1+(r/c)²)	1/(1+(r/c)²)	redescending
Tukey	cutoff at c (const beyond)	0 beyond c	(1−(r/c)²)² or 0	hard redescending

The decision guide

Situation	Reach for
Clean data, Gaussian noise, no outliers	Plain least squares (most efficient)
A few scattered outliers, good initialization	Huber via IRLS (convex, safe default)
Outliers up to ~40–50%, or no initialization	RANSAC to find inliers, then Huber to refine
Need maximum robustness, decent init available	Redescending (Tukey/Cauchy), or GNC to anneal into it
Unknown noise scale, can't set a threshold	MAGSAC++ (RANSAC), or robust scale via MAD for Huber
SLAM loop closures that might be wrong	Switchable constraints / DCS in the factor-graph back-end

The three things to remember

1. Squaring is the sin. L2's unbounded influence is the whole problem; every robust method bounds or kills the influence of far points.

2. Two cures, often combined. Down-weight (M-estimators / IRLS — soft, smooth, needs init) or vote out (RANSAC — combinatorial, no init, handles heavy outliers). Real pipelines do RANSAC then robust refinement.

3. It's all “which data do I trust?” The IRLS weight, the RANSAC inlier flag, and the switchable-constraint switch are the same hidden variable — the Black–Rangarajan outlier process — in different clothes.

Where to go next

Robust estimation is the immune system of these systems — follow it into them:

Factor Graphs — where robust kernels wrap every measurement factor in the SLAM back-end. This is robust estimation at scale.
Kalman Filter — the L2-optimal recursive estimator; robust variants exist for exactly the outlier problem we attacked here.
Classical SLAM and Classical VIO — front-ends lean on RANSAC for feature-match geometry; back-ends lean on robust kernels.
Modern SLAM — switchable constraints, DCS, and GNC live here.
Bayesian Estimation — robust losses are negative-log-likelihoods of heavy-tailed noise models; the probabilistic view.

Closing thought (John Tukey, who started this field): “The data may not contain the answer. ... but it certainly may contain a few wrong answers loudly insisting they're right.” Robust estimation is the discipline of listening to the quiet majority over the loud few.

What is the single best general-purpose pipeline for fitting a model to data with a heavy, unknown fraction of outliers?

Use least squares but collect more data Use Tukey from a random initialization RANSAC to find the inlier set, then a Huber/least-squares refit on that consensus