One bad measurement can wreck a least-squares fit. Learn the two great cures — down-weight the outliers (M-estimators, Huber, IRLS) and vote them out (RANSAC) — the machinery behind every SLAM back-end, structure-from-motion pipeline, and sensor-fusion stack that survives the real world.
You are calibrating a robot's wheel encoder against a ground-truth track. You collect five clean readings that fall almost perfectly on a straight line. Then a sixth reading comes in — the wheel slipped on a wet tile, and the sensor reports a value wildly off. You feed all six points into ordinary least squares, the workhorse line-fitter. What happens?
The line lurches. Not a little — a lot. That single corrupted point grabs the fit and drags it toward itself, ruining the estimate for all five good points. This is the central scandal of least squares: it has no defense against outliers. A measurement infinitely far away has infinite pull.
Drag the bad point in the simulation below and watch the orange least-squares line chase it, while the five honest points sit there ignored.
Five points lie on the true line y = x (teal). Move the slider to drag the sixth point up. The orange least-squares fit tilts to chase it.
Why is the pull so violent? Because least squares minimizes the sum of squared residuals. A point that is 10 units off contributes 100 to the cost. The optimizer would rather tilt the whole line — nudging five good residuals up to 2 or 3 each — than leave that one 100-unit penalty sitting there. Squaring makes far points scream, and the fit caves to the loudest voice in the room.
Take the five clean points (0,0), (1,1), (2,2), (3,3), (4,4) — perfectly on y = x. The least-squares slope is exactly 1, intercept 0. Now add a sixth point at (2, 10): same x as the middle, but y ten units too high.
The slope of the best-fit line is the covariance of x and y divided by the variance of x. Let's compute it with the outlier included. The means become x̄ = (0+1+2+3+4+2)/6 = 12/6 = 2.0, and ȳ = (0+1+2+3+4+10)/6 = 20/6 ≈ 3.33.
The denominator (variance of x, unnormalized) is: (−2)² + (−1)² + 0² + 1² + 2² + 0² = 4+1+0+1+4+0 = 10. The numerator, summing (xi−2)(yi−3.33): for the five clean points it is (−2)(−3.33)+(−1)(−2.33)+(0)(−1.33)+(1)(−0.33)+(2)(0.67) = 6.67+2.33+0−0.33+1.33 = 10.0; the outlier (2,10) adds (0)(6.67) = 0. So numerator = 10.0.
Slope a = 10.0 / 10 = 1.00, and intercept b = ȳ − a·x̄ = 3.33 − 1.00×2 = 1.33. The line went from y = x to y = x + 1.33. One point lifted the whole line off the floor by 1.33 units, so the fit now misses all five honest points, even though the outlier sat at the center x, where it has no leverage on the slope at all. Put the same outlier at x = 4 and the damage is worse: the slope itself tilts to 1.75 and the fit becomes y = 1.75x − 0.75.
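The arithmetic above is easy to check numerically. The sketch below is a plain-Python closed-form line fit (the helper name `fit_line` is ours, not part of any library), run on the five clean points, then with the outlier at the center x, then with the same outlier moved to the edge.

```python
def fit_line(points):
    """Ordinary least-squares line y = a*x + b via the closed form."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)             # unnormalized variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)   # unnormalized covariance
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b

clean = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
a_clean, b_clean = fit_line(clean)
a_mid, b_mid = fit_line(clean + [(2, 10)])   # outlier at the center x
a_end, b_end = fit_line(clean + [(4, 10)])   # same outlier at the edge

print(round(a_clean, 2), round(b_clean, 2))  # 1.0 0.0
print(round(a_mid, 2), round(b_mid, 2))      # 1.0 1.33
print(round(a_end, 2), round(b_end, 2))      # 1.75 -0.75
```

A point at the mean x shifts only the intercept; the further it sits from the mean x, the more it also rotates the line.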
There are two grand strategies, and this lesson teaches both. Down-weight the outliers so they lose their grip (M-estimators, the Huber loss, IRLS). Or vote them out: fit on tiny random samples, keep the fit that the most points agree with (RANSAC). By the end you'll know exactly when to reach for each, and how to combine them.
To fix least squares we first need to name its disease precisely. The right diagnostic tool is the influence function: how much does one data point pull on the estimate, as a function of its residual r (how far off it is)?
Here's the key fact. When you minimize a sum of per-point loss terms ρ(r), the optimum is where the derivative is zero — where the sum of ψ(r) = ρ′(r) balances out. That derivative ψ is the influence: it's the “force” each point exerts on the solution. So the shape of ψ tells you everything about robustness.
For ordinary least squares, ρ(r) = ½r², so the influence is ψ(r) = r. It grows without bound. Double the residual, double the pull. A point at r = 1000 pulls a thousand times harder than a point at r = 1. That linear-and-unbounded influence is exactly why a single far outlier can dominate. Toggle the loss in the widget and watch ψ for L2 shoot off the chart.
Top curve: the loss. Bottom curve: the influence (force on the fit). For L2, influence is the diagonal line — unbounded. For L1 (absolute error), influence saturates at ±1 — a far point pulls no harder than a near one.
Contrast this with the L1 loss, ρ(r) = |r|. Its influence is ψ(r) = sign(r): just +1 or −1. A point that is 10 units off and a point that is 10,000 units off pull with exactly the same force. L1 doesn't care how far the outlier is, only which side it's on. That bounded influence is the first taste of robustness.
Suppose we're estimating a single number μ (the “location” of a cluster of points) — the simplest possible fit. The optimum sets Σ ψ(xi − μ) = 0. Under L2, ψ is identity, so Σ(xi − μ) = 0, which solves to μ = mean. The L2 estimate of location is the mean — and we know the mean is wrecked by outliers.
Under L1, ψ = sign, so we need Σ sign(xi − μ) = 0 — equal numbers of points above and below. The L1 estimate of location is the median. And the median shrugs off outliers: drag one point to infinity and the median doesn't move at all. Same data, different loss, wildly different robustness — entirely explained by the shape of ψ.
Concretely, take values {1, 2, 3, 4, 100}. The mean is 110/5 = 22, nowhere near the bulk of the data. The median is 3, dead center of the honest points. Had the fifth value been an honest 5, both would have been 3, so the lone outlier dragged the mean by 19 units and the median by zero. That gap is robustness, and ψ predicted it.
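A small sketch with the standard library makes the same point, and also checks the two balance conditions Σ(xi − μ) = 0 and Σ sign(xi − μ) = 0 directly. The `sign` helper is defined here for illustration.

```python
from statistics import mean, median

def sign(v):
    return (v > 0) - (v < 0)

values = [1, 2, 3, 4, 100]
mu_l2 = mean(values)     # L2 location estimate
mu_l1 = median(values)   # L1 location estimate
print(mu_l2, mu_l1)      # 22 3

# Each estimate zeroes its own influence sum.
print(sum(x - mu_l2 for x in values))        # 0
print(sum(sign(x - mu_l1) for x in values))  # 0

# Push the outlier much further out: the mean follows, the median stays put.
far = [1, 2, 3, 4, 1_000_000]
print(mean(far), median(far))                # 200002 3
```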
An M-estimator (the M is for “maximum-likelihood-type”) is breathtakingly simple in idea: keep the least-squares machinery, but replace the squared loss with a function ρ(r) that grows more slowly in the tails. Instead of minimizing Σ ri², you minimize Σ ρ(ri) for a smarter ρ.
Why is this principled and not just a hack? Because choosing ρ is the same as choosing a noise model. Minimizing Σ r² is maximum likelihood under Gaussian noise — thin tails, so a big residual is “impossible” and the fit strains to avoid it. If instead you believe your noise has heavy tails — that occasional large errors are normal — then −log of that heavy-tailed density gives you a ρ that flattens out, and large residuals stop being treated as catastrophes. Robust losses are just the negative-log-likelihoods of realistic, outlier-tolerant noise distributions.
Here is the family, plotted as loss ρ (how much a residual costs) and influence ψ (how hard it pulls). Click through them and watch the tails.
Top: loss ρ(r). Bottom: influence ψ(r). Watch how the robust losses flatten the cost of large residuals, and how Tukey's influence returns to zero.
Read the bottom (influence) curve like a personality test for each estimator:
| Loss | Influence ψ in the tails | Behavior |
|---|---|---|
| L2 | grows forever | not robust — the baseline |
| L1 | flat at ±1 | bounded; far points pull a constant amount |
| Huber | flat at ±δ | L2 in the middle, L1 in the tails — monotone |
| Cauchy | decays toward 0 | far points pull less and less — redescending |
| Tukey | exactly 0 past a cutoff | extreme points pull nothing — hard redescending |
Two camps emerge. Monotone influence (Huber): the pull never decreases — it caps, but a far outlier still tugs by the full amount δ. This makes the optimization convex, so there's a unique solution and you can't get stuck. Redescending influence (Cauchy, Tukey, Geman-McClure): the pull peaks and then falls back toward zero, so a sufficiently wild point is essentially erased. This is more robust but non-convex — you can land in a bad local minimum if your starting guess is poor.
The Cauchy (a.k.a. Lorentzian) loss is ρ(r) = (c²/2)·ln(1 + (r/c)²), with scale c. Its influence is ψ(r) = r / (1 + (r/c)²). Take c = 1. A residual r = 1 (a typical inlier) gives influence 1/(1+1) = 0.5. A residual r = 10 (an outlier) gives 10/(1+100) = 10/101 ≈ 0.099. A residual r = 100 gives 100/(1+10000) ≈ 0.01.
Read that progression: as the point goes from off to way-off to absurd, its pull goes 0.5 → 0.099 → 0.01 — shrinking. The further out it lies, the more the estimator decides it must be garbage and the less it listens. Compare L2, where those same residuals pull 1, 10, 100. Cauchy doesn't just cap the outlier — it tunes it out.
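The progression is quick to reproduce. This sketch evaluates the two influence functions side by side for c = 1; the function names are ours.

```python
def psi_l2(r):
    """Influence of the squared loss: grows without bound."""
    return r

def psi_cauchy(r, c=1.0):
    """Influence of the Cauchy loss: peaks at r = c, then decays toward 0."""
    return r / (1.0 + (r / c) ** 2)

for r in (1.0, 10.0, 100.0):
    print(r, psi_l2(r), round(psi_cauchy(r), 3))
# 1.0 1.0 0.5
# 10.0 10.0 0.099
# 100.0 100.0 0.01
```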
Huber's loss is probably the most-used robust loss in engineering: it ships as a standard “robust kernel” in Ceres, g2o, and GTSAM. It is the elegant compromise we asked for at the end of Chapter 1: be least squares where the data is good, be absolute error where the data is bad, and stitch the two together smoothly.
It is defined piecewise around a threshold δ: ρ(r) = ½r² for |r| ≤ δ, and ρ(r) = δ(|r| − ½δ) for |r| > δ.
Inside the band |r| ≤ δ, it's the familiar parabola ½r² — full least-squares efficiency on the clean data. Outside, it switches to a straight line of slope δ — the cost grows only linearly, so a far outlier can't blow up the total. Move the slider and watch the quadratic bowl hand off to two straight arms exactly at ±δ.
The orange curve is Huber; the faint dashed parabola is pure L2. Inside ±δ they coincide; outside, Huber peels off into straight lines (slope δ). Vertical guides mark ±δ.
The piecewise definition isn't arbitrary — the constants are chosen to make the loss continuous and continuously differentiable (C¹) at the junction r = δ. Optimizers hate kinks, so this matters.
Check the value at r = δ. From the left (quadratic): ½δ². From the right (linear): δ(δ − ½δ) = δ·½δ = ½δ². They match — no jump. Now check the derivative (the influence ψ). From the left: ψ = r, which at r = δ equals δ. From the right: ψ = δ (constant). They match too — no kink in the slope. That's why the formula carries the −½δ offset: it's the exact bookkeeping that glues the two pieces together smoothly.
So the influence of Huber is ψ(r) = r clamped to [−δ, +δ]: it rises linearly like L2 until r hits δ, then flattens like L1. That single clamp is the whole robustness mechanism.
Let δ = 1.5 and consider an outlier with residual r = 8. Under L2, its cost is ½(8)² = 32. Under Huber, since 8 > 1.5, the cost is δ(|r| − ½δ) = 1.5×(8 − 0.75) = 1.5×7.25 = 10.875. And the force it exerts? Under L2, ψ = 8. Under Huber, ψ = δ = 1.5. The outlier's pull is slashed from 8 to 1.5 — it still nudges the fit (Huber is monotone, not redescending) but it can no longer dominate the five honest points each pulling around 1.
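As a minimal sketch, here are the Huber loss and its influence written out as scalar functions (names are ours), reproducing the numbers for δ = 1.5 and r = 8 and checking that the two pieces meet at r = δ.

```python
def huber_loss(r, delta):
    """Quadratic inside the band |r| <= delta, linear outside."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

def huber_psi(r, delta):
    """Influence: r clamped to [-delta, +delta]."""
    return max(-delta, min(delta, r))

delta = 1.5
print(0.5 * 8.0 ** 2)            # 32.0   (L2 cost of the outlier)
print(huber_loss(8.0, delta))    # 10.875 (Huber cost of the same outlier)
print(huber_psi(8.0, delta))     # 1.5    (its pull, capped at delta)
print(huber_loss(delta, delta))  # 1.125  (= 0.5 * delta**2, the junction value)
```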
δ is the boundary between “this is just noise” and “this is an outlier,” so it must be in the units of your noise. The standard choice is δ = 1.345·σ, where σ is the standard deviation of the inlier noise. That magic constant 1.345 is tuned so that on purely Gaussian data Huber retains 95% of the efficiency of least squares — you pay only a 5% statistical tax for the insurance against outliers, while points beyond ~1.3σ get gently capped.
In practice σ is unknown, so you estimate it robustly (you can't use the ordinary standard deviation, because it's wrecked by the very outliers you're hunting). The go-to is the MAD, the median absolute deviation: σ̂ ≈ 1.4826·median(|ri − median(r)|). The factor 1.4826 makes the MAD agree with σ on Gaussian data, and because it is a median of deviations from a median, it shrugs off outliers, giving a clean scale to set δ against.
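A sketch of that recipe using only the standard library. The residual values are made up for illustration, and `mad_sigma` is our own helper name.

```python
from statistics import median, stdev

def mad_sigma(residuals):
    """Robust scale estimate: 1.4826 * median absolute deviation."""
    med = median(residuals)
    return 1.4826 * median(abs(r - med) for r in residuals)

# Hypothetical residuals: five small ones and one gross outlier.
residuals = [-0.3, 0.1, -0.1, 0.2, 0.0, 6.7]

sigma_hat = mad_sigma(residuals)
delta = 1.345 * sigma_hat            # Huber threshold in the units of the noise

print(round(stdev(residuals), 2))    # 2.75, inflated by the outlier
print(round(sigma_hat, 3))           # 0.222, reflects the inlier spread
print(round(delta, 3))               # 0.299
```

The ordinary standard deviation is more than ten times larger than the MAD-based scale here, which would make δ so wide that the outlier counted as an inlier.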
We have a robust loss. But minimizing Σ ρ(ri) isn't a plain linear least-squares problem any more — ρ is nonlinear and (for Huber) piecewise. How do we actually find the fit? The beautiful answer is Iteratively Reweighted Least Squares (IRLS): turn the hard robust problem into a sequence of easy weighted least-squares problems.
The trick comes straight from the influence function. At the optimum Σ ψ(ri)·(∂ri/∂θ) = 0. Rewrite ψ(r) = w(r)·r by defining a weight w(r) = ψ(r)/r. Then the optimality condition becomes Σ w(ri)·ri·(∂ri/∂θ) = 0 — which is exactly the normal equation for weighted least squares with weights wi. The only snag: the weights depend on the residuals, which depend on the fit. So we iterate.
For Huber, the weight is gorgeously simple: w(r) = 1 if |r| ≤ δ (inliers keep full weight), and w(r) = δ/|r| if |r| > δ (outliers get weight shrinking as 1/|r|). Double an outlier's residual and its weight halves. The fit listens less and less to wild points.
Same five inliers + one outlier. Step IRLS and watch: dot size = current weight, the orange fit pulls back toward the true line, and the outlier's weight collapses. The bars show each point's weight.
Start from the wrecked L2 fit of Chapter 0: y = x + 1.33, with the outlier at (2, 10) and δ = 1.5. Compute residuals ri = yi − (xi + 1.33):
| point | prediction | residual r | |r| ≤ δ? | weight w = min(1, δ/|r|) |
|---|---|---|---|---|
| (0,0) | 1.33 | −1.33 | yes | 1.00 |
| (1,1) | 2.33 | −1.33 | yes | 1.00 |
| (2,2) | 3.33 | −1.33 | yes | 1.00 |
| (3,3) | 4.33 | −1.33 | yes | 1.00 |
| (4,4) | 5.33 | −1.33 | yes | 1.00 |
| (2,10) | 3.33 | +6.67 | no | 1.5/6.67 = 0.22 |
The outlier's weight has already crashed to 0.22: it now counts for about a fifth of a normal point, while all five inliers (|r| = 1.33 < δ) keep full weight. Re-solve weighted least squares with these weights and the intercept drops from 1.33 to about 0.34, so the line swings back toward y = x. The new residuals shrink the outlier's weight a little further (to about 0.19), and within a handful of iterations the fit settles at y = x + 0.30. It does not land exactly on y = x: Huber is monotone, so the outlier keeps pulling with its capped force δ = 1.5, balanced by five inliers each pulling back with 0.30. That's IRLS: each round, the bad point loses more of its say. To silence it completely you need a redescending loss.
```python
import numpy as np

def irls_line(x, y, delta=1.5, iters=10):
    """Fit y = a*x + b with Huber IRLS; return (theta, weights)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.vstack([x, np.ones_like(x)]).T        # design matrix [x, 1]
    w = np.ones_like(y)                          # start: trust everyone (= plain L2)
    for _ in range(iters):
        W = np.sqrt(w)                           # fold weights into rows
        theta, *_ = np.linalg.lstsq(A * W[:, None], y * W, rcond=None)
        r = y - A @ theta                        # residuals under current fit
        absr = np.abs(r)
        w = np.where(absr <= delta, 1.0, delta / np.maximum(absr, 1e-9))  # Huber weights
    return theta, w                              # theta = [slope, intercept], plus final weights

theta, w = irls_line([0, 1, 2, 3, 4, 2], [0, 1, 2, 3, 4, 10])
# slope/intercept ≈ (1.0, 0.30); the outlier ends with weight ≈ 0.19
```
That's the entire robust line-fitter — a lstsq call wrapped in a loop that recomputes weights. Swap the weight formula and you get Cauchy (w = 1/(1+(r/c)²)) or Tukey (w = (1−(r/c)²)² inside the cutoff, 0 outside). The skeleton never changes.
A note on the division: w(r) = ψ(r)/r looks singular at r = 0. In code the denominator is guarded (np.maximum(absr, 1e-9) above), and for Huber the inlier branch just sets w = 1 directly, so the singularity never bites. The weight is well-defined in the limit r→0 (it tends to 1 for Huber).
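To make the "swap the weight formula" idea concrete, here is a sketch of the three weight functions as plain scalar functions. The defaults (δ = 1.5, Cauchy c = 1, Tukey c = 4) are illustrative choices, not recommended tuning constants; in practice they should be scaled to the inlier noise.

```python
def w_huber(r, delta=1.5):
    """Full weight inside the band, delta/|r| outside."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def w_cauchy(r, c=1.0):
    """Smoothly decaying weight; never exactly zero."""
    return 1.0 / (1.0 + (r / c) ** 2)

def w_tukey(r, c=4.0):
    """Biweight: decays to exactly zero at |r| = c and stays there."""
    if abs(r) >= c:
        return 0.0
    return (1.0 - (r / c) ** 2) ** 2

for r in (0.0, 1.0, 8.0):
    print(r, w_huber(r), round(w_cauchy(r), 4), round(w_tukey(r), 4))
# 0.0 1.0 1.0 1.0
# 1.0 1.0 0.5 0.8789
# 8.0 0.1875 0.0154 0.0
```

For the residual of 8, Huber still gives the point almost a fifth of a vote, Cauchy about 1.5%, and Tukey none at all.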
M-estimators down-weight outliers, but they still start from a fit over all the data — and if the outliers are numerous or clustered, even Huber's initial fit can be hopeless, and redescending losses can lock onto the wrong thing. When outliers are a large fraction of the data, we need a completely different idea. That idea is RANSAC — RANdom SAmple Consensus (Fischler & Bolles, 1981) — and it is the single most-used robust algorithm in computer vision.
The insight is almost cheeky. Instead of fitting all the data and hoping, fit a tiny random subset — the minimum needed to define the model — and then ask the rest of the data to vote: how many points agree with this fit? Repeat many times, and keep the fit with the most votes. Outliers, being random, rarely agree with each other, so a fit born from outliers gets few votes. A fit born from two true inliers gets a landslide.
Step the simulation: each click draws a fresh random pair (highlighted), the candidate line through them, and the band of width ±t. Points inside the band are inliers and counted. The best-so-far line is kept. Watch how a pair of true inliers suddenly racks up a huge count, while a pair involving an outlier scores poorly.
Each step samples 2 random points (orange ring), fits a candidate line, and counts inliers inside the dashed band. The teal line is the best consensus found so far.
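The loop is short enough to sketch in full. This is a minimal illustration, not a production implementation: it measures vertical residuals rather than perpendicular distance, uses a fixed iteration count, and the dataset, threshold and seed are made up.

```python
import random

def ransac_line(points, t=0.5, iters=100, seed=0):
    """Return ((slope, intercept), inliers) for the best-supported 2-point line."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal sample: 2 points
        if x1 == x2:
            continue                                  # degenerate for y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Vote: which points fall inside the band of half-width t?
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= t]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points exactly on y = x, plus four gross outliers.
data = [(x, x) for x in range(10)] + [(1, 9), (3, -5), (6, 14), (8, 0.5)]
model, inliers = ransac_line(data)
print(model, len(inliers))   # (1.0, 0.0) 10
```

Any pair of true inliers reproduces y = x exactly and collects all ten votes, while every line involving an outlier gathers only a handful.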
RANSAC is a probabilistic algorithm: it only succeeds if at least one of its random samples is all-inliers. So the central question is: how many iterations N guarantee that, with high probability, we drew at least one clean sample?
Let w = the fraction of points that are inliers, and s = the sample size (s = 2 for a line). The probability that a single random sample is all inliers is w^s. The probability it is not all-inliers is 1 − w^s. The probability that all N samples fail is (1 − w^s)^N. We want that failure probability below 1 − p, where p is our desired success confidence. Solving (1 − w^s)^N ≤ 1 − p for N gives N ≥ ln(1 − p) / ln(1 − w^s).
Say half the data are outliers, so inlier fraction w = 0.5, we're fitting a line (s = 2), and we want p = 99% confidence. The chance a random pair is both inliers is w² = 0.25. So:
N = ln(1 − 0.99) / ln(1 − 0.25) = ln(0.01) / ln(0.75) = (−4.605) / (−0.2877) ≈ 16.01. Rounding up, just 17 samples give 99% confidence of hitting one clean pair, which is astonishingly cheap. That's the magic: you don't search all pairs, you just need one good one, and randomness finds it fast.
But watch the equation bite as outliers grow. If w = 0.2 (80% outliers!) and we fit a homography (s = 4), then w^s = 0.2⁴ = 0.0016, and N = ln(0.01)/ln(0.9984) ≈ 2,876. The sample size s is in the exponent, so the cost explodes with both the outlier ratio and model complexity. This is why minimal solvers (smallest possible s) are prized.
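The iteration formula translates directly into a few lines. A sketch, with `ransac_iterations` as our own helper name and guards for the two edge cases where the logarithm breaks down:

```python
import math

def ransac_iterations(w, s, p=0.99):
    """Smallest N with 1 - (1 - w**s)**N >= p."""
    clean = w ** s                       # chance one sample is all inliers
    if clean <= 0.0:
        return math.inf                  # no clean sample can ever be drawn
    if clean >= 1.0:
        return 1                         # every sample is clean
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - clean))

print(ransac_iterations(0.5, 2))   # 17    (line, 50% outliers)
print(ransac_iterations(0.2, 4))   # 2876  (homography, 80% outliers)
print(ransac_iterations(0.5, 4))   # 72    (same outlier rate as the line, bigger sample)
```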
Vanilla RANSAC has three knobs that decide whether it works on real data: the inlier threshold, the iteration budget, and how you score hypotheses. Get them wrong and RANSAC silently returns garbage. Here's the practitioner's playbook, plus the modern variants that fix vanilla's weaknesses.
The inlier threshold t says how close a point must be to count as agreeing with the model. Too tight, and genuine inliers (with normal noise) get rejected, starving the consensus and making good hypotheses look bad. Too loose, and outliers sneak in, so a wrong model can win. The principled choice ties t to the inlier noise: if measurement noise is Gaussian with std σ, a common rule is t ≈ 2–3σ. For a point-to-line distance d, the ratio d²/σ² follows a chi-squared law with one degree of freedom, whose 95% quantile is 3.84, so t = √(3.84)·σ ≈ 1.96σ captures 95% of inliers.
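For the one-dimensional case (a signed point-to-line distance with Gaussian noise), the threshold can be computed from the normal quantile alone, since a chi-squared variable with one degree of freedom is a squared standard normal. This sketch assumes exactly that setting; higher-dimensional residuals need the chi-squared quantile for the matching degrees of freedom.

```python
from statistics import NormalDist

def inlier_threshold(sigma, coverage=0.95):
    """Distance t such that |d| <= t holds for `coverage` of Gaussian inliers (1-D residual)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)   # two-sided normal quantile
    return z * sigma

t95 = inlier_threshold(1.0)
print(round(t95, 2))        # 1.96
print(round(t95 ** 2, 2))   # 3.84, the 95% quantile of chi-squared with 1 dof
print(round(inlier_threshold(1.0, coverage=0.99), 2))  # 2.58
```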
Slide the inlier threshold and watch the consensus band widen. Too narrow rejects true inliers (band misses noisy points); too wide swallows outliers. The count shows how many points fall inside.
You usually don't know the inlier fraction w in advance — so you can't precompute N. The fix: estimate w on the fly. Start with a pessimistic w, and every time a hypothesis finds a bigger consensus set, update w = (best inlier count)/(total) and recompute N from the core equation. As you discover the data is cleaner than feared, the required N drops and you stop early. This adaptive stopping is in every real implementation.
Worked: you budgeted for w = 0.3 (N ≈ 49 for a line at p = 0.99). On iteration 5 a hypothesis captures 70% of the points, so you revise w = 0.7 and recompute N = ln(0.01)/ln(1 − 0.49) = ln(0.01)/ln(0.51) ≈ 6.8, which rounds up to 7. You have already drawn 5 samples, so two more and you can stop. Adaptive RANSAC just turned a 49-iteration budget into 7.
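The stopping rule can be sketched without any geometry by feeding it the consensus fraction each hypothesis achieved. The sequence of fractions below is invented to mirror the worked example; the function names are ours.

```python
import math

def required_iterations(w, s=2, p=0.99):
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))

def adaptive_stop(consensus_fractions, w_init=0.3, s=2, p=0.99):
    """Return (iterations actually run, final budget N)."""
    best_w = w_init                                  # pessimistic starting guess
    budget = required_iterations(best_w, s, p)
    done = 0
    for frac in consensus_fractions:
        done += 1
        if frac > best_w:                            # bigger consensus set found
            best_w = frac
            budget = required_iterations(best_w, s, p)
        if done >= budget:                           # enough samples for confidence p
            break
    return done, budget

# Hypothetical run: the fifth hypothesis captures 70% of the points.
fractions = [0.20, 0.25, 0.30, 0.30, 0.70, 0.40, 0.50, 0.30, 0.20, 0.60]
print(required_iterations(0.3))   # 49
print(adaptive_stop(fractions))   # (7, 7)
```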
| Variant | The fix it adds |
|---|---|
| MSAC | Instead of counting inliers (0/1), score each by its truncated residual, so inliers that fit better score better. Same cost per hypothesis, and usually a better model. |
| LO-RANSAC | When a new best is found, run a quick local optimization (an inner least-squares / IRLS on the consensus) before continuing. Dramatically tightens the final fit and reduces needed iterations. |
| PROSAC | Sample guided by a quality prior (e.g. feature-match score) instead of uniformly — try the most promising correspondences first. Finds a good sample far sooner. |
| MAGSAC++ | Replaces the hard threshold t with marginalization over a range of noise scales (you supply only an upper bound), so it stays robust when you can't pin down σ. Available in OpenCV through the USAC_MAGSAC method flag. |
| USAC / GC-RANSAC | A unified framework / graph-cut spatial coherence — modern, fast, production-grade pipelines. |
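The difference between RANSAC's count and MSAC's truncated cost is easiest to see on two hypotheses that tie on inlier count. A sketch, with made-up residuals and our own function names:

```python
def ransac_score(residuals, t):
    """Vanilla RANSAC: number of points inside the band (higher is better)."""
    return sum(1 for r in residuals if abs(r) <= t)

def msac_cost(residuals, t):
    """MSAC: squared residual, truncated at t**2 (lower is better)."""
    return sum(min(r * r, t * t) for r in residuals)

t = 1.0
tight = [0.1, -0.1, 0.2, 5.0]   # three inliers hugging the model, one outlier
loose = [0.9, -0.8, 0.9, 5.0]   # three inliers barely inside the band, one outlier

print(ransac_score(tight, t), ransac_score(loose, t))                 # 3 3
print(round(msac_cost(tight, t), 2), round(msac_cost(loose, t), 2))   # 1.06 3.26
```

Counting cannot separate the two hypotheses; the truncated cost prefers the one whose inliers actually fit.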
RANSAC assumes a minimal sample defines the model uniquely. Sometimes it doesn't: pick 2 coincident points and the line is undefined; for a fundamental matrix, sample points all lying on one plane and the solution is degenerate (this is what DEGENSAC fixes). Real pipelines add a quick degeneracy check to discard samples that don't define a valid model before scoring them.
Now put it all together. Below is a dataset you control: a cloud of inliers around a true line, plus a tunable fraction of outliers and a tunable noise level. Four fits are computed live on the same data — ordinary least squares, L1, robust Huber (IRLS), and RANSAC — so you can watch exactly when each one breaks.
Crank the outlier fraction and watch the blue L2 line peel away from the truth while Huber and RANSAC hold. Add noise to see RANSAC's threshold get tested. Regenerate for a fresh draw.
Things to try, and what you'll see:
Push outliers from 0 to 0.3. At zero outliers all four lines stack on the truth — L2 is even slightly best (it's most efficient on clean Gaussian data). As outliers climb, the blue L2 line is the first to bend away; the error readout for L2 shoots up while Huber and RANSAC stay flat.
Push outliers past 0.4. Now even L1 and Huber start to wobble — monotone M-estimators have a breakdown point below 50% when outliers gang up. RANSAC, which never trusted the bulk fit in the first place, keeps nailing the true line. This is the regime where RANSAC is irreplaceable.
Crank the noise with moderate outliers. Watch RANSAC's fixed threshold get stressed: when inlier noise approaches the threshold, RANSAC starts misclassifying noisy inliers as outliers and its fit gets jittery — the exact failure mode MAGSAC was built to cure. Huber, which uses a soft down-weight rather than a hard in/out cut, degrades more gracefully here.
No quiz here — the simulation is the test. If you can predict which line breaks first as you drag the sliders, you understand robust estimation.
We keep saying “breaks down” — let's make it precise. The breakdown point of an estimator is the largest fraction of arbitrarily-bad outliers it can tolerate before the estimate can be carried off to infinity. It's the single most important robustness number.
| Estimator | Breakdown point | Meaning |
|---|---|---|
| Mean / least squares | 0% | One bad point can ruin it |
| Huber M-estimator | 0% (asymptotically)* | Bounded influence but a flood of outliers on one side still drifts it |
| Median / L1 location | 50% | Needs a majority of bad points to break |
| RANSAC | ~50%+ | Works as long as a clean minimal sample is findable |
| Least Median of Squares | 50% | The theoretical max for a regression estimator |
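*Huber's influence is bounded (great against scattered outliers), but its formal breakdown point for regression is low because many outliers piled on one side bias it. Redescending estimators do better in practice but lose convexity. LMedS reaches the 50% ceiling: against arbitrary (adversarial) outliers no regression estimator can beat 50%, because at half-and-half the outliers can mimic a second, equally plausible model. RANSAC's “50%+” is not a contradiction: it can survive higher outlier ratios only when the outliers are unstructured and do not agree with one another, and only if you can afford the iterations.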
A cluster of points sits at the true value. Slide the outlier fraction up (outliers fly off to the right). Watch the mean get dragged immediately, the median hold firm until ~50%, then jump.
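The same experiment in a few lines of standard-library Python: 100 points at the true value 0, with a growing fraction replaced by a far-away value. The outlier value 1000 and the helper name are arbitrary choices for illustration.

```python
from statistics import mean, median

def contaminated(n, fraction, outlier_value=1000.0):
    """n points at 0.0, with round(n * fraction) of them moved to outlier_value."""
    k = round(n * fraction)
    return [0.0] * (n - k) + [outlier_value] * k

for frac in (0.0, 0.1, 0.3, 0.49, 0.51):
    data = contaminated(100, frac)
    print(frac, mean(data), median(data))
# 0.0 0.0 0.0
# 0.1 100.0 0.0
# 0.3 300.0 0.0
# 0.49 490.0 0.0
# 0.51 510.0 1000.0
```

The mean moves in proportion to the contamination from the first bad point on; the median does not move at all until the outliers become the majority, and then it jumps the whole way.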
There's a beautiful theorem tying the two halves of this lesson together. Black & Rangarajan (1996) showed that a broad class of robust M-estimators can be rewritten as least squares with an extra outlier process: a hidden per-point variable that decides “is this point an inlier or an outlier?” The IRLS weight wi is that outlier process, softened into a continuous value between 0 and 1. So down-weighting (M-estimators) and labeling inliers/outliers (RANSAC) are two views of the same underlying problem: jointly estimate the model and which data to trust.
Robust estimation isn't a side-topic — it is load-bearing infrastructure:
SLAM & bundle adjustment back-ends. The factor-graph optimizers (Ceres, g2o, GTSAM) wrap every measurement factor in a robust kernel — Huber or Cauchy — so a single bad data association or wrong loop closure can't corrupt the whole map. This is literally the IRLS weight applied to each factor's residual.
Switchable Constraints & Dynamic Covariance Scaling. For loop closures specifically, these methods (Sünderhauf, Agarwal) add a per-constraint switch variable the optimizer can turn off — the outlier process made explicit, letting the back-end reject false loop closures during optimization.
Graduated Non-Convexity (GNC). The modern way to use redescending estimators safely (Yang, Carlone, in TEASER++ for point-cloud registration): start with a convex (near-L2) surrogate, solve, then gradually morph the loss toward the non-convex redescending shape, getting Tukey-level robustness without needing a good initial guess. It's RANSAC-free robustness, and TEASER++ pairs it with optimality certificates for the registration problem.
Feature matching & geometry. Every time you estimate a homography, fundamental matrix, or essential matrix from noisy feature matches (think image stitching, visual odometry front-ends), RANSAC is what separates the true correspondences from the mismatches before the geometry is computed.
Everything in one place. First, the loss / influence / weight zoo — the three columns you actually need when coding an M-estimator (the weight column is what goes into your IRLS loop):
| Name | Loss ρ(r) | Influence ψ(r) | IRLS weight w(r)=ψ/r | Type |
|---|---|---|---|---|
| L2 | ½r² | r | 1 | not robust |
| L1 | |r| | sign(r) | 1/|r| | robust, non-smooth |
| Huber | ½r² or δ(|r|−½δ) | clamp(r,±δ) | min(1, δ/|r|) | convex, monotone |
| Cauchy | (c²/2)ln(1+(r/c)²) | r/(1+(r/c)²) | 1/(1+(r/c)²) | redescending |
| Tukey | (c²/6)[1−(1−(r/c)²)³] for |r| ≤ c, else c²/6 | r(1−(r/c)²)² for |r| ≤ c, else 0 | (1−(r/c)²)² for |r| ≤ c, else 0 | hard redescending |
Second, when to reach for which tool:
| Situation | Reach for |
|---|---|
| Clean data, Gaussian noise, no outliers | Plain least squares (most efficient) |
| A few scattered outliers, good initialization | Huber via IRLS (convex, safe default) |
| Outliers up to ~40–50%, or no initialization | RANSAC to find inliers, then Huber to refine |
| Need maximum robustness, decent init available | Redescending (Tukey/Cauchy), or GNC to anneal into it |
| Unknown noise scale, can't set a threshold | MAGSAC++ (RANSAC), or robust scale via MAD for Huber |
| SLAM loop closures that might be wrong | Switchable constraints / DCS in the factor-graph back-end |
1. Squaring is the sin. L2's unbounded influence is the whole problem; every robust method bounds or kills the influence of far points.
2. Two cures, often combined. Down-weight (M-estimators / IRLS — soft, smooth, needs init) or vote out (RANSAC — combinatorial, no init, handles heavy outliers). Real pipelines do RANSAC then robust refinement.
3. It's all “which data do I trust?” The IRLS weight, the RANSAC inlier flag, and the switchable-constraint switch are the same hidden variable — the Black–Rangarajan outlier process — in different clothes.
Robust estimation is the immune system of these systems — follow it into them: