Szeliski, Chapter 4

Model Fitting and Optimization

How to fit models to noisy data: interpolation, regularization, variational methods, and Markov random fields.

Prerequisites: Chapters 2-3 + basic linear algebra & calculus.
10
Chapters
6+
Simulations
0
Assumed CV Knowledge

Chapter 0: Why Model Fitting?

You have data — a set of measured points, pixel values, or correspondences — and you want to find a model that explains them. Maybe you want to fit a line to noisy edge points, a smooth surface to sparse depth measurements, or a segmentation map to an image. This is the model fitting problem.

The challenge is that real data is noisy and incomplete. Measured points deviate from the true model. Some measurements are outright wrong (outliers). The model itself might not perfectly describe reality. You need principled ways to handle all of this.

This chapter provides the mathematical machinery: least squares, robust estimation, regularization, and energy minimization. These tools reappear in every chapter that follows.

The fundamental tradeoff: A model must fit the data well (low residual error), but not too well (overfitting to noise). The art of model fitting is finding the right balance between data fidelity and model complexity.
Fitting to Noisy Data

Noisy points scattered around a line. Click to see different fitting approaches.

What is the fundamental tradeoff in model fitting?

Chapter 1: Least Squares

The most fundamental fitting technique minimizes the sum of squared differences between the model's predictions and the observed data:

E(p) = ∑i ||yi − f(xi; p)||2

where p are the model parameters, xi are inputs, yi are observed values, and f is the model function. For a linear model (y = ax + b), the solution is closed-form:

p = (ATA)−1 AT y

This is the normal equation. It gives the parameters that minimize squared error in one shot. For nonlinear models, we use iterative methods like Gauss-Newton or Levenberg-Marquardt.

Why squared error? Squaring has nice mathematical properties: it is differentiable everywhere, it penalizes large errors heavily, and under Gaussian noise assumptions, least squares gives the maximum likelihood estimate. But this sensitivity to large errors is also its weakness — a single outlier can ruin the fit.
Least Squares Line Fitting

Click "Add Outlier" to see how a single bad point can dramatically change the fit.

What is the main weakness of least squares fitting?

Chapter 2: Scattered Data Interpolation

Often you have sparse measurements at irregular positions and need to fill in the gaps. This is interpolation: given values at known locations, estimate the value at arbitrary points.

Radial basis functions (RBFs) are a powerful approach. The interpolant is a weighted sum of basis functions centered at the data points:

f(x) = ∑i wi · φ(||x − xi||)

Common basis functions φ:

Interpolation in vision: Sparse depth from feature matches needs to be densified. Known pixel correspondences need smooth extension to the full image. Radial basis functions handle both elegantly. Thin-plate splines are especially popular because they produce the smoothest surface that passes through all data points.
What property makes thin-plate splines popular for interpolation?

Chapter 3: Overfitting and Underfitting

Fit a polynomial of degree 1 (a line) to noisy data, and you might miss important structure (underfitting). Fit a polynomial of degree 20, and it passes through every noisy point but oscillates wildly between them (overfitting).

Polynomial Degree and Overfitting

Increase the polynomial degree. Too low = underfitting. Too high = overfitting.

Degree 1

The solution is model selection: choose the right complexity. Techniques include:

The bias-variance tradeoff: Simple models have high bias (systematic error) but low variance (stable across datasets). Complex models have low bias but high variance (they fit noise). The optimal model balances these — complex enough to capture the signal, simple enough to ignore the noise.
What characterizes an overfitted model?

Chapter 4: Robust Fitting

Least squares fails spectacularly when outliers are present. Robust estimation uses loss functions that grow more slowly than the squared error, limiting the influence of outliers.

Loss FunctionFormulaBehavior
Squared (L2)ρ(r) = r2Outlier-sensitive
Absolute (L1)ρ(r) = |r|Less outlier-sensitive
HuberQuadratic near 0, linear farBest of both
Tukey biweightZero beyond thresholdCompletely ignores outliers

RANSAC (Random Sample Consensus) is an alternative strategy. Instead of modifying the loss, it samples random minimal subsets of the data, fits a model to each, and keeps the one with the most inliers. We will see RANSAC again in Chapter 8 (image stitching).

RANSAC in a nutshell: 1) Pick the minimum number of points needed (2 for a line, 4 for a homography). 2) Fit the model. 3) Count how many other points agree (inliers). 4) Repeat many times. 5) Keep the model with the most inliers. It is conceptually simple but remarkably effective.
Robust Loss Functions

Compare how different loss functions penalize residual errors. Notice how robust losses cap the influence of large residuals.

What is RANSAC's key idea for handling outliers?

Chapter 5: Regularization

To prevent overfitting, we add a regularization term to the objective function that penalizes model complexity:

E(p) = Edata(p) + λ · Esmooth(p)

The parameter λ controls the balance between data fidelity and smoothness. Common regularizers:

L1 vs L2 intuition: L2 regularization spreads the penalty across all parameters (everyone shrinks a little). L1 regularization concentrates it (some parameters go to zero, others stay large). This makes L1 a natural sparsity-inducing prior — it automatically selects the most important features.
Regularization Strength

Increase λ to see how stronger regularization smooths the fit. Too little = overfitting; too much = underfitting.

λ 0
What does total variation regularization do to an image?

Chapter 6: Variational Methods

Variational methods formulate vision problems as energy minimization. The idea: define an energy function that is low when the solution is good and high when it is bad, then find the minimum.

E(f) = ∫ [D(f, g) + λ · S(∇f)] dx

where D is the data term (how well does f explain the observed data g?) and S is the smoothness term (how regular is f?).

For continuous problems, the minimum satisfies the Euler-Lagrange equation, a partial differential equation. In practice, we discretize onto a pixel grid and solve a large sparse linear system.

Variational = energy minimization. Optical flow (Chapter 9), stereo matching (Chapter 12), image segmentation, and surface reconstruction all follow the same template: define Edata + λ Esmooth, then minimize. The specific data and smoothness terms change, but the framework is universal.

The bilateral solver extends this by adding an edge-aware smoothness term based on a reference image. It smooths the solution everywhere except at edges in the reference, producing results that respect image boundaries.

What is the general form of a variational energy function in vision?

Chapter 7: Markov Random Fields

A Markov random field (MRF) defines a probability distribution over a grid of labels (like pixel labels in segmentation). Each node in the grid has a label, and the probability depends on two things:

E(x) = ∑i ψi(xi) + ∑(i,j) ψij(xi, xj)

Finding the optimal labeling requires discrete optimization. Key algorithms:

AlgorithmGuaranteeSpeed
Graph cutsGlobal optimum for binary labelsFast
Belief propagationExact on trees, approximate on graphsModerate
Mean fieldApproximateFast
CRFs in deep learning: Conditional random fields (CRFs), a discriminative variant of MRFs, are used as post-processing layers in semantic segmentation networks. They refine coarse CNN outputs by encouraging neighboring pixels with similar colors to get the same label. The fully connected CRF of Krähenbühl and Koltun (2011) revolutionized this.
MRF Labeling

A noisy labeling (left) and the MRF-smoothed result (right). Pairwise potentials encourage neighboring pixels to agree.

What do pairwise potentials in an MRF encode?

Chapter 8: Showcase — Interactive Colorization

A beautiful application of the variational framework: given a grayscale image with sparse color scribbles from a user, propagate the colors to the entire image. This is Levin et al.'s interactive colorization (2004).

The optimization minimizes color variation between neighboring pixels with similar intensities. Where the user provides a color hint, the data term pins the solution. Everywhere else, the smoothness term propagates colors along intensity contours.

Variational Colorization

Sparse color scribbles (left) propagate to fill the entire image (right) using a variational solver.

The variational recipe in action: Edata = color matches scribble where provided. Esmooth = neighboring pixels with similar intensity should have similar color. λ = very high at scribble locations, zero elsewhere. Minimize → color propagates naturally along edges.

Chapter 9: Connections

The optimization tools from this chapter are the backbone of computer vision. Here is where they reappear:

ConceptUsed In
Least squaresCh 8 (Alignment), Ch 9 (Motion), Ch 11 (SfM — bundle adjustment)
RANSACCh 8 (Homography estimation), Ch 11 (Pose estimation)
RegularizationCh 5 (Weight decay in DNNs), Ch 9 (Optical flow), Ch 12 (Stereo)
Variational methodsCh 9 (Horn-Schunck flow), Ch 12 (Dense stereo), Ch 13 (Surface reconstruction)
MRFs / CRFsCh 6 (Semantic segmentation), Ch 12 (Stereo labeling)
Graph cutsCh 12 (Stereo), Ch 10 (Image stitching seams)
Robust loss functionsCh 9 (Robust optical flow), Ch 11 (Robust SfM)
Szeliski's insight: "Most of the algorithms in this book can be understood as instances of the same optimization framework: define an energy, balance data and smoothness, minimize." Learning this framework once gives you the vocabulary for the entire field.
Which optimization technique from this chapter is most commonly used to estimate camera geometry despite outlier correspondences?