Kochenderfer & Wheeler, Chapter 1

Introduction

What is optimization? A mathematical framework for finding the best design, decision, or parameter setting — and the algorithms that get us there.

Prerequisites: Multivariable calculus, basic linear algebra.
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Optimize?

You are designing a wing for an aircraft. The wing has to generate enough lift, but you also want to minimize drag. The shape of the airfoil is controlled by a handful of parameters — thicknesses at various points along the chord. Which values should you choose?

You could try a few by hand. But the search space is enormous: seven parameters, each with a continuous range. Trying every combination is impossible. You need a systematic procedure — an algorithm — that explores the space intelligently and converges on the best design.

The core idea: Optimization is the mathematical framework for finding the "best" value of a set of design variables, where "best" is defined by an objective function and "allowed" is defined by constraints. This book covers the algorithms that solve such problems.

Optimization is everywhere:

Engineering
Shape airfoils, tune controllers, route supply chains.
Machine Learning
Train neural networks by minimizing a loss function over billions of parameters.
Finance
Construct portfolios that maximize return while minimizing risk.

This chapter sets the stage. We will formalize what an optimization problem looks like, define what a "solution" means, and survey the landscape of algorithms covered in this book.

What does an optimization algorithm do?

Chapter 1: A Brief History

The quest to find "the best" is ancient. Pythagoras (569-475 BCE) claimed that mathematics could model the world. Euclid (325-265 BCE) proved that a square is the rectangle with maximum area for a given perimeter — an early optimization result.

Fermat (1601-1665) realized you could find optima by setting the derivative to zero. Lagrange (1736-1813) extended this to constrained problems with his method of Lagrange multipliers, combining the objective and constraints into a single function called the Lagrangian.

Key insight: The word "algorithm" comes from algoritmi, the Latin pronunciation of al-Khwarizmi, the 9th-century Persian mathematician who formalized algebra. Optimization sits at the intersection of algebra and calculus.

The modern era began in the mid-20th century with the rise of electronic computers:

YearMilestoneContributor
1940Linear programming formulationKantorovich
1947Simplex algorithmDantzig
1952Dynamic programmingBellman
1960sQuasi-Newton methods (BFGS)Broyden, Fletcher, Goldfarb, Shanno
2014+Adam optimizer for deep learningKingma & Ba

Today, optimization powers everything from training transformer models with billions of parameters (ChatGPT, 2022) to designing aircraft wings and planning medical radiotherapy.

Who introduced Lagrange multipliers for constrained optimization?

Chapter 2: The Optimization Process

A typical engineering design optimization workflow has four stages:

1. Specify
Define design variables, objective function, and constraints.
2. Initialize
Provide a starting design point (baseline).
3. Optimize
Algorithm iteratively refines the design, evaluating performance at each step.
4. Validate
Designer checks the result. If unsatisfactory, revise the formulation.

The designer's job is to formulate the problem correctly. The algorithm's job is to solve it. A brilliant algorithm applied to a poorly formulated problem will still produce a bad answer.

Why automation matters: Traditional engineering design relies on human intuition, which works in 2 or 3 dimensions. Modern problems have millions of variables and constraints. Optimization algorithms can explore these vast spaces systematically.

There are also pitfalls. An optimization algorithm may exploit modeling errors, producing a solution that is "optimal" with respect to the model but useless in practice. The designer must always validate the result against physical reality.

What is the designer's primary responsibility in the optimization process?

Chapter 3: Mathematical Formulation

Every optimization problem can be written in a standard form:

minimizex f(x)    subject to   x ∈ X

Here x is the design point — a vector [x1, ..., xn] of design variables. f is the objective function we want to minimize. X is the feasible set — the collection of all allowed designs.

Convention: We always frame problems as minimization. To maximize g(x), just minimize −g(x). This keeps notation consistent throughout the book.

The feasible set is typically defined by constraints:

minimizex f(x1, x2)    subject to   x1 ≥ 0,   x2 ≥ 0,   x1 + x2 ≤ 1

This gives a triangular feasible region in 2D. The solution x* is the point inside that triangle where f is smallest.

Watch out: If constraints use strict inequalities (x > 1 rather than x ≥ 1), the feasible set may not include its boundary. This can mean the problem has no solution — you can always get closer to the boundary but never reach it. Use non-strict inequalities whenever possible.

The no free lunch theorem (Wolpert & Macready, 1997) says no single optimization algorithm is best for all problems. Every algorithm makes assumptions about the objective function — smoothness, convexity, Lipschitz continuity. Choosing the right algorithm depends on knowing your problem's structure.

In the standard formulation minimize f(x) subject to x in X, what is X?

Chapter 4: Applications

Optimization problems appear everywhere in engineering and science. Here are key examples from the textbook:

DomainDesign VariablesObjectiveConstraints
Aircraft DesignAirfoil thicknesses at 7 chord pointsMinimize dragMaintain lift, minimum thickness
Deep LearningMillions of weights & biasesMinimize prediction error (loss)Implicit: network architecture
PortfolioFraction of budget per assetMaximize return, minimize riskWeights ≥ 0, sum to 1
RadiotherapyBeam positions, shapes, intensitiesMinimize radiation to healthy tissueTumor dose ≥ threshold
RoboticsSequence of movesMinimize number of stepsAvoid obstacles

Notice the variety: continuous vs. discrete variables, convex vs. non-convex objectives, few vs. millions of dimensions. Different structures call for different algorithms — which is why this book covers so many.

Deep learning connection: Training a neural network is "just" an optimization problem. The design variables are the weights, the objective is the loss function, and gradient descent (Chapter 5) is the workhorse algorithm. The difficulty comes from the sheer scale — billions of parameters — and the non-convexity of the loss landscape.
In a portfolio optimization problem, what constraint ensures you don't invest more than your budget?

Chapter 5: Minima

When we minimize f, we seek a global minimizer: a point x* where f(x*) is the smallest value f achieves. But proving global optimality is generally hard. Often the best we can do is verify a local minimum.

A point x* is a local minimizer if f(x*) ≤ f(x) for all x in some neighborhood around x*. There are two flavors:

TypeDefinitionExample
Strong (strict) local minf(x*) < f(x) for all nearby x ≠ x*Bottom of a bowl
Weak local minf(x*) ≤ f(x) for nearby x, but equality possibleFlat valley floor
Global minf(x*) ≤ f(x) for all x in the feasible setThe absolute lowest point
The fundamental challenge: Most algorithms can only guarantee convergence to a local minimum. Finding the global minimum typically requires either special problem structure (convexity) or global search strategies (simulated annealing, genetic algorithms).

A function may have many local minima but only one global minimum value (though possibly achieved at multiple points). The landscape of local minima is what makes optimization hard — and interesting.

What distinguishes a strong local minimum from a weak one?

Chapter 6: Optimality Conditions

For differentiable functions, we can state necessary and sufficient conditions for a point to be a local minimum.

Univariate case: A point x* is at a local minimum if:

f'(x*) = 0    (first-order necessary condition)
f''(x*) ≥ 0    (second-order necessary condition)

For a guaranteed strong local min, we need the stronger condition f''(x*) > 0.

Key insight: f'(x*) = 0 means the tangent line is flat — no direction of immediate improvement. f''(x*) > 0 means the function curves upward — you're at the bottom of a bowl, not the top of a hill or a saddle point.

Multivariate case: Replace the derivative with the gradient and the second derivative with the Hessian matrix:

∇f(x*) = 0    (first-order: gradient is zero)
∇²f(x*) is positive semidefinite    (second-order)

The Hessian ∇²f is the matrix of all second partial derivatives. Positive semidefinite means all eigenvalues are ≥ 0, which generalizes the univariate condition f'' ≥ 0 to multiple dimensions.

For a guaranteed strong local min: ∇f(x*) = 0 and ∇²f(x*) is positive definite (all eigenvalues strictly > 0).

What does a positive definite Hessian at a stationary point guarantee?

Chapter 7: Contour Explorer

Let's build intuition for what optimization looks like in 2D. The canvas below shows contour lines of a function f(x1, x2) = x1² + 4x2² (an elliptic paraboloid). Click anywhere to place a "design point" and see its function value.

2D Contour Visualization

Click on the canvas to place a point. The contour lines show constant values of f. The global minimum is at the origin.

What to notice: The contour lines are ellipses because the function stretches more in x1 than x2. The minimum is at the center (0,0) where f = 0. Every point on the same contour has the same function value. Optimization algorithms follow paths that cut across contours toward lower values.
On a contour plot, what do points on the same contour line share?

Chapter 8: Book Overview

The textbook is organized into five parts that build on each other:

Part I: Fundamentals (Ch 1-2)
Introduction, derivatives, gradients, numerical & automatic differentiation.
Part II: Univariate (Ch 3)
Bracketing, golden section search, bisection — optimizing a single variable.
Part III: Unconstrained (Ch 4-9)
Gradient descent, Newton's method, direct methods, stochastic & population methods.
Part IV: Constrained (Ch 10-14)
Lagrange multipliers, duality, LP, QP, disciplined convex programming.
Part V: Special Topics (Ch 15-20)
Multiobjective, sampling, surrogates, Gaussian processes, optimization under uncertainty.

The progression is logical: first learn to differentiate (Ch 2), then optimize in 1D (Ch 3), then in many dimensions without constraints (Ch 4-9), then with constraints (Ch 10-14), and finally tackle special structures (Ch 15-20).

Which part of the book covers gradient descent and Newton's method?

Chapter 9: Summary

Optimization is the science of finding the best solution from a set of feasible alternatives, guided by an objective function and constrained by the problem's requirements.

ConceptWhat It Means
Design point xA vector of design variables we can adjust
Objective function f(x)The quantity we want to minimize
Feasible set XThe set of allowed design points (defined by constraints)
Global minimumThe lowest f-value over the entire feasible set
Local minimumThe lowest f-value in a neighborhood
First-order condition∇f(x*) = 0: gradient vanishes at a minimum
Second-order condition∇²f(x*) positive semidefinite: curvature is non-negative
No free lunchNo algorithm is universally best; structure matters
Looking ahead: Chapter 2 covers derivatives and gradients — the fundamental tool that most optimization algorithms use to decide which direction to search. Without gradients, we are blind; with them, we can descend toward the minimum.
The no free lunch theorem says: