Kochenderfer & Wheeler, Chapter 1

Introduction

What is optimization? A mathematical framework for finding the best design, decision, or parameter setting — and the algorithms that get us there.

Prerequisites: Multivariable calculus, basic linear algebra.

Chapters

Simulations

Quizzes

Chapter 0: Why Optimize?

You are designing a wing for an aircraft. The wing has to generate enough lift, but you also want to minimize drag. The shape of the airfoil is controlled by a handful of parameters — thicknesses at various points along the chord. Which values should you choose?

You could try a few by hand. But the search space is enormous: seven parameters, each with a continuous range. Trying every combination is impossible. You need a systematic procedure — an algorithm — that explores the space intelligently and converges on the best design.

The core idea: Optimization is the mathematical framework for finding the "best" value of a set of design variables, where "best" is defined by an objective function and "allowed" is defined by constraints. This book covers the algorithms that solve such problems.

Optimization is everywhere:

Engineering

Shape airfoils, tune controllers, route supply chains.

↓

Machine Learning

Train neural networks by minimizing a loss function over billions of parameters.

↓

Finance

Construct portfolios that maximize return while minimizing risk.

This chapter sets the stage. We will formalize what an optimization problem looks like, define what a "solution" means, and survey the landscape of algorithms covered in this book.

What does an optimization algorithm do?

Evaluates every possible design point Systematically searches for the design point that minimizes (or maximizes) an objective Randomly guesses until a good solution appears

Chapter 1: A Brief History

The quest to find "the best" is ancient. Pythagoras (569-475 BCE) claimed that mathematics could model the world. Euclid (325-265 BCE) proved that a square is the rectangle with maximum area for a given perimeter — an early optimization result.

Fermat (1601-1665) realized you could find optima by setting the derivative to zero. Lagrange (1736-1813) extended this to constrained problems with his method of Lagrange multipliers, combining the objective and constraints into a single function called the Lagrangian.

Key insight: The word "algorithm" comes from algoritmi, the Latin pronunciation of al-Khwarizmi, the 9th-century Persian mathematician who formalized algebra. Optimization sits at the intersection of algebra and calculus.

The modern era began in the mid-20th century with the rise of electronic computers:

Year	Milestone	Contributor
1940	Linear programming formulation	Kantorovich
1947	Simplex algorithm	Dantzig
1952	Dynamic programming	Bellman
1960s	Quasi-Newton methods (BFGS)	Broyden, Fletcher, Goldfarb, Shanno
2014+	Adam optimizer for deep learning	Kingma & Ba

Today, optimization powers everything from training transformer models with billions of parameters (ChatGPT, 2022) to designing aircraft wings and planning medical radiotherapy.

Who introduced Lagrange multipliers for constrained optimization?

Dantzig Fermat Lagrange

Chapter 2: The Optimization Process

A typical engineering design optimization workflow has four stages:

1. Specify

Define design variables, objective function, and constraints.

↓

2. Initialize

Provide a starting design point (baseline).

↓

3. Optimize

Algorithm iteratively refines the design, evaluating performance at each step.

↓

4. Validate

Designer checks the result. If unsatisfactory, revise the formulation.

The designer's job is to formulate the problem correctly. The algorithm's job is to solve it. A brilliant algorithm applied to a poorly formulated problem will still produce a bad answer.

Why automation matters: Traditional engineering design relies on human intuition, which works in 2 or 3 dimensions. Modern problems have millions of variables and constraints. Optimization algorithms can explore these vast spaces systematically.

There are also pitfalls. An optimization algorithm may exploit modeling errors, producing a solution that is "optimal" with respect to the model but useless in practice. The designer must always validate the result against physical reality.

What is the designer's primary responsibility in the optimization process?

Formulating the problem correctly (variables, objective, constraints) Choosing the fastest algorithm Running the algorithm manually

Chapter 3: Mathematical Formulation

Every optimization problem can be written in a standard form:

minimize_x f(x) subject to x ∈ X

Here x is the design point — a vector [x₁, ..., x_n] of design variables. f is the objective function we want to minimize. X is the feasible set — the collection of all allowed designs.

Convention: We always frame problems as minimization. To maximize g(x), just minimize −g(x). This keeps notation consistent throughout the book.

The feasible set is typically defined by constraints:

minimize_x f(x₁, x₂) subject to x₁ ≥ 0, x₂ ≥ 0, x₁ + x₂ ≤ 1

This gives a triangular feasible region in 2D. The solution x* is the point inside that triangle where f is smallest.

Watch out: If constraints use strict inequalities (x > 1 rather than x ≥ 1), the feasible set may not include its boundary. This can mean the problem has no solution — you can always get closer to the boundary but never reach it. Use non-strict inequalities whenever possible.

The no free lunch theorem (Wolpert & Macready, 1997) says no single optimization algorithm is best for all problems. Every algorithm makes assumptions about the objective function — smoothness, convexity, Lipschitz continuity. Choosing the right algorithm depends on knowing your problem's structure.

In the standard formulation minimize f(x) subject to x in X, what is X?

The objective function The feasible set of allowed design points The design variables

Chapter 4: Applications

Optimization problems appear everywhere in engineering and science. Here are key examples from the textbook:

Domain	Design Variables	Objective	Constraints
Aircraft Design	Airfoil thicknesses at 7 chord points	Minimize drag	Maintain lift, minimum thickness
Deep Learning	Millions of weights & biases	Minimize prediction error (loss)	Implicit: network architecture
Portfolio	Fraction of budget per asset	Maximize return, minimize risk	Weights ≥ 0, sum to 1
Radiotherapy	Beam positions, shapes, intensities	Minimize radiation to healthy tissue	Tumor dose ≥ threshold
Robotics	Sequence of moves	Minimize number of steps	Avoid obstacles

Notice the variety: continuous vs. discrete variables, convex vs. non-convex objectives, few vs. millions of dimensions. Different structures call for different algorithms — which is why this book covers so many.

Deep learning connection: Training a neural network is "just" an optimization problem. The design variables are the weights, the objective is the loss function, and gradient descent (Chapter 5) is the workhorse algorithm. The difficulty comes from the sheer scale — billions of parameters — and the non-convexity of the loss landscape.

In a portfolio optimization problem, what constraint ensures you don't invest more than your budget?

Weights are nonnegative and sum to 1 Expected return is maximized Variance is minimized

Chapter 5: Minima

When we minimize f, we seek a global minimizer: a point x* where f(x*) is the smallest value f achieves. But proving global optimality is generally hard. Often the best we can do is verify a local minimum.

A point x* is a local minimizer if f(x*) ≤ f(x) for all x in some neighborhood around x*. There are two flavors:

Type	Definition	Example
Strong (strict) local min	f(x) < f(x) for all nearby x ≠ x	Bottom of a bowl
Weak local min	f(x*) ≤ f(x) for nearby x, but equality possible	Flat valley floor
Global min	f(x) ≤ f(x) for all* x in the feasible set	The absolute lowest point

The fundamental challenge: Most algorithms can only guarantee convergence to a local minimum. Finding the global minimum typically requires either special problem structure (convexity) or global search strategies (simulated annealing, genetic algorithms).

A function may have many local minima but only one global minimum value (though possibly achieved at multiple points). The landscape of local minima is what makes optimization hard — and interesting.

What distinguishes a strong local minimum from a weak one?

A strong local min is strictly less than all nearby points; a weak local min allows ties A strong local min is also a global min A strong local min has a larger second derivative

Chapter 6: Optimality Conditions

For differentiable functions, we can state necessary and sufficient conditions for a point to be a local minimum.

Univariate case: A point x* is at a local minimum if:

f'(x*) = 0 (first-order necessary condition)

f''(x*) ≥ 0 (second-order necessary condition)

For a guaranteed strong local min, we need the stronger condition f''(x*) > 0.

Key insight: f'(x*) = 0 means the tangent line is flat — no direction of immediate improvement. f''(x*) > 0 means the function curves upward — you're at the bottom of a bowl, not the top of a hill or a saddle point.

Multivariate case: Replace the derivative with the gradient and the second derivative with the Hessian matrix:

∇f(x*) = 0 (first-order: gradient is zero)

∇²f(x*) is positive semidefinite (second-order)

The Hessian ∇²f is the matrix of all second partial derivatives. Positive semidefinite means all eigenvalues are ≥ 0, which generalizes the univariate condition f'' ≥ 0 to multiple dimensions.

For a guaranteed strong local min: ∇f(x*) = 0 and ∇²f(x*) is positive definite (all eigenvalues strictly > 0).

What does a positive definite Hessian at a stationary point guarantee?

The point is a strong local minimum The point is a global minimum The point is a saddle point

Chapter 7: Contour Explorer

Let's build intuition for what optimization looks like in 2D. The canvas below shows contour lines of a function f(x₁, x₂) = x₁² + 4x₂² (an elliptic paraboloid). Click anywhere to place a "design point" and see its function value.

2D Contour Visualization

Click on the canvas to place a point. The contour lines show constant values of f. The global minimum is at the origin.

What to notice: The contour lines are ellipses because the function stretches more in x₁ than x₂. The minimum is at the center (0,0) where f = 0. Every point on the same contour has the same function value. Optimization algorithms follow paths that cut across contours toward lower values.

On a contour plot, what do points on the same contour line share?

The same gradient direction The same function value The same x₁ coordinate

Chapter 8: Book Overview

The textbook is organized into five parts that build on each other:

Part I: Fundamentals (Ch 1-2)

Introduction, derivatives, gradients, numerical & automatic differentiation.

↓

Part II: Univariate (Ch 3)

Bracketing, golden section search, bisection — optimizing a single variable.

↓

Part III: Unconstrained (Ch 4-9)

Gradient descent, Newton's method, direct methods, stochastic & population methods.

↓

Part IV: Constrained (Ch 10-14)

Lagrange multipliers, duality, LP, QP, disciplined convex programming.

↓

Part V: Special Topics (Ch 15-20)

Multiobjective, sampling, surrogates, Gaussian processes, optimization under uncertainty.

The progression is logical: first learn to differentiate (Ch 2), then optimize in 1D (Ch 3), then in many dimensions without constraints (Ch 4-9), then with constraints (Ch 10-14), and finally tackle special structures (Ch 15-20).

Which part of the book covers gradient descent and Newton's method?

Part II: Univariate Optimization Part III: Unconstrained Multivariate Optimization Part IV: Constrained Optimization

Chapter 9: Summary

Optimization is the science of finding the best solution from a set of feasible alternatives, guided by an objective function and constrained by the problem's requirements.

Concept	What It Means
Design point x	A vector of design variables we can adjust
Objective function f(x)	The quantity we want to minimize
Feasible set X	The set of allowed design points (defined by constraints)
Global minimum	The lowest f-value over the entire feasible set
Local minimum	The lowest f-value in a neighborhood
First-order condition	∇f(x*) = 0: gradient vanishes at a minimum
Second-order condition	∇²f(x*) positive semidefinite: curvature is non-negative
No free lunch	No algorithm is universally best; structure matters

Looking ahead: Chapter 2 covers derivatives and gradients — the fundamental tool that most optimization algorithms use to decide which direction to search. Without gradients, we are blind; with them, we can descend toward the minimum.

The no free lunch theorem says:

No algorithm outperforms all others on every possible problem Gradient descent is always the best choice All optimization problems are equally hard