When straight lines curve, how do you guarantee where a system can go?
In Chapter 8 we propagated reachable sets through linear systems. A linear map sends a zonotope to another zonotope, an ellipsoid to another ellipsoid. The geometry stays tame because multiplication by a matrix is a well-behaved affine transformation.
Now suppose the dynamics are nonlinear: xk+1 = f(xk), where f might be a neural network, a trigonometric model, or any smooth function. Push a box through f, and the image is no longer a box — it is a warped blob. Push a zonotope through f, and it comes out as a curved shape that no simple representation captures exactly.
The simplest overapproximation strategy starts with the humblest shape of all: the interval. An interval [a, b] represents the set of all real numbers between a and b. In multiple dimensions, a box (Cartesian product of intervals) represents a hyper-rectangle. Interval methods ask: if x ∈ [a, b], what interval must f(x) lie in?
The tradeoff running through this chapter: tighter overapproximation costs more computation. Interval arithmetic is cheap but loose. Taylor models are tight but expensive. The art is choosing the right tool for the problem at hand.
Think of it like wrapping a gift. The exact wrapping follows every contour of the object — perfect but impractical. A box is easy to fold but wastes a lot of paper on oddly shaped gifts. A polynomial surface hugs the shape more closely but takes more effort to construct. Every method in this chapter is a different wrapping strategy.
Before we can propagate sets through nonlinear functions, we need a calculus for intervals. Interval arithmetic defines how to add, subtract, multiply, and divide intervals so the result is guaranteed to contain every possible outcome.
Start with the simplest case. Suppose x ∈ [1, 3] and y ∈ [2, 5]. What values can x + y take? The smallest sum is 1 + 2 = 3. The largest is 3 + 5 = 8. So x + y ∈ [3, 8]. That is interval addition:
Each operation follows the same principle: consider all possible combinations and take the tightest enclosing interval.
| Operation | Formula | Example: [1,3] • [2,5] |
|---|---|---|
| Addition | [a+c, b+d] | [3, 8] |
| Subtraction | [a−d, b−c] | [−4, 1] |
| Multiplication | [min(ac,ad,bc,bd), max(ac,ad,bc,bd)] | [2, 15] |
| Division | [a,b] · [1/d, 1/c] (if 0 ∉ [c,d]) | [0.2, 1.5] |
Subtraction is where it gets interesting. If x ∈ [1, 3] and y ∈ [2, 5], then x − y ranges from 1 − 5 = −4 to 3 − 2 = 1. Notice the endpoints swap: we subtract the upper bound of y from the lower bound of x to get the new lower bound.
Multiplication is messier because signs matter. If one interval is entirely positive and the other entirely negative, the extremes swap. The safest rule: compute all four products of the endpoints and take the min and max.
For monotone functions, interval evaluation is straightforward. If f is monotonically increasing on [a, b], then f([a, b]) = [f(a), f(b)]. If decreasing, f([a, b]) = [f(b), f(a)]. For non-monotone functions like sin or x2, we must find the extrema within the interval.
The squaring rule for intervals containing zero is important: the minimum of x2 is 0 (achieved at x = 0), not min(a2, b2).
Choose a function and an input interval. The green band shows the true image (exact range of f on the interval). The orange band shows what natural interval evaluation gives. Watch the gap grow as the interval widens.
python class Interval: def __init__(self, lo, hi): self.lo, self.hi = lo, hi def __add__(self, other): return Interval(self.lo + other.lo, self.hi + other.hi) def __sub__(self, other): return Interval(self.lo - other.hi, self.hi - other.lo) def __mul__(self, other): prods = [self.lo*other.lo, self.lo*other.hi, self.hi*other.lo, self.hi*other.hi] return Interval(min(prods), max(prods)) # Example: [1,3] + [2,5] = [3,8] x = Interval(1, 3) y = Interval(2, 5) z = x + y # Interval(3, 8)
Interval arithmetic gives us operations on intervals. But what we really need is to evaluate an entire function on an interval input. An inclusion function is a function that maps intervals to intervals and is guaranteed to contain the true image.
Here [x] denotes an interval, and [f̅]([x]) is the inclusion function's output. The guarantee is: no matter which x you pick from [x], f(x) will land inside [f̅]([x]). This is the fundamental enclosure property.
The simplest way to build an inclusion function is natural interval evaluation: take the expression for f, replace every real operation with its interval counterpart, and evaluate. If f(x) = x2 + 2x + 1, then [f̅]([x]) = [x]2 + 2[x] + [1, 1].
A useful inclusion function should satisfy three properties:
| Property | Meaning |
|---|---|
| Enclosure | f([x]) ⊆ [f̅]([x]) for all [x] |
| Thin convergence | If [x] = [a, a] (a point), then [f̅]([a,a]) = [f(a), f(a)] |
| Inclusion isotonicity | If [x] ⊆ [y], then [f̅]([x]) ⊆ [f̅]([y]) |
Thin convergence says that when the input interval shrinks to a point, the output interval also shrinks to a point — the true function value. This is critical: it means that by subdividing the input into small enough pieces, we can make the overapproximation as tight as we want.
Inclusion isotonicity says that a wider input produces a wider output. This is a monotonicity property of the inclusion function itself (not of f). Natural interval evaluation always satisfies this because each interval operation is isotone.
How fast does the overapproximation shrink as the input interval narrows? The excess width is w([f̅]([x])) − w(f([x])), where w denotes interval width. If this excess is O(w([x])k), we say the inclusion function has convergence order k.
Natural interval evaluation has order 1 in general: halving the input width roughly halves the excess. The mean value form (Chapter 4) achieves order 2: halving the input width reduces the excess by a factor of 4. Higher-order Taylor models can achieve order k for any desired k.
Here is the central weakness of natural interval evaluation: it treats every occurrence of a variable as independent. When the same variable appears more than once in an expression, the method does not track the correlation between the occurrences. This leads to overapproximation called the dependency problem (or wrapping effect for set propagation).
Let f(x) = x2 − x. What is the true range of f on [−1, 2]?
Taking the derivative: f'(x) = 2x − 1 = 0 gives x = 0.5. We check the critical point and endpoints:
| x | f(x) = x2 − x |
|---|---|
| −1 | (−1)2 − (−1) = 2 |
| 0.5 | (0.5)2 − 0.5 = −0.25 |
| 2 | (2)2 − 2 = 2 |
So the true image is [−0.25, 2].
Now let's try natural interval evaluation. We compute [−1, 2]2 − [−1, 2]:
Natural interval evaluation gives [−4, 5]. The true answer was [−0.25, 2]. That is a width of 9 instead of 2.25 — a 4× overestimate.
Comparing true image vs. natural interval evaluation for f(x) = x2 − x. Drag the interval endpoints to see how the overapproximation gap changes.
The dependency problem is worst when:
1. The same variable appears many times. Each additional occurrence introduces a new "phantom" degree of freedom that inflates the result.
2. The interval is wide. On a tiny interval, all occurrences of x are near the same value, so treating them as independent barely matters. On a wide interval, the discrepancy is large.
3. The function is far from monotone. Monotone functions have no dependency problem at all (the range is just [f(a), f(b)] or [f(b), f(a)]). It is the turning points inside the interval that create the trouble.
Natural interval evaluation is crude: it replaces every real operation with an interval operation and hopes for the best. The mean value form is smarter. It exploits calculus — specifically, the mean value theorem — to build a tighter inclusion function.
The idea: linearize f around the center of the interval, then bound the linearization error using interval evaluation of the derivative. For a function f: R → R and interval [x] with center c = (a + b) / 2:
We do not know ξ, but we know it lies in [x]. So f'(ξ) ∈ f'([x]) (evaluated using interval arithmetic on the derivative). The mean value inclusion function is:
Here f(c) is a real number (the function evaluated at the center), f'([x]) is the interval-evaluated derivative, and ([x] − c) is the interval [a − c, b − c] = [−w/2, w/2] where w is the interval width.
Let f(x) = x2 − x on [−1, 2]. The center is c = 0.5. f(c) = 0.25 − 0.5 = −0.25. The derivative is f'(x) = 2x − 1, so f'([−1, 2]) = 2[−1, 2] − 1 = [−2, 4] − 1 = [−3, 3]. And [x] − c = [−1.5, 1.5].
Width 9.0 — barely better than the natural evaluation's width 9.0 for this example. Why? Because f'([x]) = [−3, 3] is a wide interval centered at 0, and multiplying by the half-width gives a huge spread.
The mean value form has quadratic convergence: if the interval width is w, the excess width is O(w2). This means halving the input interval reduces the excess by a factor of 4, compared to only 2 for natural evaluation. On small intervals, this makes a huge difference.
| Method | Convergence order | Cost |
|---|---|---|
| Natural interval evaluation | 1 (linear) | Evaluate f once |
| Mean value form | 2 (quadratic) | Evaluate f + f' |
| Second-order Taylor (Ch 5) | 3 (cubic) | Evaluate f + f' + f'' |
For multivariable functions f: Rn → Rm, the mean value form uses the Jacobian matrix:
where Jf([x]) is the interval-evaluated Jacobian. Each entry of the Jacobian is an interval, and the matrix-vector product uses interval arithmetic.
The mean value form uses a first-order Taylor expansion and bounds the remainder with interval arithmetic on the derivative. The natural next step: use higher-order Taylor expansions for tighter bounds.
Expand f around the center c of interval [x]:
The first two terms are exact (evaluated at the point c). Only the last term needs bounding. Since ξ ∈ [x], we bound f''(ξ) ∈ f''([x]) using interval arithmetic:
Notice that f'(c) is a real number (no interval). This means the linear term contributes no overestimation — it shifts the interval exactly. All overestimation is confined to the quadratic remainder term.
f(x) = x2 − x on [−1, 2], c = 0.5.
f(c) = −0.25, f'(c) = 0 (the derivative at the center is 2(0.5) − 1 = 0), f''(x) = 2 (constant).
This is the exact answer! Because f is a quadratic polynomial, the second-order Taylor expansion captures it perfectly — there is no remainder.
The first k−1 terms are exact point evaluations. Only the k-th term is bounded with interval arithmetic. This structure is key: the more terms you compute exactly, the less overestimation you incur.
For f: Rn → R, the second-order form involves the Hessian matrix Hf:
The Hessian Hf([x]) is an n × n matrix of intervals. Computing it requires n(n+1)/2 second partial derivatives, each evaluated on the input interval. For n = 100 (a small neural network layer), that is 5050 interval Hessian entries — expensive, but still cheaper than naive Monte Carlo sampling for formal guarantees.
Taylor expansions bound the function output as an interval. But an interval is a box — it cannot capture correlations between output dimensions. Taylor models go further: the output is a polynomial (which can represent correlations) plus a small interval remainder.
Here p(x) is a polynomial of degree k (the Taylor polynomial), and [R] is an interval bounding the approximation error. The polynomial captures the smooth part of f, and the remainder captures everything else.
Consider a 2D function f: R2 → R2 where f1 and f2 are correlated (e.g., f1(x) = x, f2(x) = x2). The true image is a parabolic curve. An interval box bounds this curve with a rectangle, wasting huge area in the corners. A Taylor model represents the output as (x, x2 + [R]), which traces the parabola with only a thin error band — much tighter.
Just as interval arithmetic defines +, −, · for intervals, Taylor model arithmetic defines operations on polynomial-plus-remainder pairs:
| Operation | Result |
|---|---|
| (p1 + [R1]) + (p2 + [R2]) | (p1 + p2) + ([R1] + [R2]) |
| (p1 + [R1]) · (p2 + [R2]) | truncate(p1 · p2, k) + [R·] |
Addition is straightforward: add polynomials, add remainders. Multiplication is trickier: the product of two degree-k polynomials is degree 2k. We truncate to degree k, sweeping the higher-order terms into the remainder interval. This is where overapproximation enters.
A major advantage of Taylor models: they compose naturally. If you have a Taylor model for g(x) and want to compute f(g(x)), you substitute the Taylor model for g into the Taylor expansion of f. The polynomial part remains polynomial (after truncation), and the remainder absorbs all the error terms.
This is exactly what happens when propagating sets through a deep network layer by layer. Each layer's Taylor model feeds into the next. The remainder grows with each composition — but much slower than with plain interval arithmetic, because the polynomial captures the dominant behavior.
We have seen two strategies: concrete methods (interval arithmetic, mean value form) that evaluate the function on intervals and produce interval outputs, and symbolic methods (Taylor models) that maintain a polynomial expression in the input variables. Each has distinct tradeoffs.
Concrete methods evaluate greedily: at each operation, they produce an interval and forget the structure that produced it. This is fast and simple, but information is lost at every step — especially the dependency between variables.
Symbolic methods carry forward the algebraic relationship between inputs and outputs. A Taylor model knows that "output dimension 1 is approximately 2x + 3y" and "output dimension 2 is approximately x − y". This structural knowledge means that when you later compute dimension 1 + dimension 2, the x and y terms combine correctly, respecting dependencies.
| Aspect | Concrete (Interval) | Symbolic (Taylor Model) |
|---|---|---|
| Representation | Interval per output | Polynomial + remainder per output |
| Dependency tracking | None | Yes (through shared input variables) |
| Cost per operation | O(1) | O(kn) for degree k, n variables |
| Multi-step propagation | Wrapping effect (exponential growth) | Tighter (polynomial growth of remainder) |
| Best for | Single-step, low dimension | Multi-step, correlated outputs |
Symbolic methods shine when composing many functions (deep networks, long time horizons). But the polynomial complexity grows with each composition. After L layers of a neural network with n neurons each, a degree-k Taylor model has O((kL + n choose n)) terms — exponential in L for fixed k. In practice, you must periodically concretize: evaluate the polynomial on its input intervals, collapse it to a new interval, and start a fresh Taylor model. This resets the polynomial complexity at the cost of introducing one round of wrapping.
So far we have propagated sets forward using arithmetic. There is an entirely different approach: formulate the reachability question as an optimization problem. Instead of asking "what interval does f([x]) land in?", ask "what is the maximum (or minimum) of f over [x]?"
If we can solve these two optimization problems, the range of f on [x] is exactly [fmin, fmax] — no overapproximation at all. The catch: for general nonlinear f, global optimization is NP-hard.
A support function generalizes the idea to any direction d. Instead of finding the range along coordinate axes, find the extreme of the dot product dTf(x):
By evaluating ρ in many directions, you can reconstruct a tight polytope bounding the reachable set. This is the support function approach to reachability, and it reduces set propagation to a sequence of optimization problems.
For neural networks with ReLU activations, there is a beautiful trick. The ReLU function max(0, x) is piecewise linear. Its behavior at each neuron can be encoded as a binary variable: z = 1 if the neuron is active (x ≥ 0), z = 0 if inactive (x < 0).
Here L and U are lower and upper bounds on x (computed by interval arithmetic), and z ∈ {0, 1}. These four constraints, together with the binary variable z, exactly encode the ReLU. The entire network becomes a mixed-integer linear program (MILP).
The MILP approach gives exact bounds on the network output — no overapproximation. But it is exponential in the number of unstable neurons. For a network with 1000 unstable neurons, 21000 branches is intractable. In practice, modern MILP solvers use branch-and-bound with LP relaxations to prune the tree, making problems with hundreds of unstable neurons tractable.
All the methods so far propagate a single set through the nonlinear function. There is a universal trick to tighten any overapproximation: subdivide the input.
Split the input interval [x] into smaller pieces [x]1, [x]2, ..., [x]k. Compute the inclusion function on each piece. The union of the results still contains the true image, but each piece's overapproximation is tighter (because the dependency problem is less severe on narrow intervals).
A neural network with L layers and nl neurons per layer is a composition f = fL ∘ fL−1 ∘ … ∘ f1, where each fl is an affine map followed by a nonlinear activation. The verification question: given input set X, does the output ever enter an unsafe region?
See how subdividing the input interval tightens the overapproximation for f(x) = x2 − x. The green band is the true image. Orange boxes are interval evaluations on each partition. More partitions = tighter, but more boxes.
For ReLU networks, the number of possible activation patterns is 2N where N is the total number of neurons. Each pattern corresponds to a linear region of the network. Verification requires reasoning about which patterns are reachable from the input set.
Partitioning the input helps by reducing the number of unstable neurons in each partition. If a partition is small enough that every neuron's sign is determined, the network is linear on that partition and verification is trivial (LP, not MILP). The challenge: finding the right partition granularity. Too coarse = too many unstable neurons. Too fine = too many partitions.
| Method | Completeness | Scalability | Tightness |
|---|---|---|---|
| Interval arithmetic | Incomplete | O(N) | Loose |
| Abstract interpretation | Incomplete | O(N2) | Moderate |
| MILP | Complete | O(2N) worst case | Exact |
| Branch & bound | Complete | Variable | Exact |
What comes next: This chapter handled nonlinear continuous systems. Chapter 10 tackles discrete systems, where the state space is finite and reachability becomes a graph search problem. The tools change from calculus to combinatorics, but the core question remains: where can the system go?