Ch 9: Nonlinear Reachability — Algorithms for Validation

Chapter 0: Why Intervals?

In Chapter 8 we propagated reachable sets through linear systems. A linear map sends a zonotope to another zonotope, an ellipsoid to another ellipsoid. The geometry stays tame because multiplication by a matrix is a well-behaved affine transformation.

Now suppose the dynamics are nonlinear: x_k+1 = f(x_k), where f might be a neural network, a trigonometric model, or any smooth function. Push a box through f, and the image is no longer a box — it is a warped blob. Push a zonotope through f, and it comes out as a curved shape that no simple representation captures exactly.

The core challenge: We cannot represent the exact image f(X) of a set X under a nonlinear map. Instead, we must overapproximate — find a simple shape (box, zonotope, polynomial) guaranteed to contain f(X). The tighter the overapproximation, the more useful it is. Too loose, and the "reachable" set includes states the system can never actually reach.

The simplest overapproximation strategy starts with the humblest shape of all: the interval. An interval [a, b] represents the set of all real numbers between a and b. In multiple dimensions, a box (Cartesian product of intervals) represents a hyper-rectangle. Interval methods ask: if x ∈ [a, b], what interval must f(x) lie in?

Input set

A box (interval per dimension)

↓ apply nonlinear f

Exact image

A curved blob — hard to represent

↓ overapproximate

Interval hull

A bounding box guaranteed to contain the image

The tradeoff running through this chapter: tighter overapproximation costs more computation. Interval arithmetic is cheap but loose. Taylor models are tight but expensive. The art is choosing the right tool for the problem at hand.

Think of it like wrapping a gift. The exact wrapping follows every contour of the object — perfect but impractical. A box is easy to fold but wastes a lot of paper on oddly shaped gifts. A polynomial surface hugs the shape more closely but takes more effort to construct. Every method in this chapter is a different wrapping strategy.

Why this matters: Nonlinear reachability is the foundation of neural network verification. If your autonomous car uses a neural net controller, proving it never steers into a wall requires propagating input uncertainties through every nonlinear layer. The methods in this chapter are exactly how modern verification tools (like NNV, ERAN, and alpha-beta-CROWN) work under the hood.

Why can't we just use the linear reachability methods from Chapter 8 on nonlinear systems?

Linear methods are slower than nonlinear methods Nonlinear maps distort simple geometric shapes into curved blobs that linear representations cannot capture Linear methods only work for 2D systems

Chapter 1: Interval Arithmetic

Before we can propagate sets through nonlinear functions, we need a calculus for intervals. Interval arithmetic defines how to add, subtract, multiply, and divide intervals so the result is guaranteed to contain every possible outcome.

Start with the simplest case. Suppose x ∈ [1, 3] and y ∈ [2, 5]. What values can x + y take? The smallest sum is 1 + 2 = 3. The largest is 3 + 5 = 8. So x + y ∈ [3, 8]. That is interval addition:

[a, b] + [c, d] = [a + c, b + d]

Each operation follows the same principle: consider all possible combinations and take the tightest enclosing interval.

The Four Basic Operations

Operation	Formula	Example: [1,3] • [2,5]
Addition	[a+c, b+d]	[3, 8]
Subtraction	[a−d, b−c]	[−4, 1]
Multiplication	[min(ac,ad,bc,bd), max(ac,ad,bc,bd)]	[2, 15]
Division	[a,b] · [1/d, 1/c] (if 0 ∉ [c,d])	[0.2, 1.5]

Subtraction is where it gets interesting. If x ∈ [1, 3] and y ∈ [2, 5], then x − y ranges from 1 − 5 = −4 to 3 − 2 = 1. Notice the endpoints swap: we subtract the upper bound of y from the lower bound of x to get the new lower bound.

Multiplication is messier because signs matter. If one interval is entirely positive and the other entirely negative, the extremes swap. The safest rule: compute all four products of the endpoints and take the min and max.

Key insight: Every interval arithmetic operation is exact for independent variables. If x and y are truly independent (knowing x tells you nothing about y), the intervals above are as tight as possible. The trouble starts when the same variable appears more than once — that is the dependency problem we tackle in Chapter 3.

Elementary Functions

For monotone functions, interval evaluation is straightforward. If f is monotonically increasing on [a, b], then f([a, b]) = [f(a), f(b)]. If decreasing, f([a, b]) = [f(b), f(a)]. For non-monotone functions like sin or x², we must find the extrema within the interval.

exp([a, b]) = [exp(a), exp(b)]

x² on [a, b] = [min(a², b²), max(a², b²)] if 0 ∉ [a,b]

x² on [a, b] = [0, max(a², b²)] if 0 ∈ [a,b]

The squaring rule for intervals containing zero is important: the minimum of x² is 0 (achieved at x = 0), not min(a², b²).

Interval Arithmetic Calculator

Choose a function and an input interval. The green band shows the true image (exact range of f on the interval). The orange band shows what natural interval evaluation gives. Watch the gap grow as the interval widens.

Left endpoint a -1.0

Right endpoint b 2.0

What to notice: For x² on [−1, 2], the true image is [0, 4] (the minimum is at x = 0). Natural interval evaluation computes [−1, 2]² = [−1, 2] · [−1, 2] = [−2, 4] — treating the two copies of x as independent. The overapproximation [−2, 4] includes negative values that x² can never produce. This gap is the dependency problem.

python
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        prods = [self.lo*other.lo, self.lo*other.hi,
                 self.hi*other.lo, self.hi*other.hi]
        return Interval(min(prods), max(prods))

# Example: [1,3] + [2,5] = [3,8]
x = Interval(1, 3)
y = Interval(2, 5)
z = x + y  # Interval(3, 8)

What is [−2, 3] + [1, 4]?

[−1, 7] [−6, 7] [−2, 12]

Chapter 2: Inclusion Functions

Interval arithmetic gives us operations on intervals. But what we really need is to evaluate an entire function on an interval input. An inclusion function is a function that maps intervals to intervals and is guaranteed to contain the true image.

f(x) ∈ [f̅]([x]) for all x ∈ [x]

Here [x] denotes an interval, and [f̅]([x]) is the inclusion function's output. The guarantee is: no matter which x you pick from [x], f(x) will land inside [f̅]([x]). This is the fundamental enclosure property.

The simplest way to build an inclusion function is natural interval evaluation: take the expression for f, replace every real operation with its interval counterpart, and evaluate. If f(x) = x² + 2x + 1, then [f̅]([x]) = [x]² + 2[x] + [1, 1].

Key insight: An inclusion function is not unique. There are many functions [f̅] that satisfy the enclosure property for a given f. The tightest possible inclusion function returns exactly the true image: [f̅]([x]) = {f(x) : x ∈ [x]}. Natural interval evaluation is usually looser than this optimum. The rest of this chapter is about building tighter inclusion functions.

Properties of Good Inclusion Functions

A useful inclusion function should satisfy three properties:

Property	Meaning
Enclosure	f([x]) ⊆ [f̅]([x]) for all [x]
Thin convergence	If [x] = [a, a] (a point), then [f̅]([a,a]) = [f(a), f(a)]
Inclusion isotonicity	If [x] ⊆ [y], then [f̅]([x]) ⊆ [f̅]([y])

Thin convergence says that when the input interval shrinks to a point, the output interval also shrinks to a point — the true function value. This is critical: it means that by subdividing the input into small enough pieces, we can make the overapproximation as tight as we want.

Inclusion isotonicity says that a wider input produces a wider output. This is a monotonicity property of the inclusion function itself (not of f). Natural interval evaluation always satisfies this because each interval operation is isotone.

Zonotope analogy: In Chapter 8 we used zonotopes to represent reachable sets for linear systems. An inclusion function is the nonlinear counterpart: it takes a set description (interval) in and produces a set description (interval) out, with a containment guarantee. Zonotopes are tighter but only work for linear maps. Intervals are coarser but work for anything.

Convergence Order

How fast does the overapproximation shrink as the input interval narrows? The excess width is w([f̅]([x])) − w(f([x])), where w denotes interval width. If this excess is O(w([x])^k), we say the inclusion function has convergence order k.

Natural interval evaluation has order 1 in general: halving the input width roughly halves the excess. The mean value form (Chapter 4) achieves order 2: halving the input width reduces the excess by a factor of 4. Higher-order Taylor models can achieve order k for any desired k.

What does the enclosure property of an inclusion function guarantee?

The output interval is always as tight as possible Every possible output of f on the input interval lies inside the output interval The function f must be monotonic

Chapter 3: The Dependency Effect

Here is the central weakness of natural interval evaluation: it treats every occurrence of a variable as independent. When the same variable appears more than once in an expression, the method does not track the correlation between the occurrences. This leads to overapproximation called the dependency problem (or wrapping effect for set propagation).

A Concrete Example: x² − x on [−1, 2]

Let f(x) = x² − x. What is the true range of f on [−1, 2]?

Taking the derivative: f'(x) = 2x − 1 = 0 gives x = 0.5. We check the critical point and endpoints:

x	f(x) = x² − x
−1	(−1)² − (−1) = 2
0.5	(0.5)² − 0.5 = −0.25
2	(2)² − 2 = 2

So the true image is [−0.25, 2].

Now let's try natural interval evaluation. We compute [−1, 2]² − [−1, 2]:

Step 1: [x]²

[−1,2] · [−1,2] = [min(−2,1,−2,4), max(...)] = [−2, 4]

↓

Step 2: subtract [x]

[−2, 4] − [−1, 2] = [−2−2, 4−(−1)] = [−4, 5]

Natural interval evaluation gives [−4, 5]. The true answer was [−0.25, 2]. That is a width of 9 instead of 2.25 — a 4× overestimate.

What went wrong: In step 1, the method treated the two copies of x in x · x as independent variables. It allowed x₁ = −1 and x₂ = 2 simultaneously, producing the spurious product −2. In step 2, it allowed the x in x² to be different from the x being subtracted. The dependency x = x (the fact that both are the same variable) is lost.

Dependency Effect Visualizer

Comparing true image vs. natural interval evaluation for f(x) = x² − x. Drag the interval endpoints to see how the overapproximation gap changes.

Left endpoint a -1.0

Right endpoint b 2.0

When is the Dependency Problem Worst?

The dependency problem is worst when:

1. The same variable appears many times. Each additional occurrence introduces a new "phantom" degree of freedom that inflates the result.

2. The interval is wide. On a tiny interval, all occurrences of x are near the same value, so treating them as independent barely matters. On a wide interval, the discrepancy is large.

3. The function is far from monotone. Monotone functions have no dependency problem at all (the range is just [f(a), f(b)] or [f(b), f(a)]). It is the turning points inside the interval that create the trouble.

A remedy: Rewriting the expression to minimize repeated variables helps. For f(x) = x² − x, completing the square gives f(x) = (x − 0.5)² − 0.25. Now x appears only once (inside the square), so natural interval evaluation gives a tighter result. The centered form [−1 − 0.5, 2 − 0.5]² − 0.25 = [−1.5, 1.5]² − 0.25 = [0, 2.25] − 0.25 = [−0.25, 2]. This is the exact answer!

Why does natural interval evaluation of x² − x on [−1, 2] give [−4, 5] instead of the true [−0.25, 2]?

It treats each occurrence of x as an independent variable, losing the constraint that they are equal It uses the wrong formula for squaring Interval arithmetic cannot handle subtraction

Chapter 4: The Mean Value Theorem Trick

Natural interval evaluation is crude: it replaces every real operation with an interval operation and hopes for the best. The mean value form is smarter. It exploits calculus — specifically, the mean value theorem — to build a tighter inclusion function.

The idea: linearize f around the center of the interval, then bound the linearization error using interval evaluation of the derivative. For a function f: R → R and interval [x] with center c = (a + b) / 2:

f(x) = f(c) + f'(ξ)(x − c) for some ξ between c and x

We do not know ξ, but we know it lies in [x]. So f'(ξ) ∈ f'([x]) (evaluated using interval arithmetic on the derivative). The mean value inclusion function is:

[f̅]_MV([x]) = f(c) + f'([x]) · ([x] − c)

Here f(c) is a real number (the function evaluated at the center), f'([x]) is the interval-evaluated derivative, and ([x] − c) is the interval [a − c, b − c] = [−w/2, w/2] where w is the interval width.

Worked Example

Let f(x) = x² − x on [−1, 2]. The center is c = 0.5. f(c) = 0.25 − 0.5 = −0.25. The derivative is f'(x) = 2x − 1, so f'([−1, 2]) = 2[−1, 2] − 1 = [−2, 4] − 1 = [−3, 3]. And [x] − c = [−1.5, 1.5].

[f̅]_MV = −0.25 + [−3, 3] · [−1.5, 1.5] = −0.25 + [−4.5, 4.5] = [−4.75, 4.25]

Width 9.0 — barely better than the natural evaluation's width 9.0 for this example. Why? Because f'([x]) = [−3, 3] is a wide interval centered at 0, and multiplying by the half-width gives a huge spread.

When does mean value shine? When f is nearly linear on [x]. If f'([x]) is a narrow interval (the derivative barely varies), the mean value form is very tight. For our example, the derivative 2x − 1 swings from −3 to 3 over [−1, 2] — that is a large swing, so the benefit is small. On a narrower interval like [0, 1], f'([0, 1]) = [−1, 1], and the mean value form gives [−0.25, 0.75], closer to the truth [−0.25, 0].

Convergence Order

The mean value form has quadratic convergence: if the interval width is w, the excess width is O(w²). This means halving the input interval reduces the excess by a factor of 4, compared to only 2 for natural evaluation. On small intervals, this makes a huge difference.

Key insight: The mean value form separates the function into a point evaluation f(c) (exact) plus a linear correction (bounded by interval arithmetic). The only overestimation comes from the correction term. As the interval shrinks, the correction term shrinks quadratically because both f'([x]) and ([x] − c) shrink linearly.

Method	Convergence order	Cost
Natural interval evaluation	1 (linear)	Evaluate f once
Mean value form	2 (quadratic)	Evaluate f + f'
Second-order Taylor (Ch 5)	3 (cubic)	Evaluate f + f' + f''

For multivariable functions f: Rⁿ → R^m, the mean value form uses the Jacobian matrix:

[f̅]_MV([x]) = f(c) + J_f([x]) · ([x] − c)

where J_f([x]) is the interval-evaluated Jacobian. Each entry of the Jacobian is an interval, and the matrix-vector product uses interval arithmetic.

Why does the mean value form achieve quadratic convergence while natural evaluation only achieves linear?

It uses a different definition of interval width It avoids computing derivatives The overestimation comes only from the correction term, where both the derivative interval and the centered input interval shrink linearly with width

Chapter 5: Taylor Expansions

The mean value form uses a first-order Taylor expansion and bounds the remainder with interval arithmetic on the derivative. The natural next step: use higher-order Taylor expansions for tighter bounds.

Second-Order Form

Expand f around the center c of interval [x]:

f(x) = f(c) + f'(c)(x − c) + ½ f''(ξ)(x − c)²

The first two terms are exact (evaluated at the point c). Only the last term needs bounding. Since ξ ∈ [x], we bound f''(ξ) ∈ f''([x]) using interval arithmetic:

[f̅]_T2([x]) = f(c) + f'(c) · ([x] − c) + ½ f''([x]) · ([x] − c)²

Notice that f'(c) is a real number (no interval). This means the linear term contributes no overestimation — it shifts the interval exactly. All overestimation is confined to the quadratic remainder term.

Worked Example

f(x) = x² − x on [−1, 2], c = 0.5.

f(c) = −0.25, f'(c) = 0 (the derivative at the center is 2(0.5) − 1 = 0), f''(x) = 2 (constant).

[f̅]_T2 = −0.25 + 0 · [−1.5, 1.5] + ½ · 2 · [−1.5, 1.5]²

= −0.25 + [0, 2.25] = [−0.25, 2.0]

This is the exact answer! Because f is a quadratic polynomial, the second-order Taylor expansion captures it perfectly — there is no remainder.

The power of higher order: For our quadratic example, second-order Taylor is exact. For general functions, the second-order form has convergence order 3 (cubic), meaning the excess width is O(w³). Each additional Taylor term buys one more order of convergence. The cost: computing and bounding higher derivatives, which becomes increasingly expensive for multivariate functions.

The General Pattern

[f̅]_Tk([x]) = ∑_i=0^k−1 &frac{f⁽ⁱ⁾(c)}{i!} ([x] − c)ⁱ + &frac{f^(k)([x])}{k!} ([x] − c)^k

The first k−1 terms are exact point evaluations. Only the k-th term is bounded with interval arithmetic. This structure is key: the more terms you compute exactly, the less overestimation you incur.

The Hessian in Multiple Dimensions

For f: Rⁿ → R, the second-order form involves the Hessian matrix H_f:

[f̅]_T2([x]) = f(c) + ∇f(c)^T([x] − c) + ½ ([x] − c)^T H_f([x]) ([x] − c)

The Hessian H_f([x]) is an n × n matrix of intervals. Computing it requires n(n+1)/2 second partial derivatives, each evaluated on the input interval. For n = 100 (a small neural network layer), that is 5050 interval Hessian entries — expensive, but still cheaper than naive Monte Carlo sampling for formal guarantees.

Why does the second-order Taylor form give the exact answer for f(x) = x² − x?

Because the interval is narrow enough Because the derivative is zero at the center Because f is a degree-2 polynomial, so the second-order Taylor expansion captures it exactly with no remainder

Chapter 6: Taylor Models

Taylor expansions bound the function output as an interval. But an interval is a box — it cannot capture correlations between output dimensions. Taylor models go further: the output is a polynomial (which can represent correlations) plus a small interval remainder.

f(x) ∈ p(x) + [R] for all x ∈ [x]

Here p(x) is a polynomial of degree k (the Taylor polynomial), and [R] is an interval bounding the approximation error. The polynomial captures the smooth part of f, and the remainder captures everything else.

Key insight: A Taylor model is not a number or an interval — it is a function plus an error bound. The polynomial p(x) varies with x (it is a symbolic expression), while [R] is a fixed interval. Together, they define a band around p(x) that is guaranteed to contain f(x).

Why Polynomials Beat Boxes

Consider a 2D function f: R² → R² where f₁ and f₂ are correlated (e.g., f₁(x) = x, f₂(x) = x²). The true image is a parabolic curve. An interval box bounds this curve with a rectangle, wasting huge area in the corners. A Taylor model represents the output as (x, x² + [R]), which traces the parabola with only a thin error band — much tighter.

Arithmetic on Taylor Models

Just as interval arithmetic defines +, −, · for intervals, Taylor model arithmetic defines operations on polynomial-plus-remainder pairs:

Operation	Result
(p₁ + [R₁]) + (p₂ + [R₂])	(p₁ + p₂) + ([R₁] + [R₂])
(p₁ + [R₁]) · (p₂ + [R₂])	truncate(p₁ · p₂, k) + [R_·]

Addition is straightforward: add polynomials, add remainders. Multiplication is trickier: the product of two degree-k polynomials is degree 2k. We truncate to degree k, sweeping the higher-order terms into the remainder interval. This is where overapproximation enters.

The truncation tradeoff: Higher polynomial degree k means tighter bounds but more expensive computation. Each multiplication of two degree-k polynomials in n variables produces O((2k choose n)) terms before truncation. For neural network verification where n might be hundreds, even k = 2 becomes expensive. Most practical tools use k = 1 (affine arithmetic) or k = 2.

Composing Taylor Models

A major advantage of Taylor models: they compose naturally. If you have a Taylor model for g(x) and want to compute f(g(x)), you substitute the Taylor model for g into the Taylor expansion of f. The polynomial part remains polynomial (after truncation), and the remainder absorbs all the error terms.

This is exactly what happens when propagating sets through a deep network layer by layer. Each layer's Taylor model feeds into the next. The remainder grows with each composition — but much slower than with plain interval arithmetic, because the polynomial captures the dominant behavior.

What advantage does a Taylor model (polynomial + interval remainder) have over a plain interval?

The polynomial captures correlations and smooth structure, confining overestimation to the small remainder term Taylor models are always faster to compute Taylor models do not require derivatives

Chapter 7: Symbolic vs. Concrete

We have seen two strategies: concrete methods (interval arithmetic, mean value form) that evaluate the function on intervals and produce interval outputs, and symbolic methods (Taylor models) that maintain a polynomial expression in the input variables. Each has distinct tradeoffs.

Concrete (Interval) Methods

Concrete methods evaluate greedily: at each operation, they produce an interval and forget the structure that produced it. This is fast and simple, but information is lost at every step — especially the dependency between variables.

Symbolic (Taylor Model) Methods

Symbolic methods carry forward the algebraic relationship between inputs and outputs. A Taylor model knows that "output dimension 1 is approximately 2x + 3y" and "output dimension 2 is approximately x − y". This structural knowledge means that when you later compute dimension 1 + dimension 2, the x and y terms combine correctly, respecting dependencies.

The wrapping effect: When propagating sets through multiple time steps (x_k+1 = f(x_k)), concrete methods suffer from the wrapping effect. At each step, the curved image is overapproximated by a box, and the next step starts from this inflated box. After many steps, the boxes grow exponentially larger than the true reachable set. Symbolic methods delay this wrapping by maintaining the polynomial structure across steps.

Aspect	Concrete (Interval)	Symbolic (Taylor Model)
Representation	Interval per output	Polynomial + remainder per output
Dependency tracking	None	Yes (through shared input variables)
Cost per operation	O(1)	O(kⁿ) for degree k, n variables
Multi-step propagation	Wrapping effect (exponential growth)	Tighter (polynomial growth of remainder)
Best for	Single-step, low dimension	Multi-step, correlated outputs

The Composition Tradeoff

Symbolic methods shine when composing many functions (deep networks, long time horizons). But the polynomial complexity grows with each composition. After L layers of a neural network with n neurons each, a degree-k Taylor model has O((kL + n choose n)) terms — exponential in L for fixed k. In practice, you must periodically concretize: evaluate the polynomial on its input intervals, collapse it to a new interval, and start a fresh Taylor model. This resets the polynomial complexity at the cost of introducing one round of wrapping.

The art of verification: Practical tools like ERAN and alpha-beta-CROWN mix concrete and symbolic methods. They use symbolic propagation within "easy" layers (linear, ReLU) and concretize at "hard" layers (softmax, complex activations). Finding the optimal concretization schedule is itself an optimization problem.

What is the main advantage of symbolic (Taylor model) methods over concrete (interval) methods when propagating sets through multiple time steps?

Symbolic methods are always faster Symbolic methods track dependencies between variables, avoiding the exponential wrapping effect Symbolic methods do not require derivatives

Chapter 8: Optimization & MILP

So far we have propagated sets forward using arithmetic. There is an entirely different approach: formulate the reachability question as an optimization problem. Instead of asking "what interval does f([x]) land in?", ask "what is the maximum (or minimum) of f over [x]?"

f_max = max_{x ∈ [x]} f(x) f_min = min_{x ∈ [x]} f(x)

If we can solve these two optimization problems, the range of f on [x] is exactly [f_min, f_max] — no overapproximation at all. The catch: for general nonlinear f, global optimization is NP-hard.

Support Functions

A support function generalizes the idea to any direction d. Instead of finding the range along coordinate axes, find the extreme of the dot product d^Tf(x):

ρ_S(d) = max_{x ∈ X} d^T x

By evaluating ρ in many directions, you can reconstruct a tight polytope bounding the reachable set. This is the support function approach to reachability, and it reduces set propagation to a sequence of optimization problems.

ReLU Networks and MILP

For neural networks with ReLU activations, there is a beautiful trick. The ReLU function max(0, x) is piecewise linear. Its behavior at each neuron can be encoded as a binary variable: z = 1 if the neuron is active (x ≥ 0), z = 0 if inactive (x < 0).

ReLU(x) = max(0, x) ↔ y ≥ 0, y ≥ x, y ≤ x − L(1 − z), y ≤ Uz

Here L and U are lower and upper bounds on x (computed by interval arithmetic), and z ∈ {0, 1}. These four constraints, together with the binary variable z, exactly encode the ReLU. The entire network becomes a mixed-integer linear program (MILP).

The MILP encoding: A ReLU network with n neurons has n binary variables. MILP solvers like Gurobi and CPLEX can handle this, but the worst case is 2ⁿ branches. The key optimization: bound tightening. If interval arithmetic proves that a neuron is always active (L ≥ 0) or always inactive (U ≤ 0), we fix z and eliminate the binary variable. Tighter bounds = fewer binary variables = faster solving.

1. Bound propagation

Interval arithmetic to get L, U for each neuron

↓

2. Fix stable neurons

If L ≥ 0 or U ≤ 0, no binary variable needed

↓

3. MILP for unstable neurons

Binary variable for each neuron with L < 0 < U

↓

4. Solve

Global optimum = exact bound on network output

The MILP approach gives exact bounds on the network output — no overapproximation. But it is exponential in the number of unstable neurons. For a network with 1000 unstable neurons, 2¹⁰⁰⁰ branches is intractable. In practice, modern MILP solvers use branch-and-bound with LP relaxations to prune the tree, making problems with hundreds of unstable neurons tractable.

Key insight: The MILP approach reveals a fundamental connection: neural network verification is a combinatorial optimization problem. The nonlinearity of ReLU is discrete (on or off), not smooth. This is both a curse (exponential worst case) and a blessing (powerful integer programming solvers can exploit the structure).

Why does the MILP encoding of a ReLU network introduce binary variables?

Each ReLU neuron has two modes (active/inactive), and a binary variable selects which mode applies Binary variables are needed for the input constraints MILP solvers require at least one binary variable

Chapter 9: Partitioning & Neural Nets

All the methods so far propagate a single set through the nonlinear function. There is a universal trick to tighten any overapproximation: subdivide the input.

Split the input interval [x] into smaller pieces [x]₁, [x]₂, ..., [x]_k. Compute the inclusion function on each piece. The union of the results still contains the true image, but each piece's overapproximation is tighter (because the dependency problem is less severe on narrow intervals).

f([x]) ⊆ [f̅]([x]₁) ∪ [f̅]([x]₂) ∪ … ∪ [f̅]([x]_k)

The partitioning guarantee: If the inclusion function has convergence order p, and we split each dimension into m equal parts, the total excess width is O(1/m^p). For natural interval evaluation (p=1), doubling the partitions halves the excess. For mean value form (p=2), doubling the partitions reduces excess by 4×. The cost: we must evaluate the inclusion function mⁿ times (exponential in dimension n).

Neural Network Verification

A neural network with L layers and n_l neurons per layer is a composition f = f_L ∘ f_L−1 ∘ … ∘ f₁, where each f_l is an affine map followed by a nonlinear activation. The verification question: given input set X, does the output ever enter an unsafe region?

Partitioning Tightness Explorer

See how subdividing the input interval tightens the overapproximation for f(x) = x² − x. The green band is the true image. Orange boxes are interval evaluations on each partition. More partitions = tighter, but more boxes.

Number of partitions 1

The Combinatorial Explosion

For ReLU networks, the number of possible activation patterns is 2^N where N is the total number of neurons. Each pattern corresponds to a linear region of the network. Verification requires reasoning about which patterns are reachable from the input set.

Partitioning the input helps by reducing the number of unstable neurons in each partition. If a partition is small enough that every neuron's sign is determined, the network is linear on that partition and verification is trivial (LP, not MILP). The challenge: finding the right partition granularity. Too coarse = too many unstable neurons. Too fine = too many partitions.

State of the art: Modern NN verification tools combine all the ideas in this chapter: bound propagation (interval or affine) for fast initial bounds, MILP for exact solving of hard subproblems, and adaptive partitioning (branch-and-bound on input space) to manage the combinatorial explosion. The VNN-COMP competition benchmarks show that networks with thousands of neurons can now be verified in seconds for many properties.

Method	Completeness	Scalability	Tightness
Interval arithmetic	Incomplete	O(N)	Loose
Abstract interpretation	Incomplete	O(N²)	Moderate
MILP	Complete	O(2^N) worst case	Exact
Branch & bound	Complete	Variable	Exact

What comes next: This chapter handled nonlinear continuous systems. Chapter 10 tackles discrete systems, where the state space is finite and reachability becomes a graph search problem. The tools change from calculus to combinatorics, but the core question remains: where can the system go?

"The purpose of computing is insight, not numbers."
— Richard Hamming

Why does partitioning the input interval tighten the overapproximation?

It changes the function being evaluated Narrower intervals reduce the dependency problem, so each sub-interval's overapproximation is tighter Partitioning makes the function linear

Nonlinear Reachability

Chapter 0: Why Intervals?

Chapter 1: Interval Arithmetic

The Four Basic Operations

Elementary Functions

Chapter 2: Inclusion Functions

Properties of Good Inclusion Functions

Convergence Order

Chapter 3: The Dependency Effect

A Concrete Example: x2 − x on [−1, 2]

When is the Dependency Problem Worst?

Chapter 4: The Mean Value Theorem Trick

Worked Example

Convergence Order

Chapter 5: Taylor Expansions

Second-Order Form

Worked Example

The General Pattern

The Hessian in Multiple Dimensions

Chapter 6: Taylor Models

Why Polynomials Beat Boxes

Arithmetic on Taylor Models

Composing Taylor Models

Chapter 7: Symbolic vs. Concrete

Concrete (Interval) Methods

Symbolic (Taylor Model) Methods

The Composition Tradeoff

Chapter 8: Optimization & MILP

Support Functions

ReLU Networks and MILP

Chapter 9: Partitioning & Neural Nets

Neural Network Verification

The Combinatorial Explosion

A Concrete Example: x² − x on [−1, 2]