Ch 3: Property Specification — Algorithms for Validation

Chapter 0: Metrics vs Specs

You are building a self-driving car. Your manager says "make it safe." But what does safe mean? Does it mean the average distance to other cars is large? That the car never hits anything? That it hits things less than once per million miles? Each of these is a different specification — and each leads to radically different validation strategies.

Here is the core distinction this chapter builds on. A metric is a function that maps a system trajectory to a real number. "Average speed was 32 mph." "Minimum distance to nearest pedestrian was 1.4 meters." "Total fuel consumed was 8.2 gallons." Numbers. You can compare them, rank them, plot them.

A specification is a function that maps a trajectory to a Boolean: pass or fail. "The car never exceeded 65 mph." "The drone stayed within the geofence." "The robot arm reached the target within 5 seconds." True or false. No middle ground.

The bridge between them: Every metric can become a specification by adding a threshold. "Average speed" (metric) becomes "average speed ≤ 35 mph" (spec). "Minimum clearance" (metric) becomes "minimum clearance ≥ 0.5 m" (spec). The threshold is where engineering judgment lives — the math just checks it.

Metric

Trajectory → Real number (e.g., min distance = 1.4 m)

↓ add threshold

Specification

Trajectory → Boolean (e.g., min distance ≥ 0.5 m → PASS)

↓ check over distribution of trajectories

Validation

Does the system satisfy the spec with high probability?

Why does this matter? Because the tools you need depend entirely on what kind of specification you write. A simple threshold spec ("always stay above X") can be checked with simulation. A temporal spec ("if the light turns red, the car must stop within 3 seconds") requires temporal logic. A probabilistic spec ("the failure probability is less than 10⁻⁹") requires importance sampling. Each chapter of this book targets a different kind of spec.

This chapter teaches you the language of specifications — from simple metrics and thresholds, through risk measures like VaR and CVaR, multi-objective tradeoffs, propositional and first-order logic, all the way to temporal logic and reachability. By the end, you will be able to translate any safety requirement into a mathematical object that a computer can verify.

Type	Maps to	Example
Metric	Real number	"Average deviation was 0.3 m"
Specification	Boolean	"Deviation never exceeded 1 m"
Probabilistic Spec	Boolean over distribution	"P(collision) < 10⁻⁶"

Check: What is the fundamental difference between a metric and a specification?

A metric is always better than a specification A metric maps to a real number; a specification maps to a Boolean A specification cannot use numbers

Chapter 1: Mean Isn't Enough

Suppose you are evaluating two autonomous braking systems. Both have an average stopping distance of 25 meters. Are they equally safe?

Not necessarily. System A stops between 24 and 26 meters every time — consistent, predictable. System B stops at 15 meters half the time and 35 meters the other half. Same mean, but System B occasionally stops 10 meters past where System A would. In a critical scenario, those extra 10 meters are the difference between stopping safely and a collision.

The expected value (mean) of a metric tells you the center of the distribution. But for safety, we care about the tails — the worst-case scenarios. A system that is great on average but occasionally catastrophic is more dangerous than a system that is mediocre but predictable.

E[X] = ∑_i x_i · p(x_i)

The expected value weights every outcome by its probability and sums. It treats a gain of +100 and a loss of −100 as canceling out. For safety, they do not cancel — the loss of −100 might mean someone gets hurt.

The variance captures spread: how far outcomes deviate from the mean on average.

Var(X) = E[(X − E[X])²]

System A has variance near zero (always ~25m). System B has variance of 100 (swings between 15m and 35m). But even variance is not enough — two distributions can have the same mean and variance but very different tail behavior. We need tail risk measures.

Key insight: Safety is not about average-case performance. It is about worst-case performance. A specification like "average stopping distance ≤ 30m" is almost useless for safety. What you want is "the stopping distance is ≤ 30m in at least 99.9% of scenarios." That requires reasoning about the distribution's tails, not its center.

Two Systems, Same Mean

Both distributions have mean = 25m. Adjust the spread of System B to see how tail risk changes while the mean stays fixed.

System B spread10.0

The lesson: When someone tells you "the average performance is X," your first question should be: "What does the tail look like?" In finance, this is the difference between expected return and risk. In safety, it is the difference between "works well on average" and "works well when it matters most."

Check: Two systems have the same expected stopping distance. Why might one be less safe?

Because the mean is too high for both Because it has higher variance — its worst-case outcomes are more extreme Because expected value is always wrong

Chapter 2: VaR & CVaR

We need concrete numbers that capture tail risk. Enter two workhorses from quantitative finance that have become essential tools in safety validation.

Value at Risk (VaR) at confidence level α is the α-quantile of the loss distribution. In plain language: VaR_α is the threshold such that α percent of outcomes are better than it. If VaR_0.95 for stopping distance is 32 meters, that means 95% of the time you stop in 32 meters or less. Only 5% of scenarios are worse.

VaR_α(X) = inf { x : P(X ≤ x) ≥ α }

Think of it as drawing a vertical line on the distribution so that α fraction of the area is to its left. Everything to the right is the "danger zone."

But VaR has a flaw: it tells you the boundary of the danger zone but nothing about what happens inside it. If 5% of outcomes are worse than 32m, are they 33m (annoying) or 100m (catastrophic)? VaR cannot distinguish.

Conditional Value at Risk (CVaR) fixes this. CVaR_α is the expected value of outcomes in the worst (1−α) fraction. It averages over the entire tail beyond VaR.

CVaR_α(X) = E[ X | X ≥ VaR_α(X) ]

CVaR answers the question: "Given that we are in the worst 5% of scenarios, how bad is bad?" If CVaR_0.95 is 38m, then the average stopping distance in the worst 5% of cases is 38m. This is strictly more informative than VaR alone.

VaR is the door to the danger zone. CVaR tells you what's inside. A system with VaR = 32m but CVaR = 33m has a tight, well-behaved tail. A system with VaR = 32m but CVaR = 60m has a heavy tail — when things go wrong, they go very wrong. For safety validation, CVaR is almost always the better measure.

VaR & CVaR Explorer

Three distributions with the same mean but different shapes. Drag the α slider to move the confidence level and watch VaR and CVaR markers update. Notice how the heavy-tailed distribution has a dramatically higher CVaR.

α (confidence)0.90

Measure	Question it answers	Formula
Mean	What is the typical outcome?	E[X]
VaR_α	What threshold are we safe below with probability α?	α-quantile
CVaR_α	Given we're in the worst (1−α), how bad on average?	E[X \| X ≥ VaR_α]

Practical example: An autonomous vehicle must satisfy "CVaR_0.99(time-to-collision) ≥ 3 seconds." This means: even in the worst 1% of scenarios, the average time before a potential collision is at least 3 seconds. Much stronger than requiring the mean TTC to be 3 seconds.

Check: What does CVaR_0.95 measure that VaR_0.95 does not?

The probability of the worst case The best-case outcome The average severity within the worst 5% of outcomes

Chapter 3: Pareto Tradeoffs

Real systems have multiple metrics that conflict. A self-driving car should be fast (reach the destination quickly) and safe (maintain large clearances). But going faster means less time to react. You cannot maximize both simultaneously — you must trade off.

Given two designs A and B, we say A dominates B if A is at least as good on every metric and strictly better on at least one. If A is faster and safer than B, there is no reason to ever choose B.

A design is Pareto optimal if no other design dominates it. The set of all Pareto optimal designs forms the Pareto frontier — a curve in metric space representing the best achievable tradeoffs.

x dominates y ⇔ f_i(x) ≤ f_i(y) ∀ i, and f_j(x) < f_j(y) for some j

(Here we assume lower is better for all objectives, as in minimizing cost and risk.)

Key insight: The Pareto frontier tells you the price of safety. Moving along the frontier, you can see exactly how much speed you sacrifice for each unit of additional safety. This is the quantitative version of the engineering tradeoff conversation. Without a Pareto frontier, the discussion is just opinions.

Pareto Frontier Explorer

Each dot is a system design. Red dots are dominated (some other design is better on all metrics). Teal dots are Pareto optimal. Click New Random to generate a new set. Hover to see metric values.

When you have more than two objectives, the Pareto frontier becomes a surface, then a hypersurface. Visualization gets harder, but the concept is identical: a design is Pareto optimal if you cannot improve any objective without worsening another.

One common approach is to scalarize multiple objectives into a single number using weights: f(x) = w₁f₁(x) + w₂f₂(x) + … Different weight vectors trace out different points on the Pareto frontier. But choosing the weights is a value judgment — which brings us to the next chapter.

Check: When is a design Pareto optimal?

When no other design is better on ALL metrics simultaneously When it has the best value on at least one metric When it minimizes the weighted sum of all metrics

Chapter 4: Preference Elicitation

If you have two objectives (speed and safety) and need to pick one Pareto-optimal design, you need weights: how much safety are you willing to sacrifice for speed? These weights encode a human value judgment. Preference elicitation is the process of discovering those weights by asking simple questions.

The simplest approach: show the human two designs and ask "which do you prefer?" Each answer gives us information about the weight vector. Here is why.

Consider two objectives and the scalarized cost f(x) = w₁ · speed(x) + w₂ · risk(x). If you prefer design A over design B, that means:

w₁ · speed(A) + w₂ · risk(A) < w₁ · speed(B) + w₂ · risk(B)

Rearranging:

w₁(speed(A) − speed(B)) < w₂(risk(B) − risk(A))

This is a linear inequality in the weight space (w₁, w₂). It cuts the space of possible weights in half. Each pairwise query bisects the remaining weight space. After k queries, the feasible region has been halved k times — converging exponentially fast on the human's true preferences.

Each question cuts the weight space in half. With 10 pairwise comparisons, you narrow down from the full weight simplex to roughly 1/1000th of it. This is the same binary search principle that makes 20 questions so powerful — applied to preference learning.

Preference Elicitation

The weight space is the line from (1,0) to (0,1). Each time you choose a preferred design, a constraint eliminates half the remaining weights. Watch the feasible region shrink.

Query 1 of 8

In practice, preference elicitation is used to align validation criteria with stakeholder values. A medical device company might care more about safety than speed; a racing team might accept higher risk for better lap times. The weights are not right or wrong — they reflect priorities. Elicitation makes those priorities explicit and quantifiable.

Check: How does a single pairwise preference comparison constrain the weight vector?

It eliminates all but two possible weight vectors It defines a linear inequality that cuts the feasible weight space in half It sets one weight to zero

Chapter 5: Propositional Logic

Threshold-based specs ("distance ≥ 0.5m") work for single metrics. But safety requirements often combine multiple conditions: "the car is stopped AND the light is red" or "if the sensor fails, THEN switch to backup mode." We need a formal language for combining Boolean conditions. That language is propositional logic.

Propositional logic has a small set of connectives:

Symbol	Name	Meaning	Example
¬	NOT	Negation	¬collision = "no collision"
∧	AND	Both true	stopped ∧ red_light
∨	OR	At least one true	brake ∨ steer
⇒	IMPLIES	If...then...	sensor_fail ⇒ backup_on

The implication ⇒ deserves special attention because it is the workhorse of safety specs. "A ⇒ B" means "whenever A is true, B must also be true." It is only violated when A is true but B is false. If A is false, the implication is satisfied regardless of B (you cannot violate a rule that does not apply).

Implication truth table: "If it rains, I carry an umbrella." This is violated only on rainy days without an umbrella. Sunny days with or without an umbrella both satisfy the rule. The condition (rain) must be present for the guarantee (umbrella) to be tested.

Interactive Truth Table

Toggle inputs A and B. The grid shows the output of each connective. Green = True, Red = False.

A = True, B = True

In validation, propositional logic lets you combine simple checks into complex specifications. "The system is safe" might mean: (¬collision) ∧ (speed ≤ limit) ∧ (sensor_fail ⇒ backup_active). Each atomic proposition is a threshold check; propositional logic glues them together.

Key insight: Propositional logic is atemporal — it describes conditions at a single point in time. "The car is stopped AND the light is red" checks a snapshot. But safety often involves sequences: "IF the light turns red, THEN the car must stop within 3 seconds." That requires temporal logic (Chapter 7).

Check: When is the implication A ⇒ B violated?

When A is true and B is false When A is false and B is true Whenever A and B have different truth values

Chapter 6: First-Order Logic

Propositional logic lets us combine conditions, but it cannot express statements about all or some elements of a set. "The agent never hits any obstacle" is not a single proposition — it is a statement about every obstacle at every time step. We need quantifiers.

First-order logic extends propositional logic with two quantifiers:

Symbol	Name	Reads as	Example
∀	Universal	"for all"	∀ obstacles o: distance(agent, o) > 0.5m
∃	Existential	"there exists"	∃ path p: cost(p) < budget

A predicate is a function that takes arguments and returns true or false. "collision(agent, obstacle)" is a predicate — it depends on which agent and which obstacle. Quantifiers bind variables: ∀o means "for every object o in the domain."

Translating English to logic:
"The agent never hits an obstacle" → ∀t ∀o: ¬collision(agent, o, t)
"Some path reaches the goal" → ∃p: reaches_goal(p)
"Every sensor has a backup" → ∀s: has_backup(s)
"If any sensor fails, alert the operator" → ∀s: fails(s) ⇒ alert_sent()

First-order logic is the backbone of formal verification and specification languages. When you write "the system shall ensure that no pedestrian is within 2 meters of the vehicle at any time," you are writing a first-order specification: ∀t ∀p ∈ Pedestrians: dist(vehicle(t), p(t)) ≥ 2m.

The challenge is that first-order specifications over continuous domains (all times, all positions) require either discretization or specialized mathematical tools like reachability analysis (Chapter 9) to verify.

Negating quantifiers: ¬(∀x: P(x)) is equivalent to ∃x: ¬P(x). To disprove "every trajectory is safe," you only need to find ONE unsafe trajectory. This is the fundamental insight behind falsification (Chapter 4) — finding a counterexample is often much easier than proving universal safety.

Check: How do you disprove the statement "∀x: P(x)" (P is true for all x)?

Show that P is false for all x Find a single x where P(x) is false (a counterexample) Show that P is false for more than half of all x

Chapter 7: Temporal Logic

So far, our specifications describe properties at a single instant or over all instants using ∀t. But many safety requirements have a temporal structure: "if the obstacle appears, the car must brake within 2 seconds." "The robot must eventually reach its goal." "The system must remain stable until it enters safe mode." These require temporal logic.

Linear Temporal Logic (LTL) adds four key operators to propositional logic:

Operator	Symbol	Meaning	Example
Always	□ (G)	True at every future time	□ safe = "always safe"
Eventually	◇ (F)	True at some future time	◇ goal = "eventually reach goal"
Next	○ (X)	True at the next time step	○ moving = "moving at t+1"
Until	U	A holds until B becomes true	slow U stopped = "stay slow until stopped"

These compose naturally: □(obstacle_near ⇒ ◇_≤3s braking) means "it is always the case that if an obstacle is near, then braking occurs within 3 seconds." This is a precise, machine-checkable version of a safety requirement that would otherwise be ambiguous English.

Temporal operators are about the future. □φ says "φ holds now and at every future moment." ◇φ says "φ holds at some future moment." φ U ψ says "φ holds continuously until ψ becomes true." These three capture almost every safety requirement you will encounter: "always avoid X," "eventually achieve Y," "maintain Z until W."

Temporal Logic on a Signal

The signal shows a vehicle's distance to an obstacle over time. The spec is □(distance ≥ threshold). Green regions satisfy the spec; red regions violate it. Drag the threshold to see how the satisfaction changes.

Threshold2.0

For continuous-time systems, Signal Temporal Logic (STL) extends LTL with real-valued time bounds: □_[0,T]φ means "φ holds for the entire interval [0, T]." ◇_[a,b]φ means "φ holds at some time between a and b." This makes specifications precise about timing — "brake within 3 seconds," not just "eventually brake."

LTL vs STL: LTL operates on discrete sequences (step 0, step 1, ...). STL operates on continuous signals with real-valued time bounds. For validation of physical systems (vehicles, robots), STL is usually the right choice because timing constraints are measured in seconds, not steps.

Check: What does the temporal formula □(obstacle_near ⇒ ◇_≤3s braking) specify?

The system brakes exactly every 3 seconds At all times, if an obstacle is near, braking must occur within 3 seconds The obstacle must be near for at least 3 seconds before braking

Chapter 8: Signal Robustness

Temporal logic gives us a Boolean answer: the specification is satisfied or violated. But by how much? A car that maintains 2.01m clearance barely satisfies "distance ≥ 2m." A car that maintains 5m clearance satisfies it with ample margin. We need a quantitative measure of satisfaction.

STL robustness assigns a real number to each formula evaluated on a signal. Positive means satisfied; negative means violated. The magnitude indicates the margin.

For an atomic predicate f(t) ≥ c (where f is a signal), the robustness at time t is simply:

ρ(f(t) ≥ c) = f(t) − c

Positive when satisfied, negative when violated, zero at the boundary. The robustness of compound formulas follows from the Boolean connectives:

ρ(φ₁ ∧ φ₂) = min(ρ(φ₁), ρ(φ₂))

ρ(φ₁ ∨ φ₂) = max(ρ(φ₁), ρ(φ₂))

ρ(□_[a,b] φ) = min_t∈[a,b] ρ(φ, t)

ρ(◇_[a,b] φ) = max_t∈[a,b] ρ(φ, t)

The pattern is natural: AND requires both to be satisfied, so the robustness is the minimum (weakest link). OR requires at least one, so it is the maximum. Always requires satisfaction at every time, so minimum over time. Eventually requires satisfaction at some time, so maximum over time.

Key insight: Robustness is sound: if ρ > 0, the Boolean formula is true. If ρ < 0, it is false. But robustness tells you more — it tells you the margin of safety. This makes it useful as an objective function for optimization. Minimizing robustness is the core idea of falsification (Chapter 4).

There is a subtlety: the min/max operations in robustness are not differentiable. For gradient-based optimization, we use smooth robustness, replacing min/max with softmin/softmax:

smooth_min(a, b) ≈ −(1/β) · ln(e^−βa + e^−βb)

As β → ∞, this approaches the true min. The parameter β controls the smoothness-accuracy tradeoff. Larger β is more accurate but has sharper gradients; smaller β is smoother but less precise.

Robustness Visualization

A signal f(t) is checked against the spec □(f(t) ≥ threshold). The robustness at each time step is f(t) − threshold. The overall robustness is the minimum over all time steps. Drag the threshold to see how robustness changes.

Threshold c2.5

Smooth β5.0

Why smooth robustness matters: Falsification (Chapters 4–5) works by optimizing over disturbance sequences to find failures. If the objective function (robustness) has flat regions or sharp corners, optimization struggles. Smooth robustness provides gradients everywhere, enabling efficient gradient-based search for failures. The cost is a small approximation error controlled by β.

Check: What is the robustness of □_[0,T](f(t) ≥ c) when the signal barely dips below c at one time step?

A small negative number (the min over all time steps of f(t) − c) Zero, because most of the signal satisfies the spec A large positive number, because the average is above c

Chapter 9: Reachability Specs

Temporal logic formulas can be checked by transforming them into an automaton — a state machine that reads the system's trace symbol by symbol and either accepts or rejects it.

A Buchi automaton is a finite automaton over infinite words. It accepts a trace if the automaton visits at least one accepting state infinitely often. Every LTL formula can be converted to a Buchi automaton (using algorithms like LTL2BA).

LTL Formula

□(obstacle_near ⇒ ◇ braking)

↓ LTL-to-Buchi conversion

Buchi Automaton

A finite state machine with acceptance condition

↓ synchronous product with system model

Product System

Combined system + spec state machine

↓ check reachability of accepting states

Verification Result

Does the product system reach accepting states infinitely often?

The product system combines the system model and the Buchi automaton. Its state is a pair (system_state, automaton_state). The system evolves by its own dynamics; the automaton state transitions based on which atomic propositions are true. If the product system can reach an accepting state, the specification is satisfied (for that trace).

This reduction transforms temporal logic verification into graph reachability — a well-studied problem with efficient algorithms. For finite-state systems, model checking algorithms (like SPIN) can exhaustively verify all traces. For continuous systems, approximate methods are needed (Chapters 8–10 of the book).

The practical punchline: Any LTL spec can be converted to a reachability question. Reachability is the fundamental primitive of formal verification. Chapters 8–10 of the book develop increasingly powerful reachability algorithms for linear, nonlinear, and discrete systems. This chapter gave you the language; those chapters will give you the algorithms.

LTL-to-Automaton

Choose an LTL formula and see its corresponding automaton. The automaton reads the system trace and transitions between states. Double-circled states are accepting.

Check: How does LTL verification reduce to reachability?

The LTL formula is converted to a Buchi automaton, composed with the system, and reachability of accepting states is checked The system states are checked one at a time Reachability is only used for propositional logic

Chapter 10: Summary

This chapter gave you the language of validation. Before you can verify a system, you must state precisely what it should do. That statement is a specification, and the tools range from simple thresholds to temporal logic formulas with quantitative robustness.

Tool	What it captures	When to use it
Metric + threshold	Single numeric property	"Stopping distance ≤ 30m"
VaR / CVaR	Tail risk	"Even in the worst 1%, clearance ≥ X"
Pareto frontier	Multi-objective tradeoffs	Balancing speed, safety, comfort
Propositional logic	Boolean combinations at one time	"stopped AND light_red"
First-order logic	Statements about all/some elements	"No obstacle within 2m"
LTL / STL	Temporal sequences	"Always eventually reach goal"
Robustness	Margin of satisfaction	Optimization objective for falsification
Buchi automata	Reduce to reachability	Formal verification of temporal specs

The hierarchy: Simple metrics (Chapter 1) → risk measures (Chapter 2) → multi-objective (Chapter 3) → logic combinations (Chapters 5–6) → temporal properties (Chapter 7) → quantitative robustness (Chapter 8) → automata-based verification (Chapter 9). Each level adds expressive power. Start with the simplest spec that captures your requirement — unnecessary complexity obscures more than it clarifies.

What comes next:

Now that you can write specifications, how do you check them? Chapter 4 tackles falsification through optimization — finding failures by treating validation as a minimization problem over the robustness of your spec.

The big picture:

Specifications are contracts between the system and the world. The rest of the book provides tools to audit those contracts: falsification (find violations), probability estimation (how likely are violations), and reachability (prove no violations exist).

"If you can't state it precisely, you can't verify it."
— The central lesson of formal methods

Check: Why is STL robustness more useful than Boolean satisfaction for falsification?

Because Boolean is always wrong Because robustness provides a continuous, differentiable objective for optimization Because Boolean satisfaction takes too long to compute