Staff-level onsite interview prep: test design, automation, reliability, incident response, safety, and debugging for robotics systems.
It's 3am. A warehouse robot just dropped a package onto the floor. The incident pager fires. You pull up the telemetry dashboard: perception reported a confident grip, the planner issued a place command, the controller tracked the trajectory within tolerance — but the gripper opened 200ms early. Was it a software bug? A sensor miscalibration? A hardware failure? A race condition between the controller and the gripper driver?
This is what a Software Test & Reliability Engineer at a robotics company does. You don't just write unit tests. You own the entire confidence chain from code commit to robot deployment — making sure every layer of the stack behaves correctly, fails gracefully, and recovers autonomously.
Your week splits roughly into four buckets:
| Activity | % of Time | What It Looks Like |
|---|---|---|
| Test infrastructure | ~30% | Writing and maintaining test frameworks, CI/CD pipelines, flaky test triage, test environment provisioning |
| Test design & execution | ~25% | Designing test cases for new features, running integration/HIL tests, reviewing test results |
| Reliability engineering | ~25% | SLO monitoring, error budget tracking, incident response, post-mortems, chaos experiments |
| Collaboration | ~20% | Code reviews (test coverage), release sign-off, cross-team debugging, documentation |
Web developers know the test pyramid: lots of unit tests, fewer integration tests, even fewer E2E tests. Robotics adds two more layers that don't exist in web: simulation testing (running the software against a physics engine) and hardware-in-the-loop (HIL) testing (running against the physical robot). Each layer is slower, more expensive, and catches a different class of bug.
A warehouse robot system like Rhoda AI's DVA has five major layers. Each needs different kinds of tests.
Web apps are largely deterministic: the same input produces the same output. You can test in isolation, mock dependencies, and replay requests. Robotics breaks each of the assumptions that web testing relies on:
| Web Testing | Robotics Testing |
|---|---|
| Deterministic outputs | Stochastic outputs (ML models, sensor noise) |
| Fast feedback (<1 min) | Slow feedback (minutes to hours for HIL) |
| Cheap retries (hit endpoint again) | Expensive retries (reset physical scene) |
| No physical danger | Safety-critical (25kg arm can injure humans) |
| Stateless or easy rollback | Physical state is irreversible (dropped item stays dropped) |
| Mock everything | Physics can't be fully mocked |
| CI runs on commodity hardware | HIL requires dedicated robot rigs |
Before you test robots, you need to test software. Every interview for a test engineering role will probe whether you understand the core techniques that generate good test cases — not random poking, but systematic methods that maximize the chance of finding bugs with the fewest tests.
There are four foundational test design techniques that you should be able to explain and apply on a whiteboard: boundary value analysis, equivalence partitioning, decision tables, and state transition testing. Let's work through each one with a concrete robotics example.
Most bugs cluster at the edges of valid input ranges. Boundary value analysis is the technique of testing at and around these edges rather than in the comfortable middle. For any input parameter with a valid range [min, max], you test at: min-1 (just below), min (lower boundary), min+1 (just above lower), a nominal value in the middle, max-1 (just below upper), max (upper boundary), max+1 (just above). Here "1" means one step at the parameter's resolution, not literally the integer 1; for a force measured to 0.1N, the step is 0.1N.
if (angle < 180) vs if (angle <= 180) looks like one character. But for a robot arm, the difference is "joint operates within limits" vs. "joint hits mechanical stop and strips a gear." BVA exists because humans consistently get boundary conditions wrong.

Worked example: Robot gripper force controller. The gripper can apply force in the range [0.5N, 50N]. Below 0.5N, the grip is unreliable and the object slips. Above 50N, the gripper motor stalls and can damage the mechanism. What test cases does BVA give us?
| Test Value | Category | Expected Behavior |
|---|---|---|
| 0.4N | Below minimum | System rejects command, returns error: "Force below minimum threshold" |
| 0.5N | Lower boundary | Grip engaged at minimum force, object held but weakly |
| 0.6N | Just above minimum | Normal grip, minimal force applied |
| 25.0N | Nominal (middle) | Normal operation, comfortable margin |
| 49.9N | Just below maximum | Strong grip, within safe range |
| 50.0N | Upper boundary | Maximum grip, motor at rated capacity |
| 50.1N | Above maximum | System rejects command with an error, or clamps to 50N with a warning; the spec must pick one so the test has a single expected result |
That's seven test cases from a single parameter. For a function with multiple bounded parameters, you combine BVA values — but intelligently: test boundaries of one parameter while holding others at nominal values.
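The seven-value recipe is mechanical enough to generate. The following is a minimal sketch, assuming a hypothetical helper named `bva_values` and a step equal to the parameter's resolution (0.1N for the gripper); the nominal value is taken as the midpoint of the range, so it comes out as 25.25N rather than the table's rounder 25.0N.

```python
def bva_values(lo, hi, step):
    """Return the 7 canonical BVA cases for the closed range [lo, hi].

    `step` is the smallest meaningful increment of the parameter
    (assumed here: 0.1 N for the gripper force example).
    Each case is (label, value, should_be_accepted).
    """
    raw = [
        ("below minimum", lo - step, False),
        ("lower boundary", lo, True),
        ("just above minimum", lo + step, True),
        ("nominal", (lo + hi) / 2, True),
        ("just below maximum", hi - step, True),
        ("upper boundary", hi, True),
        ("above maximum", hi + step, False),
    ]
    # Round to suppress floating-point noise such as 0.5 - 0.1 = 0.39999...
    return [(label, round(value, 6), ok) for label, value, ok in raw]


cases = bva_values(0.5, 50.0, 0.1)
for label, value, ok in cases:
    print(f"{value:>6.2f} N  {label:<20} {'accept' if ok else 'reject'}")
```

The returned list can be fed straight into a parametrized test, which keeps the boundary values and the limits in one place when the spec changes.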
Equivalence partitioning divides the input space into classes where the system should behave identically. You test one representative from each class rather than exhaustively testing every possible input. The logic: if the system handles one value from a partition correctly, it should handle all values from that partition correctly (because they follow the same code path).
Worked example: Object classification for gripper strategy. The robot must select a grip strategy based on the object type detected by perception:
| Equivalence Class | Objects | Grip Strategy | Representative Test |
|---|---|---|---|
| Rigid, small | Screws, bolts, USB drives | Precision pinch, 5N | Test with M8 bolt |
| Rigid, large | Boxes, bottles, cans | Power grasp, 20N | Test with cereal box |
| Deformable | Bags, garments, foam | Enveloping grasp, 8N | Test with t-shirt |
| Fragile | Glass, eggs, electronics | Force-limited pinch, 3N | Test with wine glass |
| Unknown/unclassified | Novel objects | Default cautious grasp, 5N | Test with arbitrary object |
Five equivalence classes, five test cases. Without EP, you'd be testing "every kind of box, every kind of bottle, every kind of bag" — thousands of tests that exercise the exact same code path. EP tells you that's wasteful.
When a system's behavior depends on combinations of conditions, a decision table maps every combination to its expected action. This catches cases where individual conditions pass but combinations fail — the classic "it works fine unless the sensor is noisy AND the object is at the workspace boundary AND the gripper is warm."
Worked example: The robot decides whether to attempt a pick based on three conditions: grip confidence (≥ 80%?), workspace clearance (safe?), and battery level (> 20%?).
| Confidence ≥ 80% | Clearance Safe | Battery > 20% | Action |
|---|---|---|---|
| Y | Y | Y | Execute pick |
| Y | Y | N | Return to charge station |
| Y | N | Y | Reposition, then retry |
| Y | N | N | Return to charge station |
| N | Y | Y | Request better viewpoint, retry detection |
| N | Y | N | Return to charge station |
| N | N | Y | Request better viewpoint AND reposition |
| N | N | N | Return to charge station, alert operator |
Three binary conditions produce 2³ = 8 combinations, and each row is one test case. Notice how row 7 (low confidence + unsafe clearance + good battery) requires TWO recovery actions. This combination is easy to miss without a decision table.
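A decision table translates directly into a table-driven test. The sketch below assumes a hypothetical `decide` function and made-up action names; the expected values are a transcription of the eight rows above, and the completeness check fails if a row is ever dropped.

```python
from itertools import product


def decide(confident, clear, battery_ok):
    """Hypothetical implementation under test: conditions -> action."""
    if not battery_ok:
        # Low battery dominates every other condition.
        if not confident and not clear:
            return "return_to_charge_and_alert"
        return "return_to_charge"
    if confident and clear:
        return "execute_pick"
    if confident:
        return "reposition_then_retry"
    if clear:
        return "request_better_viewpoint"
    return "request_better_viewpoint_and_reposition"


# Keys are (confidence >= 80%, clearance safe, battery > 20%).
EXPECTED = {
    (True, True, True): "execute_pick",
    (True, True, False): "return_to_charge",
    (True, False, True): "reposition_then_retry",
    (True, False, False): "return_to_charge",
    (False, True, True): "request_better_viewpoint",
    (False, True, False): "return_to_charge",
    (False, False, True): "request_better_viewpoint_and_reposition",
    (False, False, False): "return_to_charge_and_alert",
}

# Completeness: the table must cover all 2**3 combinations.
assert set(EXPECTED) == set(product([True, False], repeat=3))

mismatches = [c for c, action in EXPECTED.items() if decide(*c) != action]
print(f"{len(EXPECTED)} rows checked, {len(mismatches)} mismatches")
```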
Robots are stateful systems. The gripper can be OPEN, CLOSING, GRIPPING, OPENING. The robot can be IDLE, PICKING, PLACING, ERROR, E-STOPPED. State transition testing maps every valid state and transition, then tests: (a) every valid transition works, (b) every invalid transition is rejected, and (c) the system handles unexpected events in every state.
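As a rough sketch of checks (a) and (b) for the gripper, the transition table below is an assumption made for illustration: the event names (`close_cmd`, `contact`, `open_cmd`, `opened`) and the set of allowed transitions are invented, not taken from a real driver.

```python
from enum import Enum
from itertools import product


class Gripper(Enum):
    OPEN = "open"
    CLOSING = "closing"
    GRIPPING = "gripping"
    OPENING = "opening"


# Assumed transition table: (state, event) -> next state.
TRANSITIONS = {
    (Gripper.OPEN, "close_cmd"): Gripper.CLOSING,
    (Gripper.CLOSING, "contact"): Gripper.GRIPPING,
    (Gripper.CLOSING, "open_cmd"): Gripper.OPENING,  # abort a grasp
    (Gripper.GRIPPING, "open_cmd"): Gripper.OPENING,
    (Gripper.OPENING, "opened"): Gripper.OPEN,
}
EVENTS = sorted({event for _, event in TRANSITIONS})


class InvalidTransition(Exception):
    pass


def step(state, event):
    """Advance the state machine or reject the event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} not allowed in {state.name}") from None


# (a) every valid transition lands in the documented target state
for (state, event), target in TRANSITIONS.items():
    assert step(state, event) is target

# (b) every other (state, event) pair is rejected
rejected = 0
for state, event in product(Gripper, EVENTS):
    if (state, event) not in TRANSITIONS:
        try:
            step(state, event)
        except InvalidTransition:
            rejected += 1
        else:
            raise AssertionError(f"{event} accepted in {state.name}")

print(f"{len(TRANSITIONS)} valid transitions, {rejected} invalid pairs rejected")
```

Enumerating the full state-by-event grid is what makes this systematic: a new state or event automatically adds rows that must be either allowed or rejected.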
Here's how these techniques translate into actual test code. A well-structured robotics test suite uses pytest fixtures to manage the complex setup/teardown that robot testing requires:
```python
import pytest

# Assumption: these names come from the project under test;
# the module path is illustrative.
from robot.gripper import ForceOutOfRangeError, GripperController, GripStrategy


# --- Fixtures: reusable setup for robot tests ---

@pytest.fixture
def gripper_controller():
    """Create a gripper controller with safe defaults."""
    ctrl = GripperController(
        min_force=0.5,
        max_force=50.0,
        stall_timeout_ms=500,
    )
    ctrl.initialize()
    yield ctrl
    ctrl.release()   # always release gripper in teardown
    ctrl.shutdown()


# --- BVA tests: boundaries of force parameter ---

@pytest.mark.parametrize("force,should_accept", [
    (0.4, False),   # below minimum
    (0.5, True),    # lower boundary
    (0.6, True),    # just above minimum
    (25.0, True),   # nominal
    (49.9, True),   # just below maximum
    (50.0, True),   # upper boundary
    (50.1, False),  # above maximum
])
def test_gripper_force_boundaries(gripper_controller, force, should_accept):
    """BVA: test at and around force limits."""
    if should_accept:
        result = gripper_controller.set_force(force)
        assert result.success
        assert abs(result.actual_force - force) < 0.1
    else:
        with pytest.raises(ForceOutOfRangeError):
            gripper_controller.set_force(force)


# --- EP tests: one representative per object class ---

@pytest.mark.parametrize("obj_class,expected_strategy", [
    ("rigid_small", GripStrategy.PRECISION_PINCH),
    ("rigid_large", GripStrategy.POWER_GRASP),
    ("deformable", GripStrategy.ENVELOPING),
    ("fragile", GripStrategy.FORCE_LIMITED),
    ("unknown", GripStrategy.CAUTIOUS_DEFAULT),
])
def test_grip_strategy_selection(gripper_controller, obj_class, expected_strategy):
    """EP: one test per equivalence class of object types."""
    strategy = gripper_controller.select_strategy(obj_class)
    assert strategy == expected_strategy
```
Off-by-one boundary errors. The code says if force < 50 when it should say if force <= 50. The gripper rejects 50.0N, its own rated maximum. In a warehouse running 1000 picks/day, this bug triggers dozens of times because the planner sometimes requests exactly the boundary value.

Missing equivalence classes. The team tested rigid and deformable objects but forgot the "unknown/unclassified" class. When perception encounters a novel object it can't classify, the grip strategy function returns None and the robot freezes. Always include the "else" case as its own equivalence class.
The test oracle problem. For deterministic functions, the oracle is easy: "given this force, expect this result." For non-deterministic systems (ML models, sensor readings), there's no single correct answer. The oracle becomes statistical: "over 100 trials, the success rate should be above 85%." This makes individual test assertions weaker — you need aggregate metrics, not single-run pass/fail.
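A statistical oracle can be sketched as follows. `simulated_pick` is a hypothetical stand-in for one attempt on a rig or in simulation, with an assumed true success rate of 0.97; the seeded random generator is there so that a failing run can be replayed exactly.

```python
import random


def simulated_pick(rng, true_rate=0.97):
    """Stand-in for one pick attempt (hypothetical, assumed 97% reliable)."""
    return rng.random() < true_rate


def success_rate(attempt, n_trials, seed):
    """Run n_trials attempts with a seeded RNG and return the observed rate."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_trials) if attempt(rng))
    return successes / n_trials


def meets_oracle(rate, threshold=0.85):
    """Aggregate oracle: pass when the observed rate meets the threshold."""
    return rate >= threshold


rate = success_rate(simulated_pick, n_trials=100, seed=0)
print(f"observed success rate over 100 trials: {rate:.2f}")
print("PASS" if meets_oracle(rate) else "FAIL")
```

With only 100 trials the observed rate carries several percentage points of sampling noise, so the threshold needs margin below the true rate; a system whose real success rate sits right at 85% would fail this check roughly half the time.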
State pollution. A test sets the gripper to GRIPPING state but crashes before teardown. The next test starts with the gripper in an unexpected state and fails — not because of a bug, but because of leftover state from a prior test. This is why pytest fixtures with yield-based teardown are essential: the teardown runs even if the test throws an exception.
Knowing test design techniques is necessary but not sufficient. The interviewer will also ask: "How would you structure the test automation for a robotics project?" This is a system design question. It tests whether you can architect a CI/CD pipeline that is fast enough to not block developers, comprehensive enough to catch real bugs, and reliable enough that engineers trust the results.
The testing pyramid is not a suggestion — it's an economic argument. Here's the math for a robotics company:
| Layer | Count | Runtime | Cost per Run | Runs Where | Bugs Caught |
|---|---|---|---|---|---|
| Unit | ~5000 | 90 sec | ~$0.02 (CPU) | Every PR | Logic errors, math bugs, config issues |
| Integration | ~500 | 5-10 min | ~$0.50 (GPU) | Every PR | Interface mismatches, contract violations |
| Simulation | ~100 | 30-60 min | ~$5 (GPU cluster) | Nightly | Task failures, planning bugs, physics edge cases |
| HIL | ~20 | 2-4 hrs | ~$200 (robot time) | Weekly + release | Hardware integration, timing, calibration drift |
| Field | ~5 | 8+ hrs | ~$2000 (site visit) | Pre-deployment | Real-world edge cases, endurance issues |
A single HIL test run costs 10,000x more than a unit test run. If you can catch a bug in a unit test instead of HIL, you save the company $200 and two hours of robot time. This is why the pyramid shape matters: invest heavily in the cheap, fast layers.
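The ratio is easy to verify from the table. This short sketch keeps costs in integer cents to avoid floating-point rounding; the figures are the per-run costs listed above.

```python
# Cost per run in cents, copied from the table above.
COST_CENTS = {
    "unit": 2,
    "integration": 50,
    "simulation": 500,
    "hil": 20_000,
    "field": 200_000,
}

# How many unit-test runs one run of each layer would pay for.
ratio_to_unit = {
    layer: cents // COST_CENTS["unit"] for layer, cents in COST_CENTS.items()
}

for layer, ratio in ratio_to_unit.items():
    print(f"{layer:<12} {ratio:>8,}x the cost of a unit run")
```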
A well-architected robotics test suite uses several design patterns that interviewers look for:
Fixtures and factories. A fixture sets up the test environment (connect to robot, load model, configure sensors) and tears it down afterwards. A factory generates test data (random valid poses, synthetic sensor readings, edge-case scenarios). In pytest, fixtures live in conftest.py and are inherited by all tests in subdirectories.
The conftest.py hierarchy. In a robotics project, you have multiple levels of conftest.py files, each providing fixtures appropriate to that test layer:
Project structure:

```
tests/
    conftest.py                          # root: logging, test IDs, common utils
    unit/
        conftest.py                      # unit: mock sensors, in-memory configs
        test_kinematics.py
        test_path_planner.py
    integration/
        conftest.py                      # integration: real service clients, docker
        test_perception_pipeline.py
        test_planning_service.py
    simulation/
        conftest.py                      # sim: MuJoCo env, domain randomization
        test_pick_task.py
        test_place_task.py
    hil/
        conftest.py                      # hil: real robot connection, safety checks
        test_canonical_episodes.py
```
Page objects for robot UIs. If your robot has a monitoring dashboard or operator interface, use the page object pattern: encapsulate UI element selectors and interactions into reusable classes. When the dashboard layout changes, you update one page object — not 50 test files.
```python
# tests/conftest.py: root-level fixtures for all test layers
import logging
import subprocess
import uuid
from datetime import datetime, timezone

import pytest


def _get_git_sha():
    """Return the current commit SHA, or "unknown" outside a git checkout."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


@pytest.fixture(autouse=True)
def test_id(request):
    """Assign a unique ID to every test for traceability."""
    tid = str(uuid.uuid4())[:8]
    request.node.test_id = tid
    logging.info(f"[{tid}] START {request.node.name}")
    yield
    logging.info(f"[{tid}] END {request.node.name}")


@pytest.fixture(scope="session")
def test_run_metadata():
    """Session-wide metadata for test reporting."""
    return {
        "run_id": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "git_sha": _get_git_sha(),
    }


# tests/unit/conftest.py: unit test fixtures (fast, no hardware)
import numpy as np
import pytest


@pytest.fixture
def mock_camera():
    """Fake camera that returns synthetic RGB frames."""
    class MockCamera:
        def capture(self):
            # 640x480 RGB frame of random noise (upper bound is exclusive)
            return np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

        def capture_with_object(self, obj_bbox):
            # Frame with a synthetic object at the given bounding box
            frame = np.zeros((480, 640, 3), dtype=np.uint8)
            x, y, w, h = obj_bbox
            frame[y:y+h, x:x+w] = [200, 150, 100]  # brown box
            return frame

    return MockCamera()


@pytest.fixture
def mock_joint_state():
    """Factory for generating valid joint state vectors."""
    def make(n_joints=6, noise_std=0.01):
        # Random joint angles within safe range [-pi, pi]
        angles = np.random.uniform(-np.pi, np.pi, n_joints)
        # Add Gaussian noise to simulate encoder uncertainty
        angles += np.random.normal(0, noise_std, n_joints)
        # Noise can push a value past the limit, so clip back into range
        return np.clip(angles, -np.pi, np.pi)
    return make


@pytest.fixture
def sim_environment():
    """Lightweight mock environment for unit-testing planners."""
    class MockEnv:
        def __init__(self):
            self.objects = []
            self.robot_pose = np.zeros(6)

        def add_object(self, name, pose, size):
            self.objects.append({"name": name, "pose": pose, "size": size})

        def check_collision(self, trajectory):
            # Simplified collision check: any waypoint within 0.05m of object
            for wp in trajectory:
                for obj in self.objects:
                    dist = np.linalg.norm(wp[:3] - obj["pose"][:3])
                    if dist < 0.05:
                        return True
            return False

    env = MockEnv()
    yield env
```
Flaky tests from timing. An integration test checks that Service A responds within 200ms. It passes 95% of the time but fails when the CI runner is under load. The test isn't wrong — it revealed a real timing dependency. But it needs to be fixed: either increase the timeout with a safety margin (test at 500ms, assert production SLO at 200ms), or make the assertion retry-aware.
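One way to express the "test at 500ms, track the SLO at 200ms" idea is a three-way classification instead of a single hard assertion. This is a sketch only; the names `timed_call` and `check_latency` and the fake service are hypothetical.

```python
import time

SLO_MS = 200.0       # production target, tracked by monitoring
CI_LIMIT_MS = 500.0  # looser hard limit that tolerates a loaded CI runner


def timed_call(fn):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def check_latency(latency_ms):
    """Classify a measurement: hard failure, SLO warning, or ok."""
    if latency_ms > CI_LIMIT_MS:
        return "fail"         # blocks the merge
    if latency_ms > SLO_MS:
        return "slo_warning"  # recorded and trended, does not block
    return "ok"


def fake_service():
    """Stand-in for Service A (hypothetical)."""
    time.sleep(0.01)
    return "pong"


reply, latency_ms = timed_call(fake_service)
print(f"{reply} in {latency_ms:.1f} ms -> {check_latency(latency_ms)}")
```

The warning band keeps the signal that the timing dependency exists without letting runner load decide whether a PR can merge.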
Test pollution from shared state. Two simulation tests run in the same MuJoCo environment instance for speed. Test A assumes the box is at the start position. Test B moves the box. When tests run in alphabetical order, everything passes. When the framework randomizes order and B happens to run first, Test A fails. Fix: each test must fully reset the environment state, or each test gets its own environment instance.
Slow tests blocking CI. Your simulation test suite takes 45 minutes. Engineers start merging PRs without waiting for CI. Now you have untested code in main. The fix is architectural: split the suite into "fast gate" (unit + integration, <5 min, blocks merge) and "slow validation" (simulation + HIL, runs async, blocks deployment).
Unit tests verify that each component works in isolation. Integration tests verify that components work together. In a robotics system, the "together" part is where most production bugs live — not inside individual services, but at the seams between them.
Consider the data flow in a warehouse robot: the perception service detects objects and outputs bounding boxes. The planning service consumes those bounding boxes and produces a trajectory. The control service consumes the trajectory and produces joint commands. Each interface is a contract. When one side changes the contract without telling the other, things break in production.
A contract test verifies that the data one service produces matches the schema and semantics that the consuming service expects. It's a bilateral agreement: the producer promises to always send data in this format, the consumer promises to only depend on fields in this format.
Why is this better than end-to-end tests? Because contract tests are fast (no need to spin up the full pipeline), targeted (they test one interface, not everything), and they tell you exactly where the break is ("perception changed the bounding box format").
The perception service outputs a detection result. The planning service consumes it. Here's what the contract looks like:
```python
# contracts/perception_output.py: the agreed-upon schema
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoundingBox:
    """Format: [x_center, y_center, width, height] in meters."""
    x: float  # center x in robot frame (meters)
    y: float  # center y in robot frame (meters)
    w: float  # width (meters)
    h: float  # height (meters)


@dataclass
class Detection:
    object_id: str
    label: str         # "box", "bottle", "garment", ...
    confidence: float  # [0.0, 1.0]
    bbox: BoundingBox
    grasp_points: Optional[List[tuple]] = None  # optional grasp candidates


@dataclass
class PerceptionResult:
    timestamp_ns: int
    frame_id: str
    detections: List[Detection]
    latency_ms: float  # how long perception took
```
Now here's the contract test — it runs in the perception team's CI to verify they still honor the contract, AND in the planning team's CI to verify they can still parse the contract:
```python
# tests/integration/test_perception_planning_contract.py
import pytest

from contracts.perception_output import PerceptionResult, Detection, BoundingBox
# Assumption: the planning service exposes a client object like this;
# the module path is illustrative.
from planning.service import planning_service


def make_valid_perception_result():
    """Factory for a contract-compliant perception result."""
    return PerceptionResult(
        timestamp_ns=1700000000000000000,
        frame_id="frame_042",
        detections=[
            Detection(
                object_id="obj_001",
                label="box",
                confidence=0.92,
                bbox=BoundingBox(x=0.5, y=0.3, w=0.2, h=0.15),
            ),
        ],
        latency_ms=34.2,
    )


class TestPerceptionContract:
    """Producer-side: verify perception output matches contract.

    In the producer's CI, feed these checks with real perception output
    rather than the factory above, otherwise they only test the factory.
    """

    def test_bbox_format_is_xywh(self):
        """Contract: bounding box is [x_center, y_center, width, height]."""
        result = make_valid_perception_result()
        for det in result.detections:
            assert hasattr(det.bbox, 'x'), "bbox must have x (center)"
            assert hasattr(det.bbox, 'w'), "bbox must have w (width)"
            assert det.bbox.w > 0, "width must be positive"

    def test_confidence_in_range(self):
        """Contract: confidence is in [0.0, 1.0]."""
        result = make_valid_perception_result()
        for det in result.detections:
            assert 0.0 <= det.confidence <= 1.0

    def test_timestamp_is_nanoseconds(self):
        """Contract: timestamp in nanoseconds since epoch."""
        result = make_valid_perception_result()
        assert result.timestamp_ns > 1_000_000_000_000_000_000  # after 2001


class TestPlanningConsumesContract:
    """Consumer-side: verify planning can handle all valid contract data."""

    def test_planning_handles_empty_detections(self):
        """Planning must handle zero detections gracefully."""
        result = PerceptionResult(
            timestamp_ns=1700000000000000000,
            frame_id="frame_000",
            detections=[],  # nothing detected
            latency_ms=12.0,
        )
        plan = planning_service.plan_from(result)
        assert plan.action == "wait"  # no objects = wait for next frame

    def test_planning_handles_low_confidence(self):
        """Planning must handle detections below confidence threshold."""
        result = make_valid_perception_result()
        result.detections[0].confidence = 0.15  # very low
        plan = planning_service.plan_from(result)
        assert plan.action == "request_redetection"
```
Beyond contracts, you need end-to-end tests that exercise the full pipeline: perception → planning → control → (simulated) actuator. These are slower and more expensive, so you run fewer of them — but they catch bugs that contract tests miss, like timing issues, message ordering problems, and emergent behavior from the interaction of correct components.
Three E2E strategies every robotics test engineer should know:
Happy path: The nominal case works. Object is detected, plan is generated, trajectory is executed, task succeeds. This is table stakes — if the happy path fails, nothing else matters.
Failure injection: Deliberately break one component and verify the system degrades gracefully. What happens when perception returns null? When planning times out? When the controller receives a trajectory with a discontinuity? Each failure mode should trigger a defined recovery behavior, not a crash.
Timeout testing: What happens when a service takes too long? The perception service usually responds in 30ms but occasionally takes 500ms due to GPU contention. Does the planner wait forever? Does it use stale data? Does it fail safe? Timeout behavior is one of the most under-tested aspects of robotics systems.
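A timeout test needs a slow dependency and a caller with a deadline. The sketch below uses the standard library's `concurrent.futures`; `plan_with_deadline`, the two fake perception functions, and the `"fail_safe_stop"` outcome are hypothetical names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def plan_with_deadline(perceive, deadline_s):
    """Wait for perception up to deadline_s, otherwise fail safe.

    Returns ("plan", detections) on time, ("fail_safe_stop", None) on timeout.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(perceive)
    try:
        return "plan", future.result(timeout=deadline_s)
    except FutureTimeout:
        # Never plan on data that has not arrived; stop instead.
        return "fail_safe_stop", None
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def fast_perception():
    time.sleep(0.01)   # typical case
    return ["box"]


def slow_perception():
    time.sleep(0.3)    # simulated GPU contention
    return ["box"]


on_time = plan_with_deadline(fast_perception, deadline_s=0.2)
too_late = plan_with_deadline(slow_perception, deadline_s=0.05)
print(on_time)
print(too_late)
```

The same harness answers the three questions above: the planner does not wait forever, it never sees the late result, and the outcome on timeout is an explicit safe action rather than an exception.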
The format change disaster. The perception team ships an update that changes bounding box format from [x, y, width, height] to [x1, y1, x2, y2]. Their unit tests all pass — the model outputs correct boxes in the new format. Planning's unit tests all pass — they correctly consume [x, y, w, h] format. But there's no contract test checking that perception's output matches planning's expectation. The robot starts reaching for wrong positions in production.
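A producer-side check against a known "golden" scene would have caught this. The sketch assumes two hypothetical perception functions, one emitting center/size boxes and one emitting corner boxes, and compares each against the agreed [x_center, y_center, width, height] values.

```python
# Known scene and the output the contract promises for it (assumed values).
GOLDEN_SCENE = {"cx": 0.5, "cy": 0.3, "w": 0.2, "h": 0.15}
GOLDEN_XYWH = (0.5, 0.3, 0.2, 0.15)


def perception_xywh(scene):
    """Old behavior: [x_center, y_center, width, height]."""
    return (scene["cx"], scene["cy"], scene["w"], scene["h"])


def perception_xyxy(scene):
    """New behavior after the update: [x1, y1, x2, y2] corners."""
    half_w, half_h = scene["w"] / 2, scene["h"] / 2
    return (
        scene["cx"] - half_w,
        scene["cy"] - half_h,
        scene["cx"] + half_w,
        scene["cy"] + half_h,
    )


def honors_bbox_contract(perceive, tol=1e-9):
    """True when the producer's output for the golden scene matches the contract."""
    output = perceive(GOLDEN_SCENE)
    return len(output) == 4 and all(
        abs(got - want) <= tol for got, want in zip(output, GOLDEN_XYWH)
    )


print("xywh producer honors contract:", honors_bbox_contract(perception_xywh))
print("xyxy producer honors contract:", honors_bbox_contract(perception_xyxy))
```

Both formats are four floats, so a schema or type check alone cannot tell them apart; the check has to compare values for a scene whose answer is known.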
Integration environments that drift from production. Your Docker Compose integration test setup uses an older version of the perception model, different GPU drivers, and a mock database. Tests pass in CI but fail on the real robot because the integration environment doesn't match production. Fix: use infrastructure-as-code to keep test environments version-locked to production, and run periodic "environment parity checks" that compare test env configs to production configs.
Message ordering bugs. In unit tests, messages arrive in order because everything runs synchronously. In production, the perception result for frame N+1 might arrive before the planning result for frame N finishes. If the planner doesn't handle out-of-order messages, it plans based on stale data. Integration tests must test with realistic timing, including jitter and reordering.
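One common defense, sketched here under the assumption that every message carries a monotonically increasing sequence number, is to drop anything older than the newest frame already processed. `LatestFrameFilter` is a hypothetical name.

```python
class LatestFrameFilter:
    """Accept a frame only if it is newer than every frame seen so far."""

    def __init__(self):
        self.last_seq = -1

    def accept(self, seq):
        if seq <= self.last_seq:
            return False  # stale or duplicate: planning on it would use old data
        self.last_seq = seq
        return True


# Arrival order with reordering (1 after 2, 4 after 5) and a duplicate (3).
arrivals = [0, 2, 1, 3, 3, 5, 4]

frame_filter = LatestFrameFilter()
accepted = [seq for seq in arrivals if frame_filter.accept(seq)]
print("arrived: ", arrivals)
print("accepted:", accepted)
```

An integration test for ordering then amounts to replaying a shuffled, jittered stream like `arrivals` and asserting that the planner only ever acts on the accepted sequence.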
Testing tells you whether the system works right now. Reliability engineering tells you whether the system will keep working over time. It's the discipline of measuring, tracking, and improving how often and how long a system operates correctly — and it's a core part of any test/reliability engineer interview.
Three concepts form the foundation: SLIs (what you measure), SLOs (what you target), and error budgets (how much failure you can tolerate).
A Service Level Indicator (SLI) is a metric that measures some aspect of the system's reliability. For a web service, common SLIs are request latency and error rate. For a warehouse robot, the SLIs are different:
| SLI | What It Measures | How to Compute |
|---|---|---|
| Task success rate | % of pick/place attempts that complete without intervention | successful_tasks / total_tasks over a time window |
| Uptime | % of scheduled operating hours the robot is available | (scheduled_hours - downtime_hours) / scheduled_hours |
| Task latency | Time from task assignment to completion | p95 of task completion times |
| Safety stop rate | How often the safety system triggers an unplanned stop | safety_stops / operating_hours |
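The four formulas in the table can be sketched as plain functions. The sample numbers are invented for illustration, and the p95 uses the nearest-rank method, which is one of several percentile definitions.

```python
def task_success_rate(successful_tasks, total_tasks):
    return successful_tasks / total_tasks


def uptime(scheduled_hours, downtime_hours):
    return (scheduled_hours - downtime_hours) / scheduled_hours


def p95(samples):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


def safety_stop_rate(safety_stops, operating_hours):
    """Unplanned safety stops per operating hour."""
    return safety_stops / operating_hours


# Invented sample data for one robot over one month.
latencies_s = list(range(1, 21))  # 20 task completion times: 1s .. 20s

print(f"task success rate: {task_success_rate(950, 1000):.1%}")
print(f"uptime:            {uptime(720, 3.6):.2%}")
print(f"task latency p95:  {p95(latencies_s)} s")
print(f"safety stop rate:  {safety_stop_rate(3, 600):.3f} per hour")
```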
A Service Level Objective (SLO) is the target you set for an SLI. "Task success rate SLO: 95%." This means you accept that 5% of tasks may fail — and that's okay. The SLO is an internal engineering target that balances reliability against development velocity.
A Service Level Agreement (SLA) is a contract with customers, for example "99% uptime per month." An SLA should always be less strict than the SLO behind it, because you need a buffer. If your SLO is 99.5% and your SLA is 99%, you have room to slip without breaching the customer contract.
An error budget is the complement of your SLO: 100% minus the target. If your SLO is 99.5% availability, your error budget is 0.5%, which is how much failure you're allowed over the measurement period.
Here's the math for a monthly error budget:
For a 99.5% availability SLO over a 30-day month (720 hours):

$$\text{error budget} = (1 - 0.995) \times 720\ \text{h} = 3.6\ \text{h}$$
You have 3.6 hours of allowed downtime per month. Every incident burns some of this budget. When the budget hits zero, you stop shipping new features and focus entirely on reliability — this is a feature freeze.
Mean Time Between Failures (MTBF) measures how long the system runs before failing. Mean Time to Recovery (MTTR) measures how long it takes to get back to working state after a failure. Together, they determine availability:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

Worked example: Your robot fleet has MTBF = 24 hours and MTTR = 30 minutes (0.5 hours).

$$\text{Availability} = \frac{24}{24 + 0.5} = \frac{24}{24.5} \approx 97.96\%$$
To improve availability, you can either increase MTBF (make the system fail less often — harder) or decrease MTTR (make the system recover faster — usually easier). This is why modern reliability engineering focuses heavily on fast recovery: automated restarts, health checks, failover mechanisms, and runbooks.
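The trade-off can be checked numerically with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch compares doubling MTBF against halving MTTR for a fleet with MTBF = 24 hours and MTTR = 0.5 hours.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(24, 0.5)
double_mtbf = availability(48, 0.5)   # fail half as often
halve_mttr = availability(24, 0.25)   # recover twice as fast

print(f"baseline:     {baseline:.4%}")
print(f"double MTBF:  {double_mtbf:.4%}")
print(f"halve MTTR:   {halve_mttr:.4%}")
print("same gain:   ", math.isclose(double_mtbf, halve_mttr))
```

Doubling MTBF and halving MTTR produce the same availability, because only the ratio of the two matters. The choice therefore comes down to which change is cheaper to make, and shaving recovery time is usually the easier engineering problem.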
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Incident:
    started: datetime
    resolved: datetime
    severity: int  # 1-4
    description: str

    @property
    def duration_hours(self) -> float:
        return (self.resolved - self.started).total_seconds() / 3600


@dataclass
class ErrorBudget:
    slo_percent: float  # e.g. 99.5
    period_days: int = 30
    incidents: List[Incident] = field(default_factory=list)

    @property
    def total_hours(self) -> float:
        return self.period_days * 24

    @property
    def budget_hours(self) -> float:
        """Total allowed downtime for the period."""
        return (1 - self.slo_percent / 100) * self.total_hours

    @property
    def consumed_hours(self) -> float:
        """Downtime consumed by incidents so far."""
        return sum(i.duration_hours for i in self.incidents)

    @property
    def remaining_hours(self) -> float:
        return max(0, self.budget_hours - self.consumed_hours)

    @property
    def burn_rate(self) -> float:
        """Budget consumption as a percentage."""
        if self.budget_hours == 0:
            return 100.0
        return (self.consumed_hours / self.budget_hours) * 100

    @property
    def feature_freeze(self) -> bool:
        """True if error budget is exhausted."""
        return self.remaining_hours <= 0

    def status(self) -> str:
        if self.feature_freeze:
            return "FEATURE FREEZE - budget exhausted"
        elif self.burn_rate > 75:
            return f"WARNING - {self.remaining_hours:.1f}h remaining"
        return f"OK - {self.remaining_hours:.1f}h remaining"


# Usage
budget = ErrorBudget(slo_percent=99.5)
print(f"Monthly budget: {budget.budget_hours:.1f}h")  # 3.6h

budget.incidents.append(Incident(
    started=datetime(2026, 5, 3, 2, 15),
    resolved=datetime(2026, 5, 3, 4, 15),
    severity=2,
    description="Perception model OOM on high-res frames",
))
# A 2.0h incident burns about 56% of the 3.6h budget, below the 75% warning line
print(budget.status())  # OK - 1.6h remaining
```
SLOs too tight. If you set a 99.99% availability SLO, your error budget is 4.3 minutes per month. A single 5-minute incident triggers a feature freeze. The team spends all its time firefighting and never ships new features. The product stagnates, customers leave for competitors that iterate faster.
SLOs too loose. If you set a 95% availability SLO, your error budget is 36 hours per month. The team never hits the budget. There's no pressure to fix reliability issues. Customers experience frequent failures and churn, even though you're technically "within SLO."
Poorly defined SLIs. "Task success rate" sounds simple. But what counts as a "task"? Does a retry count as a new task or the same task? If the robot detects no objects and waits, is that a "successful idle" or a "failed detection"? Vague SLI definitions lead to arguments about whether you're meeting your SLO — and arguments about SLOs are arguments about whether the system is reliable, which is the worst possible thing to be uncertain about.
No matter how good your testing is, incidents will happen. Robots will drop things. Services will crash. Models will hallucinate. The question isn't whether you'll have incidents — it's how fast you detect them, how effectively you respond, and how honestly you learn from them.
Incident management is the discipline of handling production failures systematically. It covers everything from the moment an alert fires to the final action item from the post-mortem. An interviewer will probe whether you've lived through real incidents and whether you understand the process — not just theoretically, but in the messy reality of 2am pages.
Not all incidents are created equal. A robot that drops a heavy item on a person is fundamentally different from a dashboard that shows stale data. Severity classification determines how fast you respond, who gets paged, and what resources get allocated.
| Severity | Definition | Robotics Example | Response Time | Who's Paged |
|---|---|---|---|---|
| SEV1 | Safety-critical or total system failure | Robot drops heavy object near person; entire fleet offline | <5 min | On-call + engineering lead + safety officer |
| SEV2 | Major feature broken, no workaround | Perception model fails on all boxes; pick success rate drops to 20% | <15 min | On-call + team lead |
| SEV3 | Feature degraded, workaround exists | Gripper intermittently fails on soft objects; manual intervention needed 1x/hour | <1 hour | On-call engineer |
| SEV4 | Minor issue, no customer impact | Monitoring dashboard shows stale metrics; logging volume doubled | Next business day | Ticket assigned, no page |
Every incident follows a lifecycle. Skipping steps leads to repeated incidents, burned-out engineers, and eroded customer trust.
A blameless post-mortem is a structured analysis of an incident that focuses on systems and processes, not individuals. The goal is to learn and improve — not to find someone to blame.
The structure:
1. Timeline. A chronological list of events with timestamps. "14:23 — Alert fires: pick success rate dropped below 80%. 14:25 — On-call acknowledges page. 14:31 — On-call identifies that model version 2.3.1 was deployed at 13:45. 14:35 — On-call rolls back to model 2.3.0. 14:38 — Pick success rate recovers to 93%."
2. Root cause. The specific technical failure that caused the incident. "Model version 2.3.1 was trained on a dataset that excluded garments. When the robot encountered garments in production, the perception model returned low-confidence detections, causing the planner to skip those objects. The training data pipeline filter was misconfigured to exclude image_class='garment' instead of image_class='test_garment'."
3. Contributing factors. Things that made the incident worse or harder to detect. "The model validation pipeline did not include garment test cases. The deployment process does not require sign-off from the QE team. The alerting threshold was set at 80% but should have been 90% — the delay in alerting extended the impact by 20 minutes."
4. Action items. Specific, measurable improvements with owners and deadlines. Not "improve testing" but "Add 50 garment images to the model validation suite by May 20 (owner: Alice). Add QE sign-off gate to model deployment pipeline by May 25 (owner: Bob). Raise alerting threshold from 80% to 90% by May 15 (owner: Carol)."
Two RCA methods that interviewers expect you to know:
5 Whys. Start with the problem and ask "why" repeatedly until you reach a root cause that you can fix with a systemic change (not a human behavior change).
Fishbone diagram (Ishikawa). For complex incidents with multiple contributing factors. Categories for robotics: Hardware (sensor drift, actuator wear), Software (bug, config error), Environment (lighting change, obstacle), Process (missing test, no review), People (training gap, handoff error). Each category gets its own "bone" with specific contributing factors listed.
```python
# incident_report.py: structured incident template
from dataclasses import dataclass, field
from datetime import datetime
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    SEV1 = 1  # Safety-critical / total failure
    SEV2 = 2  # Major feature broken
    SEV3 = 3  # Degraded, workaround exists
    SEV4 = 4  # Minor, no customer impact


@dataclass
class TimelineEvent:
    timestamp: datetime
    description: str
    phase: str  # "detect", "triage", "mitigate", "resolve"
    actor: str  # who did this (role, not name: blameless!)


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: datetime
    priority: str  # "P0", "P1", "P2"
    status: str = "open"


@dataclass
class IncidentReport:
    id: str  # "INC-2026-042"
    title: str
    severity: Severity
    detected_at: datetime
    mitigated_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    # When the failure actually began, usually reconstructed from logs.
    # Needed for time_to_detect; the original computed it from detected_at alone.
    started_at: Optional[datetime] = None

    # Impact
    customer_impact: str = ""
    sli_impact: str = ""  # "Task success rate dropped to 45%"
    error_budget_burn: float = 0.0  # hours consumed

    # Analysis
    timeline: List[TimelineEvent] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: List[str] = field(default_factory=list)
    five_whys: List[str] = field(default_factory=list)

    # Follow-up
    action_items: List[ActionItem] = field(default_factory=list)

    @property
    def time_to_detect(self) -> Optional[float]:
        """Minutes from incident start to first detection."""
        if self.started_at is None:
            return None
        return (self.detected_at - self.started_at).total_seconds() / 60

    @property
    def time_to_mitigate(self) -> Optional[float]:
        """Minutes from detection to mitigation."""
        if self.mitigated_at is None:
            return None
        return (self.mitigated_at - self.detected_at).total_seconds() / 60
```
Blame culture kills learning. If engineers fear punishment for mistakes, they hide failures, downplay severity, and write superficial post-mortems. The 5 Whys stop at "the engineer made an error" instead of reaching the systemic root cause. The same class of incident repeats because the process that enabled it was never fixed.
Action items that never get done. The post-mortem produces 8 action items. All are P1. None have deadlines. Six months later, none are done, and the same incident happens again. Fix: every action item needs an owner, a deadline, and a priority. Track post-mortem action items in the sprint backlog, not in a document that nobody reads. Review completion rates monthly.
Severity inflation/deflation. A team that over-classifies everything as SEV1 burns out the on-call rotation and desensitizes responders to real emergencies ("cry wolf" effect). A team that under-classifies everything as SEV3 lets serious issues linger for hours while the on-call engineer finishes dinner. Calibrate severity by tying it to specific, measurable criteria — not to how the reporter feels about the issue.
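One way to tie severity to measurable criteria is to encode the classification as a function of observable facts rather than the reporter's judgment. A minimal sketch; the field names are hypothetical, and the rules are one reading of the severity table earlier in this chapter:

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Observable facts about an incident (hypothetical field names)."""
    safety_risk: bool        # a person was, or could have been, harmed
    fleet_offline: bool      # total system failure
    customer_impact: bool    # customers can see the problem
    workaround_exists: bool  # operations can continue with manual steps


def classify_severity(sig: IncidentSignal) -> str:
    """Map facts to a severity level, checking the worst case first."""
    if sig.safety_risk or sig.fleet_offline:
        return "SEV1"
    if not sig.customer_impact:
        return "SEV4"
    return "SEV3" if sig.workaround_exists else "SEV2"


stale_dashboard = IncidentSignal(False, False, False, True)
soft_object_grip = IncidentSignal(False, False, True, True)
print(classify_severity(stale_dashboard), classify_severity(soft_object_grip))
```

Because the inputs are facts, two responders looking at the same incident should land on the same level.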
You've tested web apps. You've tested APIs. Now you're testing a machine that can exert 200 Newtons of force, moves at 2 meters per second, and relies on sensors that drift with temperature. The stakes are different here. A flaky unit test wastes a developer's time. A flaky sensor test puts a person's safety at risk.
Robot-specific testing breaks into three domains: sensor validation (is the robot perceiving the world correctly?), actuator testing (is the robot moving correctly?), and timing verification (is everything happening fast enough?). Each domain has failure modes that simply don't exist in pure software systems.
Camera calibration verification is the foundation. A robot's cameras have intrinsic parameters (focal length, distortion coefficients) and extrinsic parameters (the camera's position and orientation relative to the robot's base frame). Both drift over time — someone bumps the camera mount, thermal expansion changes the housing, or a firmware update resets internal parameters.
The test: place ArUco markers or a calibration checkerboard at known 3D positions. Capture an image. Project the known 3D points into the image using the current calibration. Measure the reprojection error — the pixel distance between where the points should appear and where they actually appear. If the mean reprojection error exceeds 1.5 pixels, the calibration is stale and must be refreshed before any perception test is trustworthy.
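$$e_{\text{reproj}} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{(u_i - \hat{u}_i)^2 + (v_i - \hat{v}_i)^2}$$

Where (u, v) is the observed pixel position and (û, v̂) is the projected position from the 3D calibration target using current camera parameters, averaged over the N target points.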
IMU drift testing checks whether the inertial measurement unit accumulates bias over time. Place the robot on a stable surface and record the IMU at rest. A 60-second capture is enough for a quick static-bias check; computing the Allan variance needs a much longer recording (typically hours), because every averaging time on the curve needs many samples behind it. The bias instability (the minimum of the Allan deviation curve) tells you how fast the IMU's readings drift. For a robot arm that doesn't move much, IMU drift matters less than for a mobile robot, but it still corrupts any inertial-based safety monitoring.
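The Allan deviation itself is short to compute. A minimal sketch of the non-overlapping estimator for a single axis, run here on synthetic white noise (an assumption made only so the example is self-contained; real data would come from a long at-rest recording):

```python
import numpy as np


def allan_deviation(samples: np.ndarray, cluster_size: int) -> float:
    """Non-overlapping Allan deviation at tau = cluster_size / sample_rate."""
    n_clusters = len(samples) // cluster_size
    if n_clusters < 2:
        raise ValueError("need at least two full clusters")
    trimmed = samples[: n_clusters * cluster_size]
    # Average the signal over consecutive windows of length tau.
    cluster_means = trimmed.reshape(n_clusters, cluster_size).mean(axis=1)
    # Allan variance: half the mean squared difference of adjacent averages.
    allan_var = 0.5 * np.mean(np.diff(cluster_means) ** 2)
    return float(np.sqrt(allan_var))


rng = np.random.default_rng(0)
rate_hz = 100
gyro = rng.normal(0.0, 1.0, size=200_000)  # synthetic white noise, sigma = 1
for m in (1, 10, 100, 1000):
    print(f"tau={m / rate_hz:7.2f}s  adev={allan_deviation(gyro, m):.4f}")
```

For pure white noise the curve keeps falling as tau grows; on a real IMU it flattens and turns upward, and the bottom of that curve is the bias instability the text refers to.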
LiDAR point cloud validation uses a reference environment with known geometry. Scan the reference, fit planes to known flat surfaces, and measure the deviation. Point cloud noise above 5mm RMS on a flat surface at 2 meters indicates sensor degradation or miscalibration.
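The flat-surface check reduces to a least-squares plane fit followed by an RMS of the residuals. A minimal sketch, assuming the points for one known flat surface have already been segmented out of the scan; the synthetic wall and its 2mm noise level are made up for illustration:

```python
import numpy as np


def plane_fit_rms(points: np.ndarray) -> float:
    """RMS distance of (N, 3) points to their least-squares plane, in input units."""
    centered = points - points.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    distances = centered @ normal
    return float(np.sqrt(np.mean(distances ** 2)))


rng = np.random.default_rng(0)
xy = rng.uniform(-0.5, 0.5, size=(2000, 2))       # 1m x 1m patch of wall
z = 2.0 + rng.normal(0.0, 0.002, size=2000)       # 2m away, 2mm range noise
wall = np.column_stack([xy, z])

rms_mm = plane_fit_rms(wall) * 1000.0
print(f"RMS deviation: {rms_mm:.2f} mm, passed={rms_mm < 5.0}")
```

Fitting the plane from the data, rather than assuming the wall's pose, keeps extrinsic calibration error out of the noise measurement.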
Joint limit verification: command each joint to its software-defined limit. Verify the encoder reading matches the expected position within 0.1 degrees. Then command 1 degree past the limit — the controller must reject the command without executing any motion. This catches cases where software limits have drifted from hardware limits after a firmware update.
Torque verification: command a known joint torque with the arm pressing against a calibrated load cell. The measured force should match the force predicted from the commanded torque within 5% after accounting for the gear ratio and the moment arm. Drift here means the motor or gearbox is wearing, and the robot's actual force output no longer matches what the software thinks it's applying.
Backlash measurement: command a joint to position A, then to position B, then back to position A. The encoder reading on return should match the original position. Any hysteresis — a gap between the outgoing and returning position — is backlash in the gear train. Typical acceptable values: less than 0.05 degrees for precision manipulation, less than 0.2 degrees for gross motion. Backlash grows with gear wear and is one of the earliest indicators that hardware maintenance is needed.
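The joint limit check described above can be sketched in the same style. This assumes a controller interface with `read_joint` and `move_joint` where `move_joint` returns `False` when it rejects a command; that return convention and the `FakeRobot` stand-in are assumptions for illustration, and a real controller may raise an error or report a status code instead.

```python
class FakeRobot:
    """Illustrative stand-in for a controller interface (not a real API)."""

    def __init__(self, limits_deg):
        self.limits_deg = limits_deg                  # {joint_id: (low, high)}
        self.position_deg = {j: 0.0 for j in limits_deg}

    def read_joint(self, joint_id):
        return self.position_deg[joint_id]

    def move_joint(self, joint_id, target_deg, wait=True):
        low, high = self.limits_deg[joint_id]
        if not low <= target_deg <= high:
            return False                              # rejected, no motion
        self.position_deg[joint_id] = target_deg
        return True


def check_joint_limit(robot, joint_id, limit_deg, direction,
                      tol_deg=0.1, overshoot_deg=1.0):
    """Drive to the limit, then command past it and confirm nothing moves.

    direction is +1 for an upper limit and -1 for a lower limit.
    """
    reached = robot.move_joint(joint_id, limit_deg, wait=True)
    at_limit = robot.read_joint(joint_id)
    past = limit_deg + direction * overshoot_deg
    accepted_past = robot.move_joint(joint_id, past, wait=True)
    after = robot.read_joint(joint_id)
    result = {
        "joint_id": joint_id,
        "reaches_limit": bool(reached) and abs(at_limit - limit_deg) <= tol_deg,
        "rejects_overshoot": not accepted_past,
        "no_motion_after_reject": abs(after - at_limit) <= tol_deg,
    }
    result["passed"] = (result["reaches_limit"] and result["rejects_overshoot"]
                        and result["no_motion_after_reject"])
    return result


robot = FakeRobot({1: (-170.0, 170.0)})
upper = check_joint_limit(robot, 1, 170.0, direction=+1)
lower = check_joint_limit(robot, 1, -170.0, direction=-1)
print(upper["passed"], lower["passed"])
```

Checking the encoder after the rejected command matters: a controller that reports a rejection but still twitches the joint has failed the test.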
HIL testing means running the real robot controller against a simulated environment. The controller thinks it's moving a real robot — it receives simulated sensor readings and sends real motor commands — but the "plant" (the physical system) is a simulation. This catches controller bugs (timing, saturation, mode switching) without risking hardware.
The key advantage: you can test thousands of scenarios that would be dangerous or impractical on real hardware. What happens when two joints hit their limits simultaneously? What if a sensor returns NaN? What if the network drops for 50ms mid-trajectory? HIL lets you inject these faults systematically.
A typical robot control loop runs at 100Hz — meaning the entire sense-decide-act cycle must complete in 10ms. If it doesn't, the robot misses a control deadline. One missed deadline causes a small jerk. Sustained missed deadlines cause oscillation, overshoot, or loss of control.
The perception pipeline is the usual bottleneck. Camera capture takes 2ms. Image preprocessing takes 1ms. Model inference takes the rest, and the whole pipeline usually finishes in about 8ms. But occasionally it takes 15ms because of a GPU memory allocation stall, a cache miss, or a garbage collection pause. That 15ms blows the 10ms budget.
```python
from dataclasses import dataclass

import cv2
import numpy as np


@dataclass
class CalibrationResult:
    mean_reproj_error: float
    max_reproj_error: float
    passed: bool


def check_camera_calibration(
    observed_points: np.ndarray,  # (N, 2) pixel coords of detected markers
    world_points: np.ndarray,     # (N, 3) known 3D positions
    camera_matrix: np.ndarray,    # 3x3 intrinsic matrix
    dist_coeffs: np.ndarray,      # distortion coefficients
    rvec: np.ndarray,             # rotation vector (extrinsics)
    tvec: np.ndarray,             # translation vector (extrinsics)
    threshold: float = 1.5        # max acceptable mean error in pixels
) -> CalibrationResult:
    """Project 3D points and measure reprojection error."""
    projected, _ = cv2.projectPoints(
        world_points, rvec, tvec, camera_matrix, dist_coeffs
    )
    projected = projected.reshape(-1, 2)
    errors = np.linalg.norm(observed_points - projected, axis=1)
    mean_error = float(np.mean(errors))
    return CalibrationResult(
        mean_reproj_error=mean_error,
        max_reproj_error=float(np.max(errors)),
        passed=mean_error < threshold
    )


def check_imu_bias(
    readings: np.ndarray,          # (T, 3) accelerometer at rest
    gravity: float = 9.81,
    bias_threshold: float = 0.05   # m/s^2 max acceptable bias
) -> dict:
    """Measure IMU bias at rest. Z-axis should read ~9.81."""
    mean_reading = np.mean(readings, axis=0)
    expected = np.array([0.0, 0.0, gravity])
    bias = mean_reading - expected
    bias_magnitude = float(np.linalg.norm(bias))
    return {
        "bias_xyz": bias.tolist(),
        "bias_magnitude": bias_magnitude,
        "passed": bias_magnitude < bias_threshold,
        "noise_std": np.std(readings, axis=0).tolist()
    }


def check_joint_backlash(
    joint_id: int,
    robot,                      # robot controller interface
    test_angle: float = 10.0,   # degrees to move
    threshold: float = 0.05     # degrees max hysteresis
) -> dict:
    """Command joint A->B->A and measure hysteresis."""
    pos_a = robot.read_joint(joint_id)
    robot.move_joint(joint_id, pos_a + test_angle, wait=True)
    robot.move_joint(joint_id, pos_a, wait=True)
    pos_return = robot.read_joint(joint_id)
    hysteresis = abs(pos_return - pos_a)
    return {
        "joint_id": joint_id,
        "hysteresis_deg": hysteresis,
        "passed": hysteresis < threshold,
        "wear_warning": hysteresis > threshold * 0.8
    }
```
Your perception model worked great in the lab. Accuracy was 0.92 mAP. You deployed it to the warehouse. Three weeks later, accuracy is 0.78 and nobody noticed until a customer complained that the robot keeps missing items. What went wrong?
Distribution shift — the silent killer of deployed ML systems. The training data was collected under lab conditions: controlled lighting, clean objects, consistent backgrounds. The warehouse has fluorescent flicker, dusty lenses, new product SKUs the model has never seen, and seasonal changes in ambient light. The model didn't break — the world changed around it.
The core technique: maintain a reference distribution from training data and continuously compare production data against it. For image data, compute feature embeddings (using the model's penultimate layer or a separate feature extractor) and measure the KL divergence between the reference and production distributions.
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}$$

Where P is the reference (training) distribution and Q is the production distribution over feature embeddings. When $D_{KL}$ crosses a threshold, the production data has drifted far enough from training data to warrant investigation.
In practice, you don't compute KL divergence on raw pixels — you compute it on embedding distributions. Extract the feature vector from each production image, bin these into a histogram (or use kernel density estimation), and compare against the training feature distribution. A simpler alternative: track the mean cosine distance between each production embedding and its nearest neighbor in the training set. When this distance trends upward, your model is seeing increasingly unfamiliar inputs.
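The nearest-neighbor alternative is a few lines of NumPy. A minimal sketch using brute-force search, which is fine for a few thousand embeddings but would need an approximate nearest-neighbor index at fleet scale; the synthetic embeddings are made up so the example runs on its own:

```python
import numpy as np


def mean_nn_cosine_distance(train_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    """Mean cosine distance from each production embedding to its nearest training embedding."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    prod = prod_emb / np.linalg.norm(prod_emb, axis=1, keepdims=True)
    similarities = prod @ train.T            # (M, N) cosine similarities
    nearest = similarities.max(axis=1)       # best match per production sample
    return float(np.mean(1.0 - nearest))


rng = np.random.default_rng(0)
offset = np.zeros(16)
offset[0] = 3.0
train = rng.normal(size=(500, 16)) + offset
# "Familiar" production data: training-like samples with a little noise.
familiar = train[:100] + 0.01 * rng.normal(size=(100, 16))
# "Shifted" production data: drawn from a different region of embedding space.
shifted = rng.normal(size=(100, 16)) - offset

print(f"familiar: {mean_nn_cosine_distance(train, familiar):.4f}")
print(f"shifted:  {mean_nn_cosine_distance(train, shifted):.4f}")
```

Tracking this number per day (or per shift) gives a trend line; the alert threshold is something to calibrate against past incidents rather than pick up front.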
Every model update gets gated on a benchmark suite before deployment. The suite has four parts:
| Benchmark | What it measures | Gate criterion | Runtime |
|---|---|---|---|
| Accuracy benchmark | mAP on held-out validation set | Must not drop >1% from baseline | ~30 min (GPU) |
| Latency benchmark | P50, P95, P99 inference time | P99 must stay under deadline | ~10 min |
| Canonical episodes | Pass/fail on known failure cases | All must pass (zero regressions) | ~20 min |
| Embedding drift | Cosine distance from previous model | Must stay under threshold | ~5 min |
The canonical episode library is the most valuable artifact in your test suite. Every time the model fails in a novel way and the failure is fixed, add that scenario to the library. Over time, this library becomes a comprehensive regression net that prevents the model from forgetting past lessons.
Never do a full fleet swap of a new model. Instead, deploy the new model to a single robot (the canary) while the rest of the fleet stays on the old model. Run both for 48 hours. Compare task success rate, intervention frequency, and latency. Define the rollback criteria on those metrics before deployment.
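Rollback criteria are easiest to honor when they are written down as code before the canary starts. A minimal sketch comparing the three metrics named above; every threshold here is a hypothetical placeholder to be replaced with values the team agrees on:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FleetMetrics:
    task_success_rate: float        # fraction in [0, 1]
    interventions_per_hour: float
    p99_latency_ms: float


def rollback_reasons(
    canary: FleetMetrics,
    baseline: FleetMetrics,
    max_success_drop: float = 0.02,       # hypothetical: 2 percentage points
    max_intervention_ratio: float = 1.5,  # hypothetical: 50% more interventions
    latency_deadline_ms: float = 10.0,    # hypothetical: control-loop budget
) -> List[str]:
    """Return the violated criteria; an empty list means the canary may continue."""
    reasons = []
    if baseline.task_success_rate - canary.task_success_rate > max_success_drop:
        reasons.append("task success rate dropped beyond tolerance")
    if canary.interventions_per_hour > baseline.interventions_per_hour * max_intervention_ratio:
        reasons.append("intervention frequency rose beyond tolerance")
    if canary.p99_latency_ms > latency_deadline_ms:
        reasons.append("P99 latency exceeds deadline")
    return reasons


baseline = FleetMetrics(0.93, 1.0, 9.1)
canary = FleetMetrics(0.88, 1.2, 9.4)
print(rollback_reasons(canary, baseline))
```

Returning the list of reasons, rather than a bare boolean, gives the on-call engineer the first line of the incident timeline for free.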
Models are only as good as their data. Data quality testing catches corruption before it reaches training:
Schema validation: Every data sample must have the expected fields, types, and ranges. An image must be (H, W, 3) uint8. A label must reference a valid class ID. A bounding box must have x_min < x_max. These seem obvious, but a single corrupted sample in a 10M dataset can poison a training run.
Label quality audit: Sample 500 labels from each new batch and have a human verify them. Track the error rate. If label error exceeds 3%, reject the batch. Common label errors: class confusion between similar objects, incorrect bounding box coordinates from annotation tool bugs, missing annotations for partially occluded objects.
Class balance monitoring: Track the distribution of classes in your training data over time. If a new data batch shifts the distribution (e.g., suddenly 80% of images are of one product type), the model will overfit to that type and underperform on rare classes.
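Class balance monitoring can be reduced to a single number per batch. A minimal sketch using total variation distance between the class shares of a reference set and a new batch; the choice of metric and any alert threshold are assumptions, not something the pipeline above prescribes:

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def class_shares(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def class_balance_shift(reference: Sequence[Hashable], batch: Sequence[Hashable]) -> float:
    """Total variation distance between class distributions (0 = identical, 1 = disjoint)."""
    ref, new = class_shares(reference), class_shares(batch)
    classes = set(ref) | set(new)
    return 0.5 * sum(abs(ref.get(c, 0.0) - new.get(c, 0.0)) for c in classes)


reference = ["box"] * 50 + ["bag"] * 50
skewed_batch = ["box"] * 80 + ["bag"] * 20
print(f"shift = {class_balance_shift(reference, skewed_batch):.2f}")
```

A class that appears in the batch but not in the reference (a new SKU, for example) contributes its full share to the distance, so new classes surface without special handling.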
```python
import time
from dataclasses import dataclass
from typing import List

import numpy as np
from scipy.stats import entropy


@dataclass
class ValidationReport:
    accuracy_passed: bool
    latency_passed: bool
    drift_passed: bool
    schema_passed: bool
    deploy_ok: bool
    details: dict


def compute_kl_divergence(
    ref_embeddings: np.ndarray,   # (N, D) training embeddings
    prod_embeddings: np.ndarray,  # (M, D) production embeddings
    n_bins: int = 50
) -> float:
    """Compute KL divergence between embedding distributions."""
    # Project to 1D via PCA first component for simplicity
    combined = np.vstack([ref_embeddings, prod_embeddings])
    centered = combined - combined.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = Vt[0]  # first principal component
    ref_proj = ref_embeddings @ pc1
    prod_proj = prod_embeddings @ pc1

    # Histogram both with shared bins
    lo = min(ref_proj.min(), prod_proj.min())
    hi = max(ref_proj.max(), prod_proj.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(ref_proj, bins, density=True)
    q, _ = np.histogram(prod_proj, bins, density=True)

    # Smooth to avoid log(0), then renormalize
    eps = 1e-8
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(entropy(p, q))


def validate_data_schema(samples: List[dict]) -> dict:
    """Check data samples conform to expected schema."""
    errors = []
    for i, s in enumerate(samples):
        image = s["image"]
        # Guard the rank first so a 2D image reports an error instead of raising.
        if image.ndim != 3 or image.shape[2] != 3:
            errors.append(f"Sample {i}: expected (H, W, 3), got shape {image.shape}")
        if image.dtype != np.uint8:
            errors.append(f"Sample {i}: expected uint8, got {image.dtype}")
        for box in s.get("bboxes", []):
            if box["x_min"] >= box["x_max"]:
                errors.append(f"Sample {i}: x_min >= x_max")
            if box["class_id"] < 0:
                errors.append(f"Sample {i}: negative class_id")
    return {"valid": len(errors) == 0, "errors": errors}


def run_model_validation(
    model, val_loader, ref_embeddings, prev_embeddings,
    baseline_accuracy,        # accuracy of the currently deployed model
    accuracy_threshold=0.01, latency_p99_ms=10.0, drift_threshold=0.5
) -> ValidationReport:
    """Full validation gate for model deployment."""
    # prev_embeddings is accepted for a previous-model drift check
    # but is not used in this sketch.
    latencies, correct, total = [], 0, 0
    new_embeddings = []
    for batch in val_loader:
        t0 = time.perf_counter()
        preds, embeds = model.predict_with_embeddings(batch)
        latencies.append((time.perf_counter() - t0) * 1000)
        correct += int((preds == batch["labels"]).sum())
        total += len(batch["labels"])
        new_embeddings.append(embeds)

    accuracy = correct / total
    p99 = float(np.percentile(latencies, 99))
    kl = compute_kl_divergence(ref_embeddings, np.vstack(new_embeddings))

    # Gate on the drop from baseline, not on an absolute accuracy of 99%.
    acc_ok = accuracy >= baseline_accuracy - accuracy_threshold
    lat_ok = p99 < latency_p99_ms
    drift_ok = kl < drift_threshold
    return ValidationReport(
        accuracy_passed=acc_ok,
        latency_passed=lat_ok,
        drift_passed=drift_ok,
        schema_passed=True,  # schema is validated separately, before training
        deploy_ok=acc_ok and lat_ok and drift_ok,
        details={"accuracy": accuracy, "p99_ms": p99, "kl_div": kl}
    )
```
Your robot's perception pipeline runs fine with one object in the scene. How about fifty objects? How about fifty objects while the robot is moving, the camera feed is running at 30fps, three other services are logging to disk, and the GPU is also running the planning model? Performance testing answers the question: "At what point does this system fall over, and what breaks first?"
This chapter is the one interviewers use to separate "I've read about testing" from "I've actually profiled a real system." They want to hear specific numbers, specific tools, and specific failure stories.
Three numbers define your system's latency behavior:
P50 (median) is the "normal case." Half your requests are faster, half are slower. If your robot's P50 perception latency is 8ms and the control deadline is 10ms, things look fine.
P95 is the "bad day." One in twenty requests is this slow or slower. If P95 is 12ms, you're missing the 10ms control deadline at least 5% of the time. At a 100Hz loop rate, that's a robot hesitating once every 0.2 seconds.
P99 is the "surprise." One in a hundred. If P99 is 45ms, then at 100Hz the robot has a 45ms gap in its control loop about once every second. That's a visible stutter, and depending on the task, it could mean dropping an object or colliding with an obstacle.
Every system has a throughput at which latency goes from "flat and predictable" to "exponentially increasing." This is the saturation point — and finding it is the entire purpose of load testing.
Below saturation: requests arrive, get processed, leave. Latency is determined by processing time alone. Above saturation: requests arrive faster than they can be processed. A queue builds. Each new request waits behind all the queued ones. Latency grows without bound.
The curve has three distinct regions:
| Region | Load level | Latency behavior | What's happening |
|---|---|---|---|
| Linear | 0-60% of capacity | Flat, predictable | Plenty of headroom. Requests processed immediately. |
| Knee | 60-85% of capacity | Starts curving upward | Queue occasionally non-empty. P99 diverges from P50. |
| Hockey stick | 85-100%+ of capacity | Exponential growth | Queue always non-empty. Latency dominated by wait time. |
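The shape of the curve in the table above falls out of basic queueing theory. A minimal sketch, assuming an M/M/1 queue (random arrivals, one server), which is a textbook simplification and not a model of any particular robot stack. Under that assumption the mean time in the system is the service time divided by (1 - utilization):

```python
def mm1_mean_latency_ms(utilization: float, service_ms: float) -> float:
    """Mean time in system (wait + service) for an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1); at 1.0 the queue never drains")
    return service_ms / (1.0 - utilization)


SERVICE_MS = 8.0  # hypothetical per-request processing time
for load in (0.30, 0.60, 0.85, 0.95, 0.99):
    latency = mm1_mean_latency_ms(load, SERVICE_MS)
    print(f"load {load:4.0%} -> mean latency {latency:7.1f} ms")
```

The same 8ms of work costs twice as much at 50% load and ten times as much at 90%, which is why the knee arrives well before the system looks "full."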
You've been told "the perception pipeline is slow." Here's the systematic approach:
Step 1: Instrument. Add timestamps at every boundary: frame capture complete, preprocessing complete, model inference complete, postprocessing complete, result dispatched. Compute the time delta for each stage.
Step 2: Profile 1000 frames. Collect the timing data. Compute P50/P95/P99 for each stage independently.
Step 3: Find the bottleneck. Typical results for a 640x480 RGB frame:
| Stage | P50 | P95 | P99 | % of total |
|---|---|---|---|---|
| Frame capture | 1.2ms | 1.5ms | 2.1ms | 15% |
| Preprocessing (resize, normalize) | 0.8ms | 1.0ms | 1.2ms | 10% |
| Model inference | 5.2ms | 9.8ms | 38ms | 65% |
| Postprocessing (NMS, tracking) | 0.8ms | 1.2ms | 2.5ms | 10% |
The bottleneck is model inference — specifically, the P99 spike to 38ms. That's the target. Why does it spike? Common causes: GPU memory allocation (first inference after a long idle), CUDA kernel launch latency variability, or thermal throttling on the GPU.
Step 4: Fix and re-profile. Warm up the GPU with a dummy inference on startup. Pin GPU clock frequency to avoid thermal throttling variability. Pre-allocate CUDA memory. Re-measure: if P99 drops from 38ms to 12ms, you've removed most of the tail. Keep iterating until the end-to-end P99 fits inside the control deadline.
Performance isn't just latency — it's resource consumption over time. The three resources that break robots:
CPU: If the control loop shares CPU cores with logging, network I/O, and visualization, context switches add latency jitter. Profile CPU utilization per core. Pin the control loop to a dedicated core using CPU affinity.
GPU memory: Models that use variable-length inputs (like long context windows) have variable GPU memory usage. Profile peak GPU memory during a 30-minute run. If it grows monotonically, you have a memory leak. If it spikes and recovers, you have fragmentation. Both are problems.
System memory: Logging without rotation fills RAM. Image buffers that aren't freed grow the heap. Profile system memory every 60 seconds during a 2-hour run. Fit a linear regression. If the slope is positive, you're leaking memory and will eventually OOM.
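The regression step for system memory takes only a few lines. A minimal sketch using `numpy.polyfit` on synthetic samples (one per 60 seconds over 2 hours, with a made-up leak of 0.01 MB/s); in practice the samples would come from whatever process monitor the robot already runs:

```python
import numpy as np


def memory_growth_mb_per_hour(times_s: np.ndarray, rss_mb: np.ndarray) -> float:
    """Slope of a least-squares line through the memory samples, in MB/hour."""
    slope_mb_per_s, _intercept = np.polyfit(times_s, rss_mb, deg=1)
    return float(slope_mb_per_s * 3600.0)


rng = np.random.default_rng(0)
times_s = np.arange(0, 2 * 3600, 60, dtype=float)   # one sample per minute
rss_mb = 500.0 + 0.01 * times_s + rng.normal(0.0, 2.0, size=times_s.size)

growth = memory_growth_mb_per_hour(times_s, rss_mb)
print(f"growth: {growth:.1f} MB/hour")
```

Measurement noise means the fitted slope is never exactly zero, so in practice the check compares the slope against a small tolerance instead of testing for "positive."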
```python
import time
from collections import defaultdict

import numpy as np


class PipelineProfiler:
    """Instrument a multi-stage pipeline and collect latency stats."""

    def __init__(self, stages: list):
        self.stages = stages
        self.timings = defaultdict(list)
        self._current_run = {}

    def start(self, stage: str):
        self._current_run[stage] = time.perf_counter()

    def stop(self, stage: str):
        elapsed_ms = (time.perf_counter() - self._current_run[stage]) * 1000
        self.timings[stage].append(elapsed_ms)

    def _end_to_end(self) -> np.ndarray:
        """Per-run total latency, over runs where every stage was timed."""
        n_runs = min((len(self.timings[s]) for s in self.stages), default=0)
        return np.array([
            sum(self.timings[s][i] for s in self.stages)
            for i in range(n_runs)
        ])

    def report(self) -> dict:
        """Return P50/P95/P99 for each stage."""
        report = {}
        for stage in self.stages:
            data = np.array(self.timings[stage])
            if len(data) == 0:
                continue
            report[stage] = {
                "p50": round(float(np.percentile(data, 50)), 2),
                "p95": round(float(np.percentile(data, 95)), 2),
                "p99": round(float(np.percentile(data, 99)), 2),
                "mean": round(float(np.mean(data)), 2),
                "max": round(float(np.max(data)), 2),
                "n": len(data)
            }
        # Total end-to-end (skipped when no complete run exists,
        # since percentiles of an empty array are undefined)
        total = self._end_to_end()
        if total.size > 0:
            report["end_to_end"] = {
                "p50": round(float(np.percentile(total, 50)), 2),
                "p95": round(float(np.percentile(total, 95)), 2),
                "p99": round(float(np.percentile(total, 99)), 2),
            }
        return report

    def check_deadline(self, deadline_ms: float) -> dict:
        """Check what % of runs met the deadline."""
        total_times = self._end_to_end()
        n = len(total_times)
        met = int(np.sum(total_times <= deadline_ms))
        met_fraction = met / n if n else 0.0  # no data counts as a failure
        return {
            "deadline_ms": deadline_ms,
            "met_count": met,
            "total_count": n,
            "met_pct": round(met_fraction * 100, 1),
            "passed": met_fraction >= 0.99
        }


# Usage:
# profiler = PipelineProfiler(["capture", "preprocess", "inference", "postprocess"])
# for frame in frames:
#     profiler.start("capture"); img = camera.read(); profiler.stop("capture")
#     profiler.start("preprocess"); t = preprocess(img); profiler.stop("preprocess")
#     profiler.start("inference"); out = model(t); profiler.stop("inference")
#     profiler.start("postprocess"); res = nms(out); profiler.stop("postprocess")
# print(profiler.report())
```
Everything we've discussed so far — sensor testing, ML validation, performance profiling — is about making the robot work correctly. This chapter is about what happens when it doesn't. When a 25kg-payload robot arm swings into a person at full speed, the kinetic energy is enough to cause serious injury or death. Safety-critical testing is the discipline of systematically imagining every way this could happen and proving it can't.
This isn't academic risk management. This is the engineering that determines whether your robot is legally allowed to operate in a warehouse with people nearby. Get it wrong and someone gets hurt. Get the documentation wrong and your company gets shut down.
FMEA is a bottom-up analysis. You start with individual components and ask: "How can this fail? What happens when it does? How bad is it?" For each failure mode, you assign three scores:
Severity (S): How bad is the effect? 1 = cosmetic, 5 = minor injury, 8 = serious injury, 10 = death or catastrophic damage.
Occurrence (O): How likely is this failure? 1 = extremely unlikely (<1 in 10M), 5 = moderate (1 in 2000), 10 = near-certain.
Detection (D): How likely are we to catch this failure before it causes harm? 1 = always detected, 5 = sometimes detected, 10 = no detection mechanism.
$$\text{RPN} = S \times O \times D$$

The Risk Priority Number ranges from 1 to 1000. Items with RPN > 100 require immediate mitigation. Items with Severity ≥ 8 require mitigation regardless of RPN.
| Component | Failure Mode | Effect | S | O | D | RPN | Mitigation |
|---|---|---|---|---|---|---|---|
| Joint motor | Overcurrent / thermal runaway | Uncontrolled joint motion | 9 | 3 | 2 | 54 | Hardware current limiter, temperature fuse |
| Gripper | Unexpected release | Heavy object dropped on person | 8 | 4 | 3 | 96 | Mechanical lock + grip force monitor |
| Vision model | Misclassify person as object | Robot approaches person as pickup target | 10 | 2 | 4 | 80 | Redundant person detector (separate model) |
| E-stop circuit | Contact weld in relay | E-stop fails to de-energize | 10 | 2 | 5 | 100 | Redundant relay + periodic test |
| Controller | Software crash mid-motion | Arm continues last command at full speed | 9 | 3 | 3 | 81 | Hardware watchdog, timeout to safe state |
Fault tree analysis (FTA) is top-down — the opposite of FMEA. You start with an undesired top event (e.g., "Robot arm strikes a person") and decompose it into the combination of causes that could produce it. Causes are connected by logical gates:
AND gate: All inputs must occur for the output to occur. Example: "Person enters workspace AND detection system fails AND arm is in motion" — all three must be true simultaneously.
OR gate: Any single input is sufficient. Example: "Detection fails" can be caused by "camera obscured OR model misclassification OR processing timeout" — any one is enough.
The power of fault trees is quantitative analysis. If you know (or can estimate) the probability of each leaf event, you can compute the probability of the top event. AND gates multiply probabilities (making combinations less likely). OR gates combine them as 1 − ∏(1 − pᵢ), which is approximately the sum when the probabilities are small (making alternatives more likely). Both rules assume the input events are independent.
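The gate arithmetic is small enough to check by hand or in a few lines of code. A minimal sketch, assuming independent leaf events; the leaf probabilities are invented for illustration and carry no real-world meaning:

```python
from math import prod
from typing import Iterable


def and_gate(probabilities: Iterable[float]) -> float:
    """All inputs must occur: multiply the probabilities."""
    return prod(probabilities)


def or_gate(probabilities: Iterable[float]) -> float:
    """Any input suffices: one minus the chance that none occur."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-hour leaf probabilities.
camera_obscured = 1e-3
model_misclassification = 1e-4
processing_timeout = 1e-4
person_enters_workspace = 1e-2
arm_in_motion = 0.5

detection_fails = or_gate([camera_obscured, model_misclassification, processing_timeout])
arm_strikes_person = and_gate([person_enters_workspace, detection_fails, arm_in_motion])

print(f"P(detection fails)    = {detection_fails:.2e}")
print(f"P(arm strikes person) = {arm_strikes_person:.2e}")
```

For leaf probabilities this small, the OR result is almost exactly the plain sum, which is why the "OR gates add" shorthand works in practice; it breaks down once individual probabilities reach a few percent.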
| Standard | Scope | Key requirement |
|---|---|---|
| ISO 10218-1/2 | Industrial robot safety | Risk assessment for all identified hazards, safety-rated control functions |
| ISO/TS 15066 | Collaborative robots | Contact force/pressure limits per body region (e.g., 140N quasi-static for chest) |
| IEC 61508 | Functional safety (general) | Safety Integrity Levels (SIL 1-4) for safety functions based on risk |
| ISO 13849 | Safety control systems | Performance Level (PL a-e) for safety-related control systems |
| ISO 12100 | Risk assessment methodology | Systematic hazard identification, risk estimation, risk reduction |
The risk matrix maps severity (how bad) against probability (how likely) to classify each hazard into a risk level:
| Severity / Probability | Improbable | Remote | Occasional | Frequent |
|---|---|---|---|---|
| Catastrophic | High | Critical | Critical | Critical |
| Serious | Medium | High | Critical | Critical |
| Moderate | Low | Medium | High | Critical |
| Minor | Low | Low | Medium | High |
Critical and High risks must be mitigated before deployment. Medium risks require documented justification if accepted. Low risks are monitored but acceptable.
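Encoding the matrix as a lookup table keeps hazard reviews consistent and makes the deployment rule testable. A minimal sketch that transcribes the matrix and the policy sentence above; the function names are illustrative:

```python
PROBABILITIES = ("Improbable", "Remote", "Occasional", "Frequent")

# Rows transcribed from the risk matrix; columns follow PROBABILITIES.
RISK_MATRIX = {
    "Catastrophic": ("High", "Critical", "Critical", "Critical"),
    "Serious":      ("Medium", "High", "Critical", "Critical"),
    "Moderate":     ("Low", "Medium", "High", "Critical"),
    "Minor":        ("Low", "Low", "Medium", "High"),
}


def risk_level(severity: str, probability: str) -> str:
    """Look up the risk level for one hazard."""
    return RISK_MATRIX[severity][PROBABILITIES.index(probability)]


def required_action(level: str) -> str:
    """Map a risk level to the policy stated in the text."""
    if level in ("Critical", "High"):
        return "mitigate before deployment"
    if level == "Medium":
        return "document justification if accepted"
    return "monitor"


level = risk_level("Serious", "Remote")
print(level, "->", required_action(level))
```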
```python
from dataclasses import dataclass
from typing import List


@dataclass
class FMEAEntry:
    component: str
    failure_mode: str
    effect: str
    severity: int    # 1-10
    occurrence: int  # 1-10
    detection: int   # 1-10
    mitigation: str

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection

    @property
    def risk_level(self) -> str:
        if self.severity >= 8:
            return "CRITICAL"  # regardless of RPN
        if self.rpn > 100:
            return "HIGH"
        if self.rpn > 50:
            return "MEDIUM"
        return "LOW"


def build_robot_arm_fmea() -> List[FMEAEntry]:
    """Build standard FMEA for a 25kg-payload robot arm."""
    return [
        FMEAEntry(
            component="Joint Motor J1",
            failure_mode="Overcurrent / thermal runaway",
            effect="Uncontrolled joint motion at max torque",
            severity=9, occurrence=3, detection=2,
            mitigation="Hardware current limiter + thermal fuse"
        ),
        FMEAEntry(
            component="Gripper",
            failure_mode="Unexpected release of payload",
            effect="Heavy object falls on person below",
            severity=8, occurrence=4, detection=3,
            mitigation="Mechanical lock + grip force monitor"
        ),
        FMEAEntry(
            component="Perception Model",
            failure_mode="Misclassify person as pickup object",
            effect="Robot approaches person at full speed",
            severity=10, occurrence=2, detection=4,
            mitigation="Redundant person detector (separate model)"
        ),
        FMEAEntry(
            component="E-Stop Circuit",
            failure_mode="Contact weld in safety relay",
            effect="Cannot de-energize robot on command",
            severity=10, occurrence=2, detection=5,
            mitigation="Dual-channel redundant relay + daily test"
        ),
        FMEAEntry(
            component="Controller Software",
            failure_mode="Crash mid-trajectory execution",
            effect="Arm continues last velocity command indefinitely",
            severity=9, occurrence=3, detection=3,
            mitigation="Hardware watchdog timer, timeout-to-safe-state"
        ),
    ]


def print_fmea_report(entries: List[FMEAEntry]):
    """Print formatted FMEA report, sorted by RPN descending."""
    entries_sorted = sorted(entries, key=lambda e: e.rpn, reverse=True)
    for e in entries_sorted:
        print(f"[{e.risk_level:8}] RPN={e.rpn:4} | {e.component:20} | {e.failure_mode}")
        print(f"  Effect: {e.effect}")
        print(f"  S={e.severity} O={e.occurrence} D={e.detection}")
        print(f"  Mitigation: {e.mitigation}\n")
```
It's 6pm on a Friday. The CI pipeline is red. An engineer checks the failing test. It passed yesterday. No code changed. They re-run it. It passes. They shrug and merge their PR. On Monday, the robot fails the same way in the warehouse. The "flaky" test was trying to tell them something.
Flaky tests are the most corrosive force in a testing organization. Not because they're hard to fix — but because they teach engineers to ignore test failures. Once your team starts clicking "re-run" as a reflex instead of investigating, your CI pipeline is decoration.
Not all flaky tests are equal. Understanding the root cause determines the fix:
Timing-dependent flakes: The test assumes an operation completes within a hardcoded timeout. "Wait 2 seconds for the service to start." It works on a fast machine, fails on a loaded CI runner. The fix: use event-based waits (poll for readiness) instead of fixed timeouts. In robotics, this is the most common category — sensor initialization, model loading, and hardware communication all have variable startup times.
Order-dependent flakes: Test A leaves global state (a file, a database entry, a hardware register) that Test B depends on. Run A-then-B: passes. Run B alone: fails. The fix: every test must set up and tear down its own state. In robotics, this means every HIL test must reset the robot to a known joint configuration before starting.
Environment-dependent flakes: The test passes on one developer's machine but fails in CI. Different OS version, different GPU driver, different CUDA version. The fix: containerize the test environment. Pin every dependency version. Run the same Docker image locally and in CI.
Resource-dependent flakes: The test passes when the CI runner has 16GB RAM free, fails when other tests are running in parallel and only 4GB is available. GPU memory contention is especially common when multiple model tests share a GPU. The fix: resource isolation (one model test per GPU), or explicit resource checks before test execution.
Inherently stochastic flakes: The test involves a non-deterministic model. The same input produces slightly different outputs each run. Sometimes the output crosses the assertion threshold, sometimes it doesn't. The fix: statistical assertions (run N times, check that the pass rate exceeds a threshold) or seeded random number generators for reproducibility.
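One way to write the statistical assertion described for stochastic flakes is sketched below. The helper name `assert_pass_rate`, the stand-in trial `noisy_grasp_trial`, and the numbers (200 runs, 90% threshold, 97% simulated success) are illustrative assumptions, not values from a real suite. The seeded generator means a failing run can be replayed exactly.

```python
import random
from typing import Callable


def assert_pass_rate(
    trial: Callable[[random.Random], bool],
    n_runs: int = 200,
    min_pass_rate: float = 0.9,
    seed: int = 1234,
) -> float:
    """Run `trial` n_runs times with a seeded RNG; fail if the pass rate is too low."""
    rng = random.Random(seed)  # fixed seed: the same sequence on every invocation
    passes = sum(1 for _ in range(n_runs) if trial(rng))
    rate = passes / n_runs
    assert rate >= min_pass_rate, f"pass rate {rate:.1%} < {min_pass_rate:.0%}"
    return rate


def noisy_grasp_trial(rng: random.Random) -> bool:
    """Hypothetical stand-in for a stochastic model that succeeds ~97% of the time."""
    return rng.random() < 0.97


rate = assert_pass_rate(noisy_grasp_trial)
```

The assertion is on the aggregate rate rather than on any single output, so one unlucky sample no longer turns the whole test red.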
| Flake Type | Symptom | Root Cause | Fix Strategy |
|---|---|---|---|
| Timing | Fails under load, passes in isolation | Hardcoded timeouts / sleep() | Event-based waits, retry with backoff |
| Order | Fails when run alone, passes in suite | Shared mutable state | Test isolation, setup/teardown |
| Environment | Fails only in CI | Dependency version mismatch | Docker containerization |
| Resource | Fails under parallel execution | Memory/GPU contention | Resource isolation, quota checks |
| Stochastic | Random pass/fail on same input | Non-deterministic model output | Statistical bounds, seed pinning |
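For the timing row in particular, the event-based wait can be sketched as a small polling helper. The name `wait_until` and its parameters are assumptions made for illustration; the simulated service that becomes ready after about 0.2 seconds is a placeholder for a real readiness check such as a health endpoint or a sensor handshake.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.05,
) -> float:
    """Poll `is_ready` until it returns True and return the elapsed seconds.

    Raises TimeoutError if the condition is not met within timeout_s.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(poll_interval_s)


# Hypothetical service that becomes ready ~0.2 s from now.
_ready_at = time.monotonic() + 0.2
elapsed = wait_until(lambda: time.monotonic() >= _ready_at, timeout_s=2.0)
```

The timeout is still there, but it is an upper bound rather than the expected duration: a fast machine proceeds as soon as the condition holds, and a loaded CI runner gets the full window.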
How do you know a test is flaky? Run it N times in a clean environment. If it passes K out of N times, and K < N, it's flaky. But how many times is enough?
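You want to detect a flake rate of F (e.g., F = 0.05 means 5% failure rate) with confidence C (e.g., C = 0.95). A test with flake rate F passes N runs in a row with probability (1 − F)^N, so you need enough runs to push that probability down to 1 − C. The minimum number of runs N is:

$$N \ge \frac{\log(1 - C)}{\log(1 - F)}$$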
For F=0.05 and C=0.95: N ≥ log(0.05) / log(0.95) ≈ 59 runs. For F=0.01 and C=0.95: N ≥ 299 runs. The rarer the flake, the more runs you need to detect it.
In practice, run suspected flaky tests 50 times. If all 50 pass, you have 95% confidence the flake rate is below 5.8%. That's good enough for most decisions. If even one fails, investigate.
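The arithmetic in the last two paragraphs can be checked with a few lines of Python. The function names here are chosen for illustration; the second function simply inverts the run-count relation to ask what flake rate is still plausible after N clean runs.

```python
import math


def min_runs(flake_rate: float, confidence: float = 0.95) -> int:
    """Runs needed so a flake of this rate shows up at least once with this confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - flake_rate))


def max_flake_rate_if_all_pass(n_runs: int, confidence: float = 0.95) -> float:
    """Largest flake rate still consistent with n_runs consecutive passes."""
    return 1 - (1 - confidence) ** (1 / n_runs)


runs_for_5pct = min_runs(0.05)                # 59
runs_for_1pct = min_runs(0.01)                # 299
bound_after_50 = max_flake_rate_if_all_pass(50)  # about 0.058
```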
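When a test is identified as flaky, don't delete it and don't leave it blocking CI. Move it to a quarantine pipeline: the test keeps running so its pass/fail history accumulates, but a failure no longer blocks merges. Each quarantined test gets an owner and a fix deadline.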
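A minimal sketch of the merge-gating side of quarantine follows, assuming a simple in-memory registry. `QuarantineEntry`, `blocks_merge`, the test names, and the owner are all hypothetical; the 7-day deadline mirrors the fix target used elsewhere in this chapter. A real setup would typically keep this registry in version control or in the CI system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict


@dataclass
class QuarantineEntry:
    test_name: str
    owner: str
    entered: date
    fix_deadline_days: int = 7

    def overdue(self, today: date) -> bool:
        """True once the fix deadline has passed and the flake should be escalated."""
        return today > self.entered + timedelta(days=self.fix_deadline_days)


def blocks_merge(test_name: str, passed: bool, quarantine: Dict[str, QuarantineEntry]) -> bool:
    """A failing test blocks the merge unless it is currently quarantined."""
    return (not passed) and test_name not in quarantine


quarantine = {
    "test_gripper_timing": QuarantineEntry(
        "test_gripper_timing", owner="alice", entered=date(2024, 3, 1)
    )
}
flaky_blocks = blocks_merge("test_gripper_timing", passed=False, quarantine=quarantine)
real_blocks = blocks_merge("test_joint_limits", passed=False, quarantine=quarantine)
```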
Once a test is in quarantine, how do you find the root cause?
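Bisection: If the test was stable until recently, use git bisect to find the commit that introduced the flake. Run the test 20 times at each bisection point; the first commit where the pass rate drops below 100% is your culprit. Twenty clean runs only rule out flake rates above roughly 14% (at 95% confidence), so for rarer flakes raise the run count (about 59 runs for a 5% flake) before marking a commit good.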
Isolation: Run the test in a completely clean environment: fresh Docker container, no other tests running, no network access (if possible). If it's stable in isolation but flaky in CI, the cause is environmental — look for resource contention, shared state, or network timing.
Instrumentation: Add verbose logging to the test itself. Log timestamps, resource usage, system load, and all inputs at each step. After 50 runs, compare the logs from passing runs vs. failing runs. The difference is the cause.
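The bisection step can be automated with `git bisect run`, which treats exit code 0 as "good" and exit code 1 as "bad". The sketch below is one possible helper for that; the function name `repeated_run_verdict` and the default of 20 runs are assumptions taken from the paragraph above, not an established tool.

```python
import subprocess
import sys
from typing import List

GOOD, BAD = 0, 1  # exit codes that `git bisect run` interprets as good / bad


def repeated_run_verdict(test_cmd: List[str], n_runs: int = 20, timeout_s: int = 120) -> int:
    """Return GOOD only if the test command passes n_runs times in a row."""
    for _ in range(n_runs):
        try:
            result = subprocess.run(test_cmd, capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return BAD  # a hang counts as a failure
        if result.returncode != 0:
            return BAD  # one failure is enough to mark this commit bad
    return GOOD


# Quick self-check with trivial commands standing in for a real test.
always_passes = repeated_run_verdict([sys.executable, "-c", "pass"], n_runs=3)
always_fails = repeated_run_verdict([sys.executable, "-c", "raise SystemExit(1)"], n_runs=3)
```

To use it during a bisect, a small wrapper script would end with `sys.exit(repeated_run_verdict(sys.argv[1:]))` and be invoked as `git bisect run python <wrapper> pytest <test id>`, where the wrapper name and test id are yours to choose.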
Track these metrics weekly and present them to engineering leadership:
| Metric | Definition | Target | Action if exceeded |
|---|---|---|---|
| Flake rate | % of test runs that are flakes (pass on retry) | <2% | Investigate top 3 flakiest tests |
| Quarantine queue size | Number of tests in quarantine | <5% of total | Freeze features, fix flakes |
| Mean time to fix | Days from quarantine entry to fix | <7 days | Escalate unresolved flakes |
| Flake-to-fix ratio | New flakes per week / fixes per week | <1.0 | Queue is growing — increase fix velocity |
| Re-run rate | % of CI runs that were re-triggered manually | <5% | Engineers are ignoring failures |
Click any test to see its pass/fail history. Yellow = flaky (passed on retry).
```python
import math
import subprocess
from dataclasses import dataclass


@dataclass
class FlakeReport:
    test_name: str
    runs: int
    passes: int
    failures: int
    pass_rate: float
    confidence_interval: tuple
    is_flaky: bool
    recommendation: str


def min_runs_for_detection(
    flake_rate: float = 0.05, confidence: float = 0.95
) -> int:
    """Minimum runs to detect a flake at given rate and confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - flake_rate))


def detect_flake(
    test_command: str, test_name: str, n_runs: int = 50, timeout_sec: int = 120
) -> FlakeReport:
    """Run a test N times and compute flake statistics."""
    results = []  # True=pass, False=fail
    for i in range(n_runs):
        try:
            result = subprocess.run(
                test_command, shell=True, timeout=timeout_sec, capture_output=True
            )
            results.append(result.returncode == 0)
        except subprocess.TimeoutExpired:
            results.append(False)  # a hang counts as a failure

    passes = sum(results)
    failures = n_runs - passes
    pass_rate = passes / n_runs

    # Wilson score interval for binomial proportion
    z = 1.96  # 95% confidence
    denom = 1 + z**2 / n_runs
    center = (pass_rate + z**2 / (2 * n_runs)) / denom
    margin = z * math.sqrt(
        (pass_rate * (1 - pass_rate) + z**2 / (4 * n_runs)) / n_runs
    ) / denom
    ci = (max(0, center - margin), min(1, center + margin))

    # A test that never passes is broken, not flaky
    is_flaky = 0 < failures < n_runs

    # Recommendation based on pass rate
    if pass_rate == 1.0:
        rec = "STABLE: keep in main CI"
    elif pass_rate >= 0.98:
        rec = "MARGINAL: monitor for 1 week, quarantine if another failure"
    elif pass_rate >= 0.90:
        rec = "FLAKY: move to quarantine, assign owner, fix within 7 days"
    else:
        rec = "BROKEN: this is not a flake, it's a real failure. Fix immediately."

    return FlakeReport(
        test_name=test_name, runs=n_runs, passes=passes, failures=failures,
        pass_rate=pass_rate, confidence_interval=ci,
        is_flaky=is_flaky, recommendation=rec,
    )


# Example usage:
# report = detect_flake(
#     test_command="pytest tests/test_perception.py::test_model_accuracy -x",
#     test_name="test_model_accuracy",
#     n_runs=50
# )
# print(f"{report.test_name}: {report.pass_rate:.1%} pass rate")
# print(f"  95% CI: [{report.confidence_interval[0]:.1%}, {report.confidence_interval[1]:.1%}]")
# print(f"  Recommendation: {report.recommendation}")
```
A robot arm jerks unexpectedly during a pick operation. The operator hits the e-stop. You get a Slack message at 11pm. Now what?
Debugging a robotics system is qualitatively harder than debugging a web server. The system spans stochastic ML models, real-time control loops, physical hardware with wear and friction, and sensor inputs that are noisy by nature. A bug might be in the model, the controller, the mechanics, the environment, or — most often — in the interaction between two layers that each work fine in isolation.
This chapter gives you a systematic methodology that works under pressure, and the vocabulary to explain it in an interview.
Every debugging session follows the same structure, whether the bug is a flaky unit test or a robot that drops packages on Tuesdays.
5 Whys is a root cause analysis method from Toyota's production system. You ask "why?" repeatedly until you reach the systemic cause, not just the proximal trigger. Here's a robotics example:
| Level | Question | Answer |
|---|---|---|
| Why 1 | Why did the robot drop the package? | The gripper opened prematurely. |
| Why 2 | Why did the gripper open prematurely? | The grasp force reading showed zero, triggering the "object lost" handler. |
| Why 3 | Why did the force reading show zero? | The force/torque sensor returned NaN for 3 consecutive frames. |
| Why 4 | Why did the sensor return NaN? | The USB connection to the sensor dropped briefly under vibration. |
| Why 5 | Why does USB drop under vibration? | The connector isn't strain-relieved — it's a standard cable, not a locking connector rated for industrial vibration. |
The fix is not "handle NaN in the force reading code" (that's a band-aid). The fix is "replace the USB cable with a locking connector and add strain relief." The 5 Whys technique systematically prevents you from stopping at the symptom.
| Tool | What It Shows | When to Use |
|---|---|---|
| Structured logs (JSON) | Timestamped events with context, correlation IDs, severity | First step for any failure — reconstruct the timeline |
| Core dumps + GDB | Stack trace, memory state at crash time | Segfaults, unhandled exceptions in C++ control code |
| strace / dtrace | System calls: file I/O, network, device access | Permission errors, file descriptor leaks, device communication failures |
| Profiling (perf/py-spy) | CPU time per function, hot paths, GIL contention | Latency issues, control loop overruns, inference bottlenecks |
| Git bisect | Which exact commit introduced the regression | Performance degradation, behavioral changes with no obvious code cause |
| ROS bag replay | Exact sensor data replay for reproduction | Intermittent failures that depend on specific sensor input sequences |
In web software, you can usually reproduce any bug by replaying the same HTTP request. In robotics, reproduction requires the same physical environment, same sensor noise, same mechanical state. A joint that's been running for 3 hours has different friction characteristics than a cold joint. A camera in afternoon sunlight behaves differently than under warehouse LEDs.
This means your debugging infrastructure must record more than traditional systems: not just logs, but full sensor streams, joint state trajectories, and environmental snapshots. The cost of not recording is a bug you can never reproduce.
Design your logging system to answer this question: "Given a failure timestamp, can I reconstruct exactly what happened in every subsystem during the 30 seconds before?" If not, your logging is insufficient.
Key design decisions: (1) Use correlation IDs — a single ID that threads through camera capture, model inference, action generation, and motor commands for one control cycle. (2) Use ring buffers for high-frequency data (joint positions at 1kHz) — always keep the last N seconds, dump to disk on failure. (3) Use severity levels correctly: ERROR means "this will cause a visible failure," WARN means "this is degraded but functional," INFO means "this is normal operation."
```python
import json, time, uuid, logging
from dataclasses import dataclass


@dataclass
class CycleContext:
    cycle_id: str  # unique per control cycle
    timestamp: float
    robot_id: str
    task_id: str


class StructuredLogger:
    def __init__(self, component: str, sink=None):
        self.component = component
        self.sink = sink or logging.getLogger(component)

    def log(self, ctx: CycleContext, level: str, msg: str, **data):
        entry = {
            "ts": ctx.timestamp,
            "cycle": ctx.cycle_id,
            "robot": ctx.robot_id,
            "task": ctx.task_id,
            "component": self.component,
            "level": level,
            "msg": msg,
            **data,
        }
        # Emit at the requested severity so level-based filtering works
        self.sink.log(getattr(logging, level, logging.INFO), json.dumps(entry))


# Usage in control loop.
# `model` and `JOINT_LIMIT` are assumed to be defined elsewhere in the module.
logger = StructuredLogger("inverse_dynamics")


def control_step(frame, robot_id, task_id):
    ctx = CycleContext(
        cycle_id=str(uuid.uuid4())[:8],
        timestamp=time.time(),
        robot_id=robot_id,
        task_id=task_id,
    )

    # Log input
    logger.log(ctx, "INFO", "frame_received",
               shape=list(frame.shape), mean_px=float(frame.mean()))

    action = model.predict(frame)

    # Log output with validation (limits apply in both directions)
    if any(abs(a) > JOINT_LIMIT for a in action):
        logger.log(ctx, "ERROR", "joint_limit_exceeded",
                   action=action.tolist(), limits=[JOINT_LIMIT] * len(action))
    else:
        logger.log(ctx, "INFO", "action_computed",
                   action=action.tolist(),
                   latency_ms=(time.time() - ctx.timestamp) * 1000)
    return action
```
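The logger above covers correlation IDs; the ring buffer from design decision (2) can be sketched with `collections.deque`. The class name `JointStateRingBuffer`, the JSON-lines dump format, and the field names `ts` and `q` are assumptions for illustration. The defaults (1 kHz, 30 seconds) come from the figures quoted in this section.

```python
import json
import os
import tempfile
from collections import deque
from typing import Deque, List, Tuple


class JointStateRingBuffer:
    """Keep the last N seconds of joint samples in memory; dump them on failure."""

    def __init__(self, rate_hz: int = 1000, seconds: float = 30.0):
        self._buf: Deque[Tuple[float, List[float]]] = deque(maxlen=int(rate_hz * seconds))

    def __len__(self) -> int:
        return len(self._buf)

    def append(self, timestamp: float, positions: List[float]) -> None:
        self._buf.append((timestamp, positions))  # oldest sample is evicted when full

    def dump(self, path: str) -> int:
        """Write buffered samples as JSON lines and return how many were written."""
        samples = list(self._buf)  # snapshot the current contents
        with open(path, "w") as f:
            for ts, pos in samples:
                f.write(json.dumps({"ts": ts, "q": pos}) + "\n")
        return len(samples)


# Small demo: a 10-sample buffer fed 25 samples keeps only the last 10.
buf = JointStateRingBuffer(rate_hz=10, seconds=1.0)
for i in range(25):
    buf.append(float(i), [0.0, 0.1 * i])
with tempfile.TemporaryDirectory() as tmp:
    written = buf.dump(os.path.join(tmp, "joint_states.jsonl"))
```

In a real control loop the `dump` call would be wired to the failure trigger (e-stop, ERROR-level log, watchdog), so the disk only sees high-frequency data when it is actually needed.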
Can't reproduce in the lab: Some failures only happen after hours of operation (thermal drift), or only under specific environmental conditions (afternoon sun angle through a skylight). Solution: instrument the production robot to record full sensor streams on failure trigger, then replay those streams in the lab.
Log verbosity hides the signal: If every control cycle produces 20 log lines at 50Hz, you're generating 1000 lines per second. Finding the one ERROR in 30 minutes of logs means searching 1.8 million lines. Solution: log at INFO only for anomalous cycles (latency > threshold, action near limits). Log at DEBUG only when explicitly enabled for a specific investigation.
Red herrings: The robot drops an object. You see a network latency spike in the logs 2 seconds before the drop. Correlation is not causation. Verify by asking: "If I artificially inject that latency spike, does the robot drop the object?" If not, keep looking.
Emerging practice: feed structured logs from failures into an LLM with the system architecture as context. The model identifies temporal correlations across subsystems that humans miss in million-line log files. Early results from fleet operators show 40% faster time-to-root-cause. The risk: the LLM suggests plausible-sounding but wrong causes. Always verify its hypotheses with controlled experiments.
Click a symptom to walk through the diagnostic branches. Each path leads to a specific root cause category.
You just fixed a bug in staging. You deploy to production. The bug is still there — or worse, a new one appeared. The staging environment didn't actually match production. This is the environment parity problem, and in robotics it's ten times worse than in web software, because your "production environment" includes physical hardware, real sensors, and the laws of physics.
This chapter covers how to design test infrastructure that catches bugs where they're cheapest to fix, while managing the inevitable gaps between simulated and real environments.
A robotics test pipeline typically has four distinct environments, each trading off fidelity for speed and cost:
| Environment | Hardware | Sensors | Physics | Speed | Cost/Run |
|---|---|---|---|---|---|
| Dev (local) | None | Mock data | None | Seconds | ~$0 |
| CI (cloud) | GPU for inference | Recorded datasets | None | Minutes | ~$2 |
| Staging (sim) | GPU cluster | Simulated cameras | MuJoCo/Isaac | 10-60 min | ~$20 |
| Production (HIL) | Real robot | Real cameras | Real physics | Hours | ~$200 |
No two environments are identical. The skill is knowing exactly where each gap exists and having a mitigation for each one.
Gap 1: No real sensors in CI. CI tests run on recorded datasets, not live camera feeds. Mitigation: maintain a curated dataset that includes edge cases — low light, motion blur, partial occlusion, reflective surfaces. Update the dataset quarterly from real production captures.
Gap 2: Sim physics don't match real physics. MuJoCo's contact model is an approximation. Soft objects deform differently. Friction coefficients are estimates. Mitigation: domain randomization (vary friction, mass, damping by +/-30%), plus sim-to-real correlation tracking from Chapter 2.
Gap 3: No GPU in dev. The model runs on GPU in production but developers test on CPU (or not at all). Mitigation: CPU inference with a smaller model checkpoint for smoke tests. Full GPU inference in CI.
Gap 4: Staging has no wear. A fresh simulation doesn't model joint backlash that develops after 10,000 cycles. Mitigation: add wear models to simulation (joint play increases over simulated time), validated against real hardware measurements.
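The ±30% domain randomization mentioned under Gap 2 can be sketched as a per-episode parameter sampler. The function name `randomize_physics`, the uniform distribution, and the nominal values below are illustrative assumptions; a real setup would write the sampled values into the simulator's model before each episode.

```python
import random
from typing import Dict, Optional


def randomize_physics(
    nominal: Dict[str, float],
    spread: float = 0.30,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Scale each nominal parameter by a factor drawn uniformly from [1-spread, 1+spread]."""
    rng = rng or random.Random()
    return {
        name: value * rng.uniform(1 - spread, 1 + spread)
        for name, value in nominal.items()
    }


# Hypothetical nominal values for one object in the scene.
NOMINAL = {"friction": 0.8, "mass_kg": 2.0, "damping": 0.05}
episode_params = randomize_physics(NOMINAL, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps each randomized episode reproducible, which matters when a sim failure needs to be replayed.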
Test data for a robotics system includes: camera frames (RGB + depth), joint state trajectories, force/torque readings, task outcome labels, and environment metadata (lighting, object positions). Managing this data is itself an infrastructure challenge.
Synthetic data generation: Use simulation to generate unlimited test data with perfect ground truth labels. Vary object textures, lighting, camera noise. The risk: synthetic data that's too clean — real cameras have dust, scratches, and calibration drift that synthetic generators miss.
Fixture management: Canonical test fixtures are versioned recordings of specific scenarios — a successful pick, a near-miss, a collision avoidance. Store them in a dedicated test data repository with semantic versioning. When the recording format changes, migrate all fixtures. Never delete a fixture — only deprecate with a reason.
Data masking: If production data contains customer information (warehouse layout, inventory counts), mask or anonymize before using in test environments. This is often overlooked in robotics because "it's just sensor data" — but camera feeds can capture badges, screens, and documents.
```yaml
# docker-compose.test.yml
# Spins up a complete robotics test environment
# with mock sensors, model server, and database
version: "3.8"

services:
  # Mock sensor server: replays recorded camera data
  mock-sensors:
    build: ./test/mock-sensors
    volumes:
      - ./test/fixtures/camera:/data/camera:ro
      - ./test/fixtures/imu:/data/imu:ro
    environment:
      REPLAY_SPEED: 1.0
      LOOP: "true"
    ports:
      - "8001:8001"  # camera stream
      - "8002:8002"  # IMU stream

  # Model inference server
  model-server:
    image: robotics/dva-inference:test
    runtime: nvidia
    environment:
      MODEL_CHECKPOINT: "/models/dva-v2.3-test"
      MAX_BATCH_SIZE: 1
      DEVICE: "cuda:0"
    volumes:
      - model-cache:/models:ro
    ports:
      - "8010:8010"

  # Robot controller simulator
  sim-controller:
    build: ./test/sim-controller
    environment:
      JOINT_COUNT: 7
      CONTROL_RATE_HZ: 50
      PHYSICS_ENGINE: "mujoco"
    depends_on:
      - model-server

  # Test metrics database
  metrics-db:
    image: timescale/timescaledb:latest-pg15
    environment:
      POSTGRES_DB: "test_metrics"
      POSTGRES_PASSWORD: "testonly"
    ports:
      - "5432:5432"

  # Test runner: orchestrates test suites
  test-runner:
    build: ./test/runner
    depends_on:
      - mock-sensors
      - model-server
      - sim-controller
      - metrics-db
    environment:
      SENSOR_URL: "http://mock-sensors:8001"
      MODEL_URL: "http://model-server:8010"
      DB_URL: "postgresql://postgres:testonly@metrics-db/test_metrics"
      TEST_SUITE: "integration"
    command: pytest tests/ -v --tb=short

volumes:
  model-cache:
```
Environment drift: The staging Docker image was last rebuilt 3 weeks ago. Production has a new PyTorch version. The model loads differently. Solution: pin ALL dependency versions with lockfiles, rebuild images on every dependency change, and run a "version parity check" that compares staging vs. production package versions before every release.
Test data staleness: Your test fixtures are 6 months old. The warehouse now stocks a new product with reflective packaging that the camera handles differently. Solution: monthly data refresh from production captures, with automated drift detection (compare feature distributions of test data vs. recent production data).
Shared test environments: Two engineers run integration tests simultaneously on the same staging robot. Their tests interfere — one resets the scene while the other is mid-test. Solution: test environment leasing — a booking system that grants exclusive access to a hardware rig for a test window. CI jobs queue for the next available slot.
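The "version parity check" suggested for environment drift can be sketched as a diff over two `pip freeze`-style listings. The function names and the sample package lists are assumptions for illustration; how the listings are collected from staging and production is left open.

```python
from typing import Dict, Tuple


def parse_freeze(text: str) -> Dict[str, str]:
    """Parse `name==version` lines into a {name: version} mapping."""
    pkgs = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            pkgs[name.lower()] = version
    return pkgs


def parity_diff(
    staging: Dict[str, str], production: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {package: (staging_version, production_version)} for every mismatch."""
    diffs = {}
    for name in sorted(set(staging) | set(production)):
        s = staging.get(name, "<missing>")
        p = production.get(name, "<missing>")
        if s != p:
            diffs[name] = (s, p)
    return diffs


# Hypothetical listings from the two environments.
staging = parse_freeze("torch==2.1.0\nnumpy==1.26.4")
production = parse_freeze("torch==2.2.0\nnumpy==1.26.4\nopencv-python==4.9.0.80")
drift = parity_diff(staging, production)
```

A release gate would fail whenever `drift` is non-empty and print the mismatches, so the difference is visible before the deploy rather than after the incident.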
The next evolution: spin up a complete simulation environment per pull request, run the full test suite, tear it down. Kubernetes namespaces + GPU time-sharing make this feasible for simulation tests. For HIL, the frontier is digital twin synchronization — a sim environment that's continuously updated to match the exact state of a specific physical robot, so you can replay any failure in a perfectly matched simulation within minutes.
Four test environments. Click a gap (dashed red) to see the mitigation strategy for that parity gap.
Your robot fleet has been deployed for a week. Everything looks fine — until you notice that task success rate has drifted from 88% to 81% over the past three days. No code changed. No model update. What happened?
Without observability, you'd never notice the drift until a customer complains. With good observability, you see the trend on day one, investigate immediately, and discover that the warehouse installed new LED fixtures that shifted the color temperature of the camera feeds. Observability is the ability to understand what's happening inside your system by examining its external outputs — logs, metrics, and traces.
Logs are discrete events with context. "At 14:32:07, robot-03 dropped object during pick_task_42, cycle_id=a8f3c2." Logs answer the question "what happened?" They're essential for post-incident investigation but terrible for trend detection — you can't easily aggregate "how many drops happened this hour?" from raw log lines.
Metrics are numeric time series. "task_success_rate = 0.84 at 14:30." Metrics answer "how is the system performing right now?" and "is performance changing over time?" They're cheap to store, fast to query, and perfect for dashboards and alerts. But they lose detail — a metric tells you the success rate dropped, not why.
Traces follow a single request through all services. A trace for one pick operation might span: camera capture (2ms) → image preprocessing (5ms) → model inference (45ms) → action generation (3ms) → motor command (1ms) → execution (800ms). Traces answer "where is time being spent?" and "which service is the bottleneck?" Essential for latency debugging.
The three pillars complement each other. Metrics detect the problem (success rate dropped). Logs explain the problem (force sensor returned NaN). Traces localize the problem (inference latency spiked during the failing cycles). You need all three.
Prometheus (the industry-standard metrics system) defines four metric types. Each has a specific use in robotics:
| Type | What It Measures | Robotics Example |
|---|---|---|
| Counter | Monotonically increasing count | Total picks attempted, total errors, total e-stops triggered |
| Gauge | Current value (can go up or down) | Battery level, joint temperature, current task queue depth |
| Histogram | Distribution of values across buckets | Inference latency distribution (p50, p95, p99) |
| Summary | Pre-computed quantiles | Grasp force distribution across last 100 picks |
You can't instrument everything — metrics have storage cost and cardinality limits. Here are the five metrics you'd choose if limited to five (a real interview question):
| # | Metric | Type | Alert Threshold | Why This One |
|---|---|---|---|---|
| 1 | Task success rate (5-min rolling) | Gauge | < 80% | The north star. If this drops, something is wrong. |
| 2 | Inference latency p95 | Histogram | > 100ms | Latency above the control loop deadline causes missed cycles. |
| 3 | Motor current draw (max across joints) | Gauge | > 90% rated | Approaching current limit means mechanical stress or jam. |
| 4 | Error rate (errors per hour) | Counter | > 3/hour | Catches all failure types — sensor, model, controller, hardware. |
| 5 | Control loop overruns | Counter | > 0 | Any overrun means the system couldn't process fast enough — safety risk. |
Good alerting: you get paged when something needs human attention. Bad alerting: you get paged 15 times a day for things that resolve themselves, and eventually you ignore all alerts — including the real ones. This is alert fatigue, and it's the number one failure mode of monitoring systems.
Symptom-based alerts fire on user-visible impact: "task success rate below 80% for 5 minutes." These are high-signal — they always mean something the customer cares about is broken.
Cause-based alerts fire on internal system state: "GPU temperature above 85C." These are lower-signal — the GPU might be warm but still performing fine. Use cause-based alerts only when the symptom alert would fire too late (hardware damage, safety violation).
```python
from prometheus_client import (
    Counter, Gauge, Histogram, start_http_server
)
import time

# --- Metric definitions ---
PICKS_TOTAL = Counter(
    "robot_picks_total", "Total pick attempts",
    ["robot_id", "outcome"]  # labels: success/fail/abort
)
INFERENCE_LATENCY = Histogram(
    "robot_inference_latency_seconds", "Model inference latency",
    ["robot_id", "model_version"],
    buckets=[.01, .025, .05, .075, .1, .15, .2, .5]
)
JOINT_TEMP = Gauge(
    "robot_joint_temperature_celsius", "Current joint temperature",
    ["robot_id", "joint_idx"]
)
LOOP_OVERRUNS = Counter(
    "robot_control_loop_overruns_total", "Control cycles that exceeded deadline",
    ["robot_id"]
)
# Declared here; updating it requires a rolling-window computation (not shown)
SUCCESS_RATE = Gauge(
    "robot_task_success_rate", "Rolling 5-min task success rate",
    ["robot_id"]
)


# --- Instrumentation in the control loop ---
# `model`, `execute_action`, and `get_joint_temps` are assumed to be
# provided by the surrounding robot software.
def run_pick(robot_id, model_ver, frame):
    # Time the inference
    t0 = time.monotonic()
    action = model.predict(frame)
    dt = time.monotonic() - t0
    INFERENCE_LATENCY.labels(robot_id, model_ver).observe(dt)

    # Check control loop deadline
    if dt > 0.020:  # 20ms deadline for 50Hz loop
        LOOP_OVERRUNS.labels(robot_id).inc()

    # Execute and record outcome
    result = execute_action(action)
    PICKS_TOTAL.labels(robot_id, result.outcome).inc()

    # Update joint temperatures
    for i, temp in enumerate(get_joint_temps()):
        JOINT_TEMP.labels(robot_id, str(i)).set(temp)
    return result


# Start metrics endpoint on port 9090
start_http_server(9090)
```
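The `robot_task_success_rate` gauge above needs a rolling value from somewhere. A minimal sketch of that computation, together with the "below 80%" symptom check, is shown here using only the standard library. `RollingSuccessRate` and `should_page` are hypothetical names, and timestamps are passed in explicitly so the logic can be tested without waiting five real minutes.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingSuccessRate:
    """Task success rate over a trailing time window (default: 5 minutes)."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        # Drop outcomes that have aged out of the window
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def record(self, now: float, success: bool) -> None:
        self._events.append((now, success))
        self._evict(now)

    def rate(self, now: float) -> Optional[float]:
        """Return the success rate, or None if the window holds no outcomes."""
        self._evict(now)
        if not self._events:
            return None
        return sum(ok for _, ok in self._events) / len(self._events)


def should_page(rate: Optional[float], threshold: float = 0.80) -> bool:
    """Symptom-based check: page only when there is data and it is below threshold."""
    return rate is not None and rate < threshold


tracker = RollingSuccessRate()
for t in range(10):                      # 9 of 10 picks succeed
    tracker.record(float(t), success=(t != 0))
healthy_rate = tracker.rate(9.0)
for t in range(400, 410):                # later: only 7 of 10 succeed
    tracker.record(float(t), success=(t % 10 >= 3))
degraded_rate = tracker.rate(409.0)
```

In the instrumented loop, the value from `rate()` would be written to the gauge via `SUCCESS_RATE.labels(robot_id).set(...)`, and the paging decision would normally live in the alerting system rather than in the robot process.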
Alert fatigue: You set the inference latency alert at p95 > 50ms. But the model legitimately spikes to 55ms during complex scenes. The alert fires 10 times per day. Engineers start ignoring it. When latency actually spikes to 200ms (a real problem), nobody notices for 3 hours. Fix: raise the threshold to something that actually indicates a problem (p95 > 100ms), or use anomaly detection instead of static thresholds.
Missing the right metric: You monitor CPU, GPU, memory, disk. Everything looks fine. But task success rate drops because the camera's auto-exposure is fighting with the warehouse's flickering fluorescent lights — and you never instrumented camera exposure time. The lesson: instrument domain-specific metrics, not just infrastructure metrics.
Too much logging: You log every frame at full resolution for debugging. Storage costs hit $10k/month. You delete the logs. Three weeks later, a customer reports a failure that started two weeks ago. The logs are gone. Fix: tiered retention — full sensor data for 48 hours, downsampled data for 30 days, metrics and structured logs for 1 year.
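The tiered retention fix can be expressed as a small lookup from data age to the richest tier still kept. The tier names and the `retention_tier` function are illustrative assumptions; the age limits are the ones given in the paragraph above.

```python
from datetime import timedelta

# (maximum age, richest data still retained at that age)
RETENTION_TIERS = [
    (timedelta(hours=48), "full_sensor_data"),
    (timedelta(days=30), "downsampled"),
    (timedelta(days=365), "metrics_and_structured_logs"),
]


def retention_tier(age: timedelta) -> str:
    """Return the richest tier retained for data of the given age."""
    for max_age, tier in RETENTION_TIERS:
        if age <= max_age:
            return tier
    return "expired"


two_weeks_old = retention_tier(timedelta(days=14))
```

With this policy, the failure that started two weeks ago would still have downsampled sensor data available, instead of nothing.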
Static thresholds are brittle — what's "normal" changes with time of day, season, and workload. The frontier: ML-based anomaly detection on your metrics streams. Train a model on "normal" system behavior, alert when the current state is statistically unlikely. Tools like Facebook Prophet, Amazon Lookout, or custom autoencoders on metric time series. The risk: anomaly detectors can be noisy (every unusual but harmless event triggers), so combine them with symptom-based alerts as a filter.
Adjust alert thresholds with the sliders. Watch which alerts fire during a simulated 30-minute incident timeline. Red markers = alerts that would have fired.
Everything we've covered — unit tests, integration tests, simulation, HIL, observability, debugging — comes together in a single pipeline. This is the capstone: an interactive simulation of a complete robotics test pipeline from code commit to production deployment.
Inject faults at various points in the system. Watch the pipeline catch them — or fail to catch them, depending on your test coverage settings. See the cost of catching a bug late versus early. This is the economic argument for testing that you'll make in every interview and every budget meeting.
1. Set coverage and fidelity levels. 2. Inject a fault. 3. Click "Run Pipeline" to see where (if) it gets caught.
Notice how increasing test coverage catches faults earlier and cheaper. This is the core argument for investing in test infrastructure: the pipeline pays for itself many times over by catching $10,000 bugs for $10.
This is your reference chapter. Everything from the previous 14 chapters, compressed into interview-ready formats: cheat sheets, system design frameworks, coding drills, debugging scenarios, and flash cards you can review the night before your onsite.
| Concept | 30-Second Explanation | Key Metric/Tool | Interview Tip |
|---|---|---|---|
| Test Pyramid | More unit tests (fast, cheap) than integration tests than system tests. Each layer catches different fault classes. | Ratio: 70/20/10 | Draw the pyramid immediately. Shows you think structurally. |
| Contract Testing | Verify that two services agree on the interface between them — input/output shapes, types, value ranges. | pact, custom schemas | Mention this for ML pipelines where model output feeds controller input. |
| Sim-to-Real Gap | Simulation always differs from reality. Track correlation; if sim says 90% but real is 60%, your sim is lying. | Correlation coefficient | Show you understand that sim results need calibration, not just trust. |
| HIL Testing | Automated tests on physical hardware. Slow and expensive but catches what sim misses — friction, vibration, wear. | MTBF, safety interlock pass rate | Emphasize safety checks BEFORE task tests. |
| Flaky Test Quarantine | Tests that intermittently fail go to a quarantine queue — still run, but don't block merges. Investigated weekly. | Pass rate < 98% over 30 runs | Shows you manage test reliability, not just test count. |
| Error Budgets (SRE) | Allowed failure rate. 99.5% SLO = 0.5% budget. When budget exhausted, freeze features and fix reliability. | Budget burn rate | Connects testing to business impact — interviewers love this. |
| Incident Severity | SEV1: safety/data loss. SEV2: major feature broken. SEV3: degraded. SEV4: cosmetic. Drives response time. | MTTD, MTTR | Know the escalation thresholds — when to wake people up. |
| Blameless Postmortems | After incidents: timeline, root cause, action items. Focus on systems, not people. "What made this possible?" | Action item completion rate | Mention the 5 Whys technique and fishbone diagrams. |
| Regression Testing | Canonical test episodes from past failures. Every new model must pass all of them. Prevents known-bug recurrence. | Canonical suite pass rate | Explain that aggregate metrics can improve while specific cases regress. |
| Domain Randomization | Vary sim parameters (lighting, friction, noise) so the policy learns to be robust. More variation = better transfer. | Transfer success rate | Show you know WHY it works (exposure to distribution, not memorization). |
| Structured Logging | JSON logs with correlation IDs, timestamps, severity. Enables machine-parseable log analysis at scale. | Correlation ID coverage | Mention ring buffers for high-frequency data (joint positions at 1kHz). |
| Three Pillars | Logs (events), metrics (aggregates), traces (request flow). You need all three. Metrics detect, logs explain, traces localize. | Prometheus + Grafana + Jaeger | When asked "how do you monitor?", name the three pillars first. |
| Alert Fatigue | Too many alerts = all alerts ignored. Every alert needs a runbook. Target < 3 pages per on-call shift. | Pages per shift, false positive rate | Propose symptom-based alerts over cause-based alerts. |
| 5 Whys | Ask "why?" five times to reach root cause. "Gripper opened" → sensor NaN → USB drop → no strain relief → procurement didn't spec industrial connectors. | Root cause depth | Practice on every failure you've encountered. The depth impresses. |
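Key insight: A test is flaky if it both passes and fails on the same code, so its pass rate over N runs sits strictly between 0% and 100%. Track per-test pass history over a sliding window and flag tests whose pass rate falls below 98%.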
```python
from collections import defaultdict, deque


class FlakyDetector:
    def __init__(self, window=30, threshold=0.98):
        self.window = window
        self.threshold = threshold
        # One bounded history per test; old results fall off automatically
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, test_name: str, passed: bool):
        self.history[test_name].append(passed)

    def get_flaky(self) -> list:
        flaky = []
        for name, h in self.history.items():
            if len(h) < self.window:
                continue  # not enough runs yet to judge
            rate = sum(h) / len(h)
            # rate == 0.0 means consistently broken, not flaky
            if rate < self.threshold and rate > 0.0:
                flaky.append((name, rate))
        # Worst pass rate first
        return sorted(flaky, key=lambda x: x[1])
```
Key insight: The contract is the agreed-upon data format between two teams. Test that the producer's output matches the consumer's expectations in shape, type, and value range.
```python
import numpy as np


def validate_perception_output(output: dict) -> list:
    """Validate that perception output matches the contract the planner expects."""
    errors = []

    # Contract shape: (N, 7), i.e. N detections x [x, y, z, w, h, d, conf]
    detections = output.get("detections")
    if detections is None:
        errors.append("missing 'detections' key")
        return errors
    detections = np.asarray(detections)
    if detections.ndim != 2 or detections.shape[1] != 7:
        errors.append(f"shape {detections.shape}, expected (N,7)")
        return errors  # the column checks below would index out of bounds

    # NaN compares False against any bound, so check for it explicitly
    if not np.all(np.isfinite(detections)):
        errors.append("non-finite values (NaN/inf) in detections")

    # Confidence in [0, 1]
    confs = detections[:, 6]
    if np.any(confs < 0) or np.any(confs > 1):
        errors.append("confidence outside [0,1]")

    # Positions within workspace bounds (meters)
    pos = detections[:, :3]
    if np.any(np.abs(pos) > 3.0):
        errors.append("detection outside 3m workspace")

    return errors
```
Key insight: Error budget = (1 - SLO) × total operations. If budget is exhausted, freeze feature work.
```python
class ErrorBudget:
    def __init__(self, slo: float, window_hours: int, ops_per_hour: int):
        self.slo = slo                    # e.g. 0.995
        self.window = window_hours        # e.g. 720 (30 days)
        self.ops_per_hour = ops_per_hour  # e.g. 120 picks/hr
        self.failures = 0
        self.total_ops = 0

    @property
    def budget_total(self) -> float:
        # Failures allowed over the window: (1 - SLO) x expected operations
        return (1 - self.slo) * self.window * self.ops_per_hour

    @property
    def budget_remaining(self) -> float:
        return max(0, self.budget_total - self.failures)

    @property
    def budget_pct(self) -> float:
        # A 100% SLO leaves no budget; avoid dividing by zero
        if self.budget_total == 0:
            return 0.0
        return self.budget_remaining / self.budget_total * 100

    def record(self, success: bool):
        self.total_ops += 1
        if not success:
            self.failures += 1

    def should_freeze(self) -> bool:
        return self.budget_remaining <= 0
```
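To make the arithmetic concrete, here is a minimal sketch using the example values from the comments above (99.5% SLO, 720-hour window, 120 picks/hr). The `burn_rate` helper and the one-day failure count are illustrative assumptions, not part of the class above. Burn rate compares the observed failure rate with the rate the SLO allows, so anything above 1.0 means the budget runs out before the window ends.

```python
def budget_total(slo: float, window_hours: int, ops_per_hour: int) -> float:
    """Failures allowed over the whole window: (1 - SLO) x total operations."""
    return (1 - slo) * window_hours * ops_per_hour


def burn_rate(failures: int, total_ops: int, slo: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows.

    1.0 means failures arrive exactly at the sustainable pace; 2.0 means
    the budget would be gone halfway through the window.
    """
    if total_ops == 0:
        return 0.0
    return (failures / total_ops) / (1 - slo)


total = budget_total(slo=0.995, window_hours=720, ops_per_hour=120)
print(f"allowed failures in window: {total:.0f}")

# Hypothetical first day: 24 h * 120 picks/hr = 2880 picks, 36 of them failed.
rate = burn_rate(failures=36, total_ops=2880, slo=0.995)
print(f"burn rate: {rate:.1f}x")
```

With these numbers the window allows about 432 failed picks, and 36 failures in the first 2,880 picks is a burn rate of 2.5, which is a reason to act well before the budget reaches zero.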
Key insight: Compare new model's outputs to baseline on canonical episodes. Flag if any episode's score drops below the baseline by more than a threshold.
```python
def detect_regression(
    baseline_scores: dict,  # {episode_id: score}
    new_scores: dict,
    abs_threshold: float = 0.05,
    rel_threshold: float = 0.10,
) -> list:
    """Return list of regressed episodes."""
    regressions = []
    for ep_id, base in baseline_scores.items():
        new = new_scores.get(ep_id)
        if new is None:
            # An episode the new model never ran counts as a regression
            regressions.append((ep_id, "MISSING", base, 0))
            continue
        abs_drop = base - new
        rel_drop = abs_drop / max(base, 1e-6)  # guard against base == 0
        if abs_drop > abs_threshold or rel_drop > rel_threshold:
            regressions.append((ep_id, "REGRESSED", base, new))
    return regressions
```
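Why compare per episode rather than on the mean? A small illustration with made-up scores (the episode names and values are hypothetical): the mean goes up while one canonical episode collapses.

```python
from statistics import mean

# Hypothetical scores for four canonical episodes (higher is better).
baseline = {"ep_tote": 0.80, "ep_polybag": 0.82, "ep_glossy_box": 0.90, "ep_stacked": 0.78}
candidate = {"ep_tote": 0.92, "ep_polybag": 0.93, "ep_glossy_box": 0.60, "ep_stacked": 0.91}

mean_delta = mean(candidate.values()) - mean(baseline.values())

# Per-episode check: any absolute drop larger than 0.05 counts as a regression.
regressed = [ep for ep, base in baseline.items() if base - candidate[ep] > 0.05]

print(f"mean score change: {mean_delta:+.3f}")
print(f"regressed episodes: {regressed}")
```

A gate that only looked at the mean would ship this candidate; the per-episode gate blocks it on `ep_glossy_box`.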
Approach: What's different on Tuesdays? Check: (1) Is there a different shift/operator? (2) Does the warehouse receive restocking deliveries on Tuesdays — different lighting from open loading bay doors? (3) Is a weekly cron job running (model retrain, database vacuum, log rotation) that competes for CPU/GPU? (4) Check joint temperature logs — does the robot run longer on Tuesdays due to scheduling?
Key insight: Temporal patterns almost always correlate with environmental or operational changes, not code bugs. Map the failure to the calendar — what else happens on that schedule?
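A first step in mapping failures to the calendar can be as simple as bucketing failure timestamps by weekday. The sketch below assumes failures are available as ISO 8601 strings; the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def failures_by_weekday(timestamps: list[str]) -> Counter:
    """Count failures per weekday from ISO 8601 timestamp strings."""
    return Counter(
        WEEKDAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )


# Hypothetical failure timestamps pulled from an incident log.
failure_times = [
    "2024-03-05T09:14:00",  # Tuesday
    "2024-03-05T10:02:00",  # Tuesday
    "2024-03-07T15:30:00",  # Thursday
    "2024-03-12T09:47:00",  # Tuesday
    "2024-03-19T08:55:00",  # Tuesday
]
counts = failures_by_weekday(failure_times)
for day, n in counts.most_common():
    print(day, n)
```

Raw counts can mislead if the robot simply runs more picks on some days, so in practice you would divide by operations per weekday before drawing conclusions.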
Approach: Classic environment parity gap. (1) What does CI test with that production doesn't have? (Clean test fixtures vs. real damaged boxes.) (2) What does production have that CI doesn't? (Forklift vibration, variable lighting, concurrent robots.) (3) Capture a production failure — record sensor data, replay in CI. Does the CI test pass with production data? If yes, the test oracle is too loose. If the test fails, your CI data is unrepresentative.
Key insight: Bridge the gap by bringing production data INTO CI (recorded camera feeds from failures) and bringing CI rigor INTO production (run a subset of CI assertions on every production cycle).
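One way to share rigor in both directions is a single invariant-check function that CI runs against recorded production cycles and that the production loop also calls every cycle. This is a minimal sketch; the field names (`grip_force_n`, `place_reached_ts`, and so on) and the recorded values are hypothetical.

```python
import math


def check_grasp_cycle(cycle: dict) -> list[str]:
    """Invariant checks shared by CI replay tests and the production loop.

    Returns a list of violation messages; an empty list means the cycle
    passed every check.
    """
    problems = []
    if not math.isfinite(cycle["grip_force_n"]):
        problems.append("grip force is NaN or infinite")
    if not 0.0 <= cycle["confidence"] <= 1.0:
        problems.append("confidence outside [0, 1]")
    if cycle["release_ts"] < cycle["place_reached_ts"]:
        problems.append("gripper released before the place pose was reached")
    return problems


# In CI: replay a cycle recorded from a production failure.
recorded_failure = {
    "grip_force_n": 12.5,
    "confidence": 0.97,
    "place_reached_ts": 10.40,
    "release_ts": 10.20,  # gripper opened 200 ms early
}
violations = check_grasp_cycle(recorded_failure)
print(violations)

# In production: call the same function every cycle, but log and count the
# violations instead of raising, so monitoring never stops the robot.
```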
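Approach: The periodicity is the clue. What runs on a 30-second cycle? (1) Garbage collection in Python: collections are triggered by allocation counts, so a steady workload can make them look periodic. Check GC logs, then try disabling GC or tuning thresholds. (2) A health check endpoint being scraped by monitoring (if the health check triggers inference). (3) Thermal throttling: the GPU hits its temperature limit, the clock drops, then recovers. (4) Scheduled jobs: cron cannot fire more often than once a minute, so look for a systemd timer, a sleep loop, or a heartbeat script running every 30s.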
Key insight: Periodic performance issues are caused by periodic processes. Correlate the spike timestamps with every scheduled operation on the system. Use strace or perf to see what the process is doing during a spike.
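The correlation step can be sketched in a few lines: estimate the spike period from the timestamps, then look for scheduled jobs with a similar interval. The spike times and the job table below are invented for illustration, and the 10% tolerance is an arbitrary assumption.

```python
from statistics import median


def spike_period(spike_times: list[float]) -> float:
    """Median gap in seconds between consecutive spikes."""
    ordered = sorted(spike_times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two spikes to estimate a period")
    return median(gaps)


def matching_jobs(period_s: float, schedules: dict[str, float],
                  tol: float = 0.1) -> list[str]:
    """Jobs whose interval is within a fractional tolerance of the period."""
    return [name for name, interval in schedules.items()
            if abs(interval - period_s) <= tol * period_s]


# Hypothetical latency-spike timestamps (seconds) and known job intervals.
spikes = [12.1, 42.3, 71.9, 102.2, 132.0]
schedules = {"metrics_scrape": 15.0, "health_check": 30.0, "logrotate": 3600.0}

period = spike_period(spikes)
suspects = matching_jobs(period, schedules)
print(round(period, 1), suspects)
```

The median is used instead of the mean so that one missed or extra spike does not skew the estimate. A match only nominates a suspect; confirm it by pausing that job and watching whether the spikes stop.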
Approach: If code didn't change, data or environment did. (1) Did the training data pipeline change? Check data version hashes. (2) Did a dependency update silently (unpinned package)? Check pip freeze diff. (3) Did the physical environment change — new products, new shelving, new lighting? (4) Did the evaluation data change — maybe the test set was refreshed with harder cases? (5) Hardware — is the GPU running in a lower power mode?
Key insight: "No code changed" doesn't mean "nothing changed." Check data, dependencies, environment, and hardware. If data versions are tracked (for example in git or DVC), bisect across them the same way you would use git bisect on code.
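The dependency check is easy to automate if you keep a `pip freeze` snapshot from the last known-good deployment. The sketch below diffs two snapshots; the package pins are illustrative, and it only handles `name==version` lines, so editable installs and `name @ url` entries are ignored.

```python
def parse_freeze(text: str) -> dict[str, str]:
    """Parse `pip freeze` output lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            pins[name.lower()] = version
    return pins


def diff_freeze(old: str, new: str) -> dict[str, tuple]:
    """Map package -> (old_version, new_version) for every pin that differs."""
    a, b = parse_freeze(old), parse_freeze(new)
    return {name: (a.get(name), b.get(name))
            for name in sorted(a.keys() | b.keys())
            if a.get(name) != b.get(name)}


# Illustrative snapshots: last known-good deployment vs. current environment.
last_good = "numpy==1.26.4\nopencv-python==4.9.0.80\ntorch==2.2.0\n"
current = "numpy==1.26.4\nopencv-python==4.10.0.84\ntorch==2.2.0\nscipy==1.13.0\n"

changes = diff_freeze(last_good, current)
print(changes)
```

A `None` on either side of a pair means the package was added or removed rather than upgraded.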
| Priority | Resource | Why |
|---|---|---|
| 1 | Site Reliability Engineering (Beyer, Jones, Petoff, Murphy — Google) | The SRE bible. Error budgets, SLOs, incident response. Free at sre.google/sre-book |
| 2 | Release It! (Michael Nygard) | Stability patterns: circuit breakers, bulkheads, timeouts. Essential for production robotics. |
| 3 | Accelerate (Forsgren, Humble, Kim) | Data-driven evidence that CI/CD, monitoring, and testing culture predict team performance. |
| 4 | "ML Test Score" (Breck et al., 2017) | Google's rubric for ML system readiness. 28 tests across data, model, infra, monitoring. |
| 5 | "Sim-to-Real Transfer in Robotics" (Zhao et al., 2020) | Comprehensive survey of domain randomization, adaptation, and transfer techniques. |
| 6 | pytest documentation (docs.pytest.org) | The standard Python test framework. Know fixtures, parametrize, markers, conftest.py. |
| 7 | Chaos Engineering (Principles of Chaos, principlesofchaos.org) | Proactively inject failures to find weaknesses. Netflix pioneered this; it applies to robot fleets. |
| 8 | ISO 10218 and ISO/TS 15066 standards summaries | Know the safety standards framework even if you haven't read every clause. |
| Aspect | Classical Approach | Modern/ML-Era Approach | When to Use Which |
|---|---|---|---|
| Test oracle | Exact expected output: assert y == 42 | Statistical bounds: assert 0.8 < accuracy < 0.95 | Classical for deterministic paths; modern for ML outputs |
| Test data | Handcrafted fixtures, small dataset | Synthetic generation + production sampling, large scale | Handcrafted for edge cases; synthetic for coverage |
| Regression | Binary pass/fail on golden outputs | Statistical comparison to baseline distribution | Binary for safety-critical; statistical for ML performance |
| Environment | Identical staging/prod (containerized) | Sim → HIL → canary → prod ladder | Classical for software-only; ladder for cyber-physical |
| Flaky tests | Bug — fix immediately | Expected — quarantine, track rate, investigate | Fix if deterministic path; quarantine if inherently stochastic |
| Coverage metric | Line/branch coverage % | Scenario coverage: task × object × condition matrix | Line coverage for utils; scenario coverage for integration |
| CI speed | Minutes (all tests every PR) | Tiered: fast per-PR, heavy nightly | Fast tier for development velocity; heavy tier for confidence |
| Debugging | Breakpoints, step-through debugger | Sensor replay, log correlation, embedding analysis | Breakpoints for logic bugs; replay for physical-interaction bugs |
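The scenario coverage row can be made concrete with a small sketch: enumerate the task × object × condition matrix and report which cells no test has exercised. The axis values and the executed list are hypothetical.

```python
from itertools import product


def scenario_coverage(tasks, objects, conditions, executed):
    """Return (fraction of the matrix covered, sorted untested scenarios)."""
    all_scenarios = set(product(tasks, objects, conditions))
    missing = sorted(all_scenarios - set(executed))
    covered = 1 - len(missing) / len(all_scenarios)
    return covered, missing


# Hypothetical axes and the scenarios that existing tests exercise.
tasks = ["pick", "place"]
objects = ["box", "polybag"]
conditions = ["bright", "dim"]
executed = [
    ("pick", "box", "bright"), ("pick", "box", "dim"),
    ("pick", "polybag", "bright"), ("place", "box", "bright"),
    ("place", "box", "dim"), ("place", "polybag", "bright"),
]

covered, missing = scenario_coverage(tasks, objects, conditions, executed)
print(f"{covered:.0%} covered; untested: {missing}")
```

Here line coverage could be 100% while the matrix shows that no test handles a polybag in dim light, which is exactly the kind of gap scenario coverage is meant to expose.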