Lead QE & Validation Engineer

Chapter 0: The Role

The team builds the DVA (Direct Video Action) system: a causal video model that predicts future frames, paired with an inverse dynamics model that converts those frames into robot joint commands. The robot sees the world through cameras, sends actuator commands to 25kg-payload arms, and must run autonomously for 160+ minutes in a live warehouse.

A Lead QE and Validation Engineer here owns the entire confidence chain between "code merged" and "robot ships to customer." That means writing tests, building infrastructure, running hardware, defining safety acceptance criteria, and debugging failures when they happen at 2am in a Decathlon warehouse.

What makes this role unique: You're not testing deterministic software. You're testing a foundation model that generates video as its primary output, a learned inverse dynamics mapping, a real-time control loop, and physical hardware — all simultaneously. Every layer is stochastic or mechanically variable.

System Architecture You're Testing

DVA System Data Flow


The role sits at the intersection of four disciplines: ML engineering (model validation), robotics (hardware testing), software QE (CI/CD, test automation), and safety engineering (ISO compliance, risk assessment). You don't need to be expert-level in all four — but you need to speak fluently with the people who are.

Which component of DVA is the most critical to test continuously, because it converts model outputs into physical robot motion?

(a) The causal video model

(b) The inverse dynamics model: it's the direct interface between learned representations and physical actuators

(c) The camera driver

Chapter 1: DVA Testing Challenges

Testing a video generation model that controls a robot is qualitatively different from testing traditional software. Traditional software: given input X, expect output Y. DVA: given camera frame X, the model generates a distribution over plausible futures — any of which might be reasonable. "Correct" output isn't a single value.

This creates four categories of novel testing challenges that you'll face daily on this team.

1. Non-Deterministic Outputs

Run the same input through the video model twice: you get two different future-frame sequences, both plausible. Traditional test assertions (assert output == expected) are meaningless. You need distributional correctness — does the output fall within the plausible distribution? Is the object in the right location? Is the predicted motion physically consistent?

Testing strategy: Use perceptual metrics and physical constraints as oracle tests. A predicted frame should preserve object identity (detected by an object detector), maintain physically plausible motion (velocity below mechanical limits), and produce actions within joint range. These are testable even when the exact pixel values vary.
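The constraint-based oracle described above can be sketched as follows. This is a minimal illustration, not the team's implementation: `JointLimits`, `check_action_sequence`, and every numeric limit are hypothetical, and it assumes the inverse dynamics output has already been decoded into one joint's angle trajectory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointLimits:
    lower_rad: float
    upper_rad: float
    max_speed_rad_s: float


def check_action_sequence(positions, limits, dt_s):
    """Return a list of violations for one joint's trajectory.

    positions: joint angles (rad) sampled every dt_s seconds.
    An empty list means the sequence passed every oracle check.
    """
    violations = []
    # Range check: every commanded angle must be inside the joint range.
    for i, q in enumerate(positions):
        if not (limits.lower_rad <= q <= limits.upper_rad):
            violations.append(f"step {i}: position {q:.3f} rad outside joint range")
    # Speed check: finite-difference velocity must respect the mechanical limit.
    for i in range(1, len(positions)):
        speed = abs(positions[i] - positions[i - 1]) / dt_s
        if speed > limits.max_speed_rad_s:
            violations.append(f"step {i}: speed {speed:.3f} rad/s above limit")
    return violations


# Illustrative limits only; real values come from the arm's datasheet.
limits = JointLimits(lower_rad=-1.5, upper_rad=1.5, max_speed_rad_s=2.0)
smooth = [0.00, 0.05, 0.10, 0.15]
jumpy = [0.00, 0.05, 0.90, 1.60]
print(check_action_sequence(smooth, limits, dt_s=0.05))
print(check_action_sequence(jumpy, limits, dt_s=0.05))
```

The check is deterministic even though the model is not: any sampled trajectory either satisfies the limits or it doesn't, so the assertion stays meaningful across runs.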

2. Visual Quality Metrics

You need automated metrics for "does this predicted video look right?" Common metrics: SSIM (Structural Similarity Index) for frame quality, FVD (Fréchet Video Distance) for video distribution quality, optical flow consistency for temporal coherence. But these have limits: a model can score well on SSIM while still predicting the robot arm going the wrong direction.

3. Temporal Consistency

A predicted video frame at t+1 must be consistent with the frame at t. Objects can't teleport. The robot arm's predicted trajectory must be smooth — acceleration limits apply. Test this with inter-frame difference analysis: flag any prediction where pixel deltas between frames exceed what's physically achievable given the robot's speed limits.
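A minimal sketch of the "objects can't teleport" check. It assumes an upstream detector or tracker has already turned each predicted frame into an object centroid in metres (the raw pixel-delta version needs camera calibration to convert pixels to distance); `teleport_frames` and the numbers used are hypothetical.

```python
import math


def teleport_frames(centroids_m, fps, max_speed_m_s):
    """Return indices of frames where the tracked centroid moved farther
    than the speed limit allows since the previous frame."""
    max_step_m = max_speed_m_s / fps  # farthest physically possible move per frame
    flagged = []
    for i in range(1, len(centroids_m)):
        step_m = math.dist(centroids_m[i - 1], centroids_m[i])
        if step_m > max_step_m:
            flagged.append(i)
    return flagged


# Illustrative track: the object jumps 26 cm between frames 2 and 3.
track = [(0.00, 0.00), (0.02, 0.00), (0.04, 0.01), (0.30, 0.01), (0.32, 0.01)]
print(teleport_frames(track, fps=30, max_speed_m_s=1.5))
```

The same pattern extends to acceleration limits by differencing the per-frame steps once more.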

4. Physics Plausibility

The video model must predict a world that obeys physics. An item shouldn't float after the robot releases it. Containers don't compress. Cloth doesn't phase through the table. Build a suite of physics constraint checkers that run on predicted frames: bounding box continuity, gravity consistency, contact plausibility.

Failure Mode Taxonomy


Why does assert predicted_frame == expected_frame fail as a test for DVA?

(a) The frames are too large to compare quickly

(b) It requires too much GPU memory

(c) DVA's video model is non-deterministic: many plausible futures exist for a given input, and any of them may be correct

Chapter 2: Sim-to-Real Validation

Running every test on the physical robot is slow, expensive, and dangerous during early development. Simulation lets you run thousands of tests per day at low cost. But simulation is always an approximation, and the gap between sim and real is one of the most insidious sources of bugs in robotics.

Simulation environments used in industrial robotics QE: MuJoCo (physics-accurate, widely used for manipulation research), Isaac Sim (NVIDIA, GPU-accelerated, photorealistic rendering), Gazebo (ROS-native, popular for integration testing). Each has different fidelity trade-offs.

The domain gap: Real cameras have motion blur, lens distortion, varying lighting, and sensor noise. Real robot joints have backlash, friction, and thermal drift. Real objects have surface texture, weight, and deformation. Simulation has none of these unless you explicitly model them — and even then, the model is wrong.

Sim-to-Real Transfer Testing Methodology

Step 1: Domain randomization in simulation. Randomly vary lighting intensity, camera noise level, object mass, surface friction, and joint damping during test runs. If the policy succeeds across all variants, it's more likely to handle real-world variation.

Step 2: Correlation tracking. For every metric you measure in sim, measure the same metric on hardware. Build a chart: sim success rate vs. real success rate across many policy versions. If the correlation breaks — sim says 90% but real says 40% — your sim model has drifted from reality. This requires constant maintenance.

Step 3: Failure mode audit. When the robot fails on real hardware, replay the failure scenario in sim. If sim doesn't reproduce the failure, the domain gap is causing the discrepancy. Add the missing physics (contact model, sensor noise model) until sim matches real failure patterns.
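Step 1 above can be sketched as a seeded sampler that produces one parameter set per simulated episode. The parameter names follow the list in Step 1; the ranges, `SimVariant`, and `sample_variants` are assumptions for illustration, and how the values are applied depends on the simulator in use.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimVariant:
    light_intensity: float   # multiplier on nominal lighting
    camera_noise_std: float  # std-dev of additive pixel noise
    object_mass_kg: float
    surface_friction: float
    joint_damping: float


# Hypothetical ranges; real ones should come from measurements on the rig.
RANGES = {
    "light_intensity": (0.5, 1.5),
    "camera_noise_std": (0.0, 8.0),
    "object_mass_kg": (0.2, 25.0),
    "surface_friction": (0.3, 1.2),
    "joint_damping": (0.05, 0.5),
}


def sample_variants(n, seed):
    """Draw n randomized variants; the same seed always yields the same list."""
    rng = random.Random(seed)
    return [
        SimVariant(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})
        for _ in range(n)
    ]


variants = sample_variants(50, seed=1234)
print(variants[0])
```

Seeding matters here: when a variant exposes a failure, the same seed reproduces the exact conditions for the replay in Step 3.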

Domain Gap Visualization


The correlation chart is your most important artifact: When you walk into a review meeting and say "sim says 85% — shipping," the first question should be "what's your sim-to-real correlation coefficient?" If you can't answer that, you don't actually know what 85% means on hardware.
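A minimal sketch of the coefficient behind that chart: the Pearson correlation between sim and real success rates, with one pair per policy version. The function name and the sample rates are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two matched samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Illustrative data: the newest policy looks best in sim but regresses on hardware.
sim_rates = [0.62, 0.70, 0.78, 0.85, 0.90]
real_rates = [0.50, 0.58, 0.66, 0.71, 0.55]
print(round(pearson_r(sim_rates[:4], real_rates[:4]), 3))
print(round(pearson_r(sim_rates, real_rates), 3))
```

With only a handful of policy versions the coefficient is noisy, so it is best read as a trend alongside the raw scatter plot.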

A new DVA policy achieves 90% task success in MuJoCo simulation but only 55% on the physical robot. What should you investigate first?

(a) Measure the specific ways the real robot environment differs from the sim (lighting variation, joint friction, contact model), then add those to the domain randomization suite to close the gap

(b) Retrain the model with more robot data

(c) Switch to a different simulation platform

Chapter 3: Hardware-in-the-Loop Testing

Hardware-in-the-Loop (HIL) testing means running automated test suites on real physical hardware — the actual robot, actual cameras, actual sensors — not a simulation. It's slow (each test run takes minutes), expensive (arm wear, power cost), and harder to automate. But it's the only way to catch the failures that sim misses.

HIL testing at a robotics company happens in a test rig: a controlled physical setup with fixed camera positions, a standardized set of test objects (boxes, bottles, garments in known conditions), and automated test orchestration that resets the environment between runs. The reset is often the hardest engineering problem — you need a way to automatically return the scene to a known initial state after each test.

Sensor Calibration Validation

Camera calibration drifts. Extrinsic parameters (camera pose relative to robot base) shift when anyone bumps the rig. Run a calibration health check before every test session: project known 3D points (ArUco markers at fixed positions) and check that the reprojection error is within threshold. If it isn't, recalibrate before trusting any test results.
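A sketch of the reprojection check under simplifying assumptions: an ideal pinhole camera with no lens distortion, marker positions known in the robot base frame, and marker pixel locations already detected. `project`, `rms_reprojection_error`, and the intrinsics below are illustrative, not a real rig's values.

```python
import math


def project(point_base, R, t, fx, fy, cx, cy):
    """Project a 3D point (base frame, metres) to pixels with a pinhole model.

    R (3x3 nested lists) and t (length 3) are the extrinsics mapping
    base-frame points into the camera frame: p_cam = R @ p_base + t.
    """
    x, y, z = (
        sum(R[r][c] * point_base[c] for c in range(3)) + t[r] for r in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def rms_reprojection_error(points_base, observed_px, R, t, fx, fy, cx, cy):
    """Root-mean-square pixel distance between projected and observed markers."""
    squared = []
    for point, (u_obs, v_obs) in zip(points_base, observed_px):
        u, v = project(point, R, t, fx, fy, cx, cy)
        squared.append((u - u_obs) ** 2 + (v - v_obs) ** 2)
    return math.sqrt(sum(squared) / len(squared))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
CAMERA = dict(R=IDENTITY, t=[0.0, 0.0, 0.0], fx=600.0, fy=600.0, cx=320.0, cy=240.0)
markers = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 2.0)]
# Pretend every detection is offset by (3, 4) pixels from where it should be.
detections = [(u + 3.0, v + 4.0) for u, v in (project(m, **CAMERA) for m in markers)]
print(rms_reprojection_error(markers, detections, **CAMERA))
```

A session gate would compare the returned value against a pixel threshold chosen for the rig and refuse to start when it is exceeded.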

Actuator Validation

Joint encoders can drift. Gear wear changes the effective transmission ratio. Test actuator health with a known-trajectory test: command the robot to execute a precise geometric path (a square, a circle), capture joint encoder data, and verify that actual positions match commanded positions within tolerance. Run this nightly.
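The known-trajectory check can be sketched as a per-joint comparison of commanded and measured positions. This assumes both streams are already time-aligned and sampled at the same instants; the joint names, tolerance, and function names are hypothetical.

```python
def worst_tracking_errors(commanded, measured):
    """Per-joint maximum absolute error in radians.

    Both arguments map joint name -> list of position samples.
    """
    errors = {}
    for joint, cmd in commanded.items():
        meas = measured[joint]
        if len(cmd) != len(meas):
            raise ValueError(f"{joint}: sample count mismatch")
        errors[joint] = max(abs(c - m) for c, m in zip(cmd, meas))
    return errors


def failing_joints(commanded, measured, tolerance_rad):
    """Joints whose worst-case error exceeds the tolerance, with that error."""
    worst = worst_tracking_errors(commanded, measured)
    return {joint: err for joint, err in worst.items() if err > tolerance_rad}


commanded = {"shoulder": [0.0, 0.1, 0.2], "elbow": [0.0, 0.5, 1.0]}
measured = {"shoulder": [0.0, 0.101, 0.199], "elbow": [0.0, 0.47, 1.0]}
print(failing_joints(commanded, measured, tolerance_rad=0.01))
```

Logging the worst error per joint every night, not just pass or fail, is what makes slow drift from gear wear visible before it crosses the tolerance.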

Safety Interlocks

Every HIL test must start with safety interlock validation. The emergency stop must bring the arm to zero velocity within its required response time (<200ms for a Category 0 stop). The payload limit enforcer must reject commands that would exceed the arm's rated capacity. The workspace boundary enforcer must prevent motion outside the safe zone. Test these before running any task-level tests: if safety systems are broken, task tests are dangerous.

Safety interlock check: E-stop, payload limit, workspace boundary

↓ pass required to continue

Calibration health check: camera extrinsics, joint encoder zeroing

↓ pass required to continue

Actuator validation: known trajectory test, position error check

↓ pass required to continue

Task-level HIL tests: decanting, container breakdown, returns
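The gating order above can be sketched as a runner that stops at the first failed stage, so task-level tests never start on a rig whose safety or calibration checks failed. The stage callables here are stand-ins; real ones would talk to the hardware.

```python
def run_gated(stages):
    """Run (name, check) pairs in order; stop after the first failing check.

    Each check is a zero-argument callable returning True on pass.
    Returns the list of (name, passed) for the stages that actually ran.
    """
    results = []
    for name, check in stages:
        passed = bool(check())
        results.append((name, passed))
        if not passed:
            break  # later stages are unsafe or meaningless, so skip them
    return results


# Stand-in checks for illustration: calibration fails, so nothing after it runs.
stages = [
    ("safety interlocks", lambda: True),
    ("calibration health", lambda: False),
    ("actuator validation", lambda: True),
    ("task-level HIL", lambda: True),
]
for name, passed in run_gated(stages):
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```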

Failure recovery protocols: When the robot fails a task mid-execution in a HIL test, you need an automated recovery that gets the hardware back to a safe, known state without human intervention. Design this first. A robot stuck in an unknown state in the middle of the night breaks your entire nightly test run.

Why must safety interlock tests always run before task-level HIL tests?

(a) Safety tests are faster, so it's more efficient

(b) Regulatory requirements mandate this order

(c) If safety systems are broken, running task tests creates physical danger: a broken e-stop means you can't stop the robot if a task test produces an unsafe motion

Chapter 4: ML Model Validation

The team updates the DVA model regularly: new pre-training data, new robot demos, architectural tweaks. Each update must be validated before deployment. But how do you test a foundation model that controls a robot?

The core challenge: the model is not a function with a single correct output. It's a policy — a mapping from observations to action distributions. Validating it requires running it on representative tasks and measuring outcomes, not just checking outputs against a golden dataset.

Key Metrics

Metric	Definition	Target
Task success rate	% of trials where the robot completes the full task without intervention	≥ 85% for production
Task completion time	Wall-clock time per successful trial	Within 20% of human operator
Error recovery rate	% of recoverable failures where robot self-corrects	≥ 70%
Intervention frequency	Human interventions per hour of operation	< 0.5 per hour for production
MTBF	Mean time between failures requiring intervention	≥ 120 min for industrial deployment

A/B Testing Model Updates

When a new model version is ready, run a parallel test: the old model and the new model each execute the same task suite on the same hardware setup, alternating trials. Measure success rate, completion time, and intervention frequency for both. Compute the difference and test for statistical significance.

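Statistical significance in robotics: Task trials are expensive; each one takes minutes and wears hardware. You often can't run 1000 trials for a clean p-value. Design your A/B test for the minimum number of trials that can detect a 10-percentage-point change in success rate at 80% power. For a baseline of 85%, detecting a drop to 75% with a two-sided test at α = 0.05 takes roughly 250 trials per arm under the normal approximation. If the hardware budget only allows a few dozen trials per arm, compute the smallest change that budget can detect and state it up front. Know this number going into the test.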
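A sketch of the power calculation behind the trial count, using the textbook normal-approximation formula for comparing two proportions. The function name and the defaults (two-sided α = 0.05, 80% power) are assumptions for illustration. The answer moves a lot with the baseline and the effect size, so it is worth computing for each test plan instead of remembering one number.

```python
import math
from statistics import NormalDist


def trials_per_arm(p_baseline, p_alternative, alpha=0.05, power=0.80):
    """Trials needed in EACH arm to detect p_baseline vs p_alternative.

    Normal approximation, two-sided test, equal arm sizes:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    """
    if p_baseline == p_alternative:
        raise ValueError("the two success rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_alternative * (1 - p_alternative)
    effect = p_baseline - p_alternative
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Compare a 10-point drop and a 20-point drop from an 85% baseline.
print(trials_per_arm(0.85, 0.75))
print(trials_per_arm(0.85, 0.65))
```

Because the required count scales with one over the squared effect size, halving the detectable change roughly quadruples the hardware time.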

Regression test suite structure: keep a library of canonical test episodes — recordings of past failures and edge cases that were fixed. Every model update must pass all canonical episodes. This prevents regressions on known failure modes even when aggregate metrics improve.

Embedding Drift Detection

The video model produces internal embeddings. If a new model version produces embeddings that differ drastically from the previous version (measured by cosine distance on held-out frames), something fundamental changed — investigate before hardware testing. This is a fast, cheap smoke test that catches major architectural regressions before they waste robot time.
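A minimal sketch of the drift check. It assumes both model versions emit embeddings of the same dimensionality for the same held-out frames, in the same order. If an architectural change alters the embedding space itself, a large distance is expected and the check only says "look closer". The function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(old_embeddings, new_embeddings):
    """Return (mean, max) cosine distance over matched held-out frames."""
    if len(old_embeddings) != len(new_embeddings) or not old_embeddings:
        raise ValueError("need the same non-empty set of frames for both models")
    distances = [cosine_distance(o, n) for o, n in zip(old_embeddings, new_embeddings)]
    return sum(distances) / len(distances), max(distances)


# Toy 2-D embeddings for illustration; real ones have many more dimensions.
old = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new = [[1.0, 0.1], [0.0, 1.0], [1.0, -1.0]]
mean_drift, worst_drift = embedding_drift(old, new)
print(round(mean_drift, 3), round(worst_drift, 3))
```

Reporting the maximum alongside the mean helps: a single frame whose embedding flipped can hide inside a small average.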

Your new DVA model achieves a higher success rate than the previous version on aggregate, but fails two of the canonical test episodes that the old model passed. What do you do?

(a) Ship the new model: aggregate metrics are what matter

(b) Retrain from scratch with more data

(c) Block the release: canonical episode regressions indicate the model has forgotten how to handle previously-solved failure modes, which is unacceptable regardless of aggregate improvement

Chapter 5: Safety & Compliance

A 25kg-payload robot arm can deliver lethal force. Industrial robot safety is not optional — it is a legal and moral prerequisite for deployment. As lead QE, you own the safety validation plan and are the person who signs off that the system is safe to operate.

Relevant Standards

Standard	Scope	Key requirement for the team
ISO 10218-1/2	Industrial robot safety — robot manufacturer and integrator	Risk assessment, safety-rated control system, separation distances
ISO/TS 15066	Collaborative robots: human-robot co-existence	Contact force limits per body region, e.g. chest: 140 N quasi-static, 280 N transient
IEC 62061	Safety of machinery — functional safety	Safety function integrity levels (SIL) for e-stop and speed monitoring
ISO 13849	Safety-related control system performance	Performance Level (PL) for safety-rated monitoring functions

Risk Assessment Process

Every new task deployment requires a formal risk assessment. Steps: (1) enumerate hazardous scenarios (arm collision, dropped payload, unexpected motion, tool ejection), (2) assess likelihood and severity for each, (3) define risk reduction measures (guards, speed limits, force monitoring), (4) verify measures with tests, (5) document and sign off.

The 160-minute endurance question: If the robot runs autonomously for 160 minutes, how do you validate it's safe for the entire duration? Thermal drift — joints heat up, performance changes. Fatigue accumulation — repeated motions stress mechanical components. Battery/power supply degradation. You need a full-duration test with continuous safety monitoring, not just a spot check.

Safety Test Suite

E-stop response time: Command stop, measure time to zero velocity. Requirement: <200ms for category 0 stop (immediate de-energize). Test at start of every test session.

Force/torque limit testing: Command approach to a force plate. Verify that contact force never exceeds rated limits before the safety monitor intervenes. Test at multiple approach speeds and angles.

Workspace boundary enforcement: Command positions outside the safe zone. Verify the controller rejects the command without executing any unsafe motion.

Endurance testing: Run the full task loop continuously for 3+ hours. Monitor joint temperatures, motor current draw, error rates, and task success rate over time. Flag any degradation trend.
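One way to make "flag any degradation trend" concrete is a least-squares slope over the final part of the run. This is a sketch under assumptions: the window length, the slope threshold, and the function names are illustrative, and a real monitor would apply it to each logged signal (joint temperature, motor current, success rate) with its own threshold.

```python
def least_squares_slope(times_min, values):
    """Slope of the best-fit line, in value units per minute."""
    n = len(times_min)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matched samples")
    mean_t = sum(times_min) / n
    mean_v = sum(values) / n
    spread = sum((t - mean_t) ** 2 for t in times_min)
    if spread == 0:
        raise ValueError("all samples share one timestamp")
    return sum((t - mean_t) * (v - mean_v) for t, v in zip(times_min, values)) / spread


def final_window_is_stable(times_min, values, window_min, max_abs_slope):
    """True when the trend over the last window_min minutes is flat enough."""
    cutoff = times_min[-1] - window_min
    recent = [(t, v) for t, v in zip(times_min, values) if t >= cutoff]
    slope = least_squares_slope([t for t, _ in recent], [v for _, v in recent])
    return abs(slope) <= max_abs_slope


# Illustrative joint temperatures (deg C), sampled every 30 minutes for 3 hours.
times = [0, 30, 60, 90, 120, 150, 180]
plateau = [35.0, 41.0, 44.0, 45.0, 45.2, 45.1, 45.2]
runaway = [35.0, 41.0, 44.0, 47.0, 50.0, 53.0, 56.0]
print(final_window_is_stable(times, plateau, window_min=60, max_abs_slope=0.01))
print(final_window_is_stable(times, runaway, window_min=60, max_abs_slope=0.01))
```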

ISO/TS 15066 specifies contact force limits for collaborative robots. Why does this matter for DVA even in a non-collaborative (fenced) deployment?

(a) It doesn't: ISO/TS 15066 only applies to collaborative robots operating without fences

(b) Warehouse environments have people entering work zones for restocking, maintenance, and fault recovery; those moments create de-facto collaborative scenarios that require force limits

(c) It's required by Decathlon's supplier contracts

Chapter 6: Test Automation & CI/CD

Every code merge should trigger automated tests. But "automated tests" for robotics means something more complex than a standard software pipeline — some tests run on GPUs in cloud VMs, some run in simulation on cloud instances, and some run on physical hardware in a test lab. Orchestrating all three without manual intervention is the infrastructure challenge of this role.

Pipeline Structure

Git push / PR merge: triggers CI pipeline

↓ ~5 min

Unit tests + linting: CPU-only, fast. Model loading, data preprocessing, config validation.

↓ ~15 min

Integration tests (GPU): model inference on held-out frames, metric computation, embedding drift check.

↓ ~45 min

Simulation tests: Isaac Sim task suite, domain randomization, 50 episodes per task.

↓ nightly only

HIL regression tests: physical robot, canonical episodes, safety validation.

Flaky Test Detection

Non-deterministic ML systems produce non-deterministic test results. A test that passes 19 times out of 20 is not a passing test — it's a flaky test. Track test pass rates over time. Any test with a pass rate below 98% over 30 runs is either flaky (infrastructure issue) or indicating a real intermittent bug. Never let flaky tests accumulate — they erode trust in the entire CI system.

The quarantine queue: Maintain a separate "quarantine" CI pipeline for tests that have been flagged as flaky. They still run, but they don't block merges. Engineers are assigned to investigate quarantined tests weekly. A test exits quarantine when it achieves 100% pass rate over 50 consecutive runs, or when the underlying bug is fixed.
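The entry and exit rules can be sketched over a per-test history of booleans (oldest first, True meaning pass). The thresholds mirror the numbers in this section; the function names are hypothetical, and the "underlying bug is fixed" exit path is a human decision that code cannot capture.

```python
def pass_rate(history):
    """Fraction of passing runs in a list of booleans."""
    return sum(history) / len(history)


def should_quarantine(history, window=30, min_pass_rate=0.98):
    """Quarantine once the last `window` runs fall below the pass-rate floor."""
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough data to judge yet
    return pass_rate(recent) < min_pass_rate


def can_exit_quarantine(history, required_streak=50):
    """Exit only after `required_streak` consecutive passes."""
    return len(history) >= required_streak and all(history[-required_streak:])


one_failure_in_30 = [True] * 29 + [False]
print(should_quarantine(one_failure_in_30))
print(can_exit_quarantine([False] + [True] * 50))
```

With a 30-run window and a 98% floor, a single failure (29/30, about 96.7%) is already enough to trigger quarantine, so in practice the rule means "all 30 must pass".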

Infrastructure

Docker containers for reproducible test environments — every simulation test runs in the same container image, with pinned versions of MuJoCo, PyTorch, and robot controller firmware. Kubernetes for cloud simulation test orchestration — scale up 20 parallel sim instances for a big policy evaluation, scale to zero overnight.

Test result dashboards: Every test run writes structured metrics to a time-series database (InfluxDB or similar). Dashboard (Grafana) shows: per-task success rate trend, inference latency percentiles, simulation vs. hardware correlation, flaky test rate. This is what you present to engineering leadership every week.

A simulation test passes on 18 out of 20 consecutive runs. How should you treat it in CI?

(a) It's basically passing: 90% is good enough for non-deterministic systems

(b) Disable the test: unreliable tests shouldn't run

(c) Move it to the quarantine pipeline: it still runs but doesn't block merges, and assign someone to investigate the 2 failures before returning it to main CI

Chapter 7: Performance Benchmarking

DVA must run in real time. The leapfrog inference scheme means the video model has a budget — its predictions must extend far enough into the future to cover its own compute time. If inference is too slow, the robot waits, control becomes jerky, and task success rate drops. Latency is a functional requirement, not a nice-to-have.

Latency Testing

Measure end-to-end inference time: from camera frame arriving to robot joint command being sent. This includes image preprocessing, video model forward pass, inverse dynamics model forward pass, and command serialization. Target: the total must be less than the leapfrog prediction horizon. Measure at P50, P90, and P99 — tail latency matters because P99 spikes cause visible robot hesitation.
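A minimal sketch of the percentile report, using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values). `latency_report` and the sample data are illustrative; the samples are assumed to be end-to-end times in milliseconds, measured from frame arrival to command sent.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(samples_ms, budget_ms):
    """P50/P90/P99 plus a gate on the tail against the leapfrog budget."""
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}
    report["within_budget"] = report["p99"] <= budget_ms
    return report


# Illustrative run: mostly ~95 ms, with two slow outliers.
samples = [95] * 97 + [100, 140, 180]
print(latency_report(samples, budget_ms=150))
```

In this toy data the median looks comfortable while the maximum blows the budget, which is why the gate is placed on the tail and not on the average.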

Latency Budget Simulator (interactive; default values)

Video model: 80 ms

Inverse dynamics: 15 ms

Leapfrog budget: 150 ms

Memory and GPU Profiling

Long-context visual memory (hundreds of frames) is DVA's differentiator — but it comes with a memory cost. Profile peak GPU memory usage during a full task episode. Measure how memory grows as context length grows. Define the maximum context length that fits within GPU memory budget at deployment, and add a regression test that fails if memory usage exceeds that threshold.

Throughput Testing

Frames processed per second under sustained load. The robot generates frames continuously — test that the model can keep up at the camera's frame rate without building a queue backlog. Simulate 30 minutes of operation: does the inference queue grow over time (bad) or stay bounded (good)?

The benchmark suite artifact: Your performance benchmarks must be reproducible and version-controlled. Pin the GPU model (A100 vs H100 matters), pin the batch size, pin the context length, pin the input resolution. If you change any of these, it's a new benchmark run, not a comparison to historical results.

Why is P99 latency more important than P50 latency for robot control?

(a) P50 is always lower so it's less useful as a metric

(b) Tail-latency spikes (the worst 1% of inference times) cause the robot to pause mid-motion, creating visible hesitation and potentially unsafe behavior; the robot must meet timing requirements consistently, not just on average

(c) P99 is easier to improve through optimization

Chapter 8: The Test Matrix

This is the artifact you'd present in an interview. Each cell holds the specific tests that live at that intersection of system component and test level.

Robotics Test Matrix (interactive)

Color key: green = owned; yellow = shared with ML team; orange = shared with hardware team.

How to present this in the interview: "I build the test matrix first — rows are system components, columns are test levels. Each cell gets owner, frequency, and pass/fail criteria. This makes coverage gaps visible and creates a shared language between QE, ML, and hardware teams. For the team specifically, the hardest cells are end-to-end system tests and HIL — those require close coordination with robotics engineers."

Chapter 9: Debugging Robot Failures

The robot fails a task. You have a recording: camera frames, joint positions, model inputs and outputs, predicted video frames. Your job is to determine: which of the four system layers failed?

Layer 1: Video prediction. Wrong future frames generated; the model misunderstands the scene.

Layer 2: Inverse dynamics. Correct video predicted, but wrong actions extracted from it.

Layer 3: Controller. Correct actions computed, wrong execution by low-level controller.

Layer 4: Hardware. Correct commands sent, physical failure in actuator, sensor, or gripper.

Debugging Decision Tree

Step 1: Watch the predicted video frames. Do they show the robot successfully completing the task? If YES → the video model did its job. If NO → Layer 1 failure. Investigate: was the scene out-of-distribution? Did the model hallucinate? Check embedding similarity to training data.

Step 2: If predicted video is correct, extract the inverse dynamics outputs. Do the actions it computed correspond to the motion shown in the predicted frames? Simulate the actions in MuJoCo: does simulated motion match the predicted video? If NO → Layer 2 failure. Check inverse dynamics model on similar frame pairs from training data.

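Step 3: If actions are correct, compare commanded joint positions to executed joint positions from encoders. Match? → the controller tracked its commands, so suspect Layer 4 (hardware): gripper, sensor, or calibration. Don't match? → start with Layer 3 (controller), then rule out an actuator fault that prevents tracking.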
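The decision tree above can be written down as a small function, which is handy for tagging failures consistently in a triage log. The three boolean inputs are the answers a human (or an automated check) gives at each step; the names are hypothetical, and the result is a first suspect, not a verdict.

```python
from enum import Enum


class Layer(Enum):
    VIDEO_PREDICTION = 1
    INVERSE_DYNAMICS = 2
    CONTROLLER = 3
    HARDWARE = 4


def attribute_failure(predicted_video_shows_success,
                      actions_match_predicted_motion,
                      encoders_match_commands):
    """Walk the three steps in order and return the first suspect layer."""
    if not predicted_video_shows_success:
        return Layer.VIDEO_PREDICTION   # Step 1: the model imagined the wrong future
    if not actions_match_predicted_motion:
        return Layer.INVERSE_DYNAMICS   # Step 2: right video, wrong actions
    if not encoders_match_commands:
        return Layer.CONTROLLER         # Step 3: commands were not tracked
    return Layer.HARDWARE               # everything upstream checks out


# Correct predicted video, but the extracted actions don't match it.
print(attribute_failure(True, False, True))
```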

The interpretability bonus of DVA: Because the system generates predicted video, you can directly inspect Layer 1. This is impossible with traditional VLAs — you can't look at an action vector and determine if it reflects correct world understanding. DVA's predicted frames are a free debugging tool.

Failure Attribution Tool


The predicted video frames look correct (they show the robot successfully picking the item) but the robot's actual arm moves in a completely different direction. Which layer failed?

(a) Layer 1: the video model

(b) Layer 2: the inverse dynamics model failed to extract correct actions from the (correct) predicted video

(c) Layer 4: the hardware

Chapter 10: 20 Interview Questions

These are the questions a sharp robotics engineering team will ask a Lead QE candidate. Each answer is calibrated for a staff-level role — concrete, specific, and showing ownership.

Interview mindset: Every answer should demonstrate that you've thought about the second-order problem. Don't just say what you'd do — say why that approach, what failure mode it prevents, and how you'd know it's working.

Core DVA Testing

Q1. How would you test a video generation model that controls a robot?

Model answer: "I'd build a three-tier oracle system. Tier 1: perceptual metrics — SSIM and FVD on held-out video clips measure frame quality, but aren't sufficient alone. Tier 2: physics constraint checkers — automated tests that verify predicted frames don't violate object permanence, gravity, or kinematic limits. These are deterministic and fast. Tier 3: downstream task success — run the full DVA pipeline in simulation and measure task success rate. The first two tiers catch regressions cheaply; the third catches emergent failures the earlier tiers miss. I'd gate on all three in CI."

Q2. How do you handle non-determinism in your test suite?

Model answer: "I separate tests by determinism class. Preprocessing, config loading, calibration computation — these are deterministic, so exact-match assertions work. Model inference — non-deterministic, so I use statistical bounds with multiple runs, or fix the random seed for reproducibility checks. End-to-end task success — I track pass rates over N trials, not individual pass/fail. I also maintain a flaky test quarantine: any test with pass rate below 98% over 30 consecutive runs gets moved out of the blocking pipeline and assigned for investigation."

Q3. How do you write regression tests for a foundation model updated weekly?

Model answer: "I maintain a canonical episode library: recordings of past failures and edge cases that were resolved. Every model update must pass all canonical episodes. I also track an embedding drift metric: if the new model's embeddings on held-out frames are more than a configurable cosine distance from the previous model, it triggers an investigation flag before hardware testing. Finally, I run A/B task trials on hardware, alternating between models, with the trial count per arm set by a power calculation, and compute the success rate difference with a 95% confidence interval. Regression is blocked if the interval includes a drop of more than 5 percentage points."

Sim-to-Real

Q4. Simulation: 90% success, hardware: 60%. Walk me through your debugging process.

Model answer: "First, I check the sim-to-real correlation history. Is this a new gap or has it been drifting? If new, something changed: either a model update or the hardware environment. I replay the failed hardware episodes in sim: if sim reproduces them, the model has a real bug. If sim doesn't reproduce them, the domain gap is the issue. I then enumerate sim-to-real gaps: lighting (run a brightness sweep in sim), contact model (run different friction values), sensor noise (add Gaussian noise to camera input). I identify which of these, when added to sim, reproduces the hardware failure pattern. That tells me what to fix."

Q5. How do you maintain your simulation's fidelity over time?

Model answer: "I treat sim-to-real correlation as a living metric. After every 10 hardware test sessions, I run a matched comparison: the same 20 episodes in sim and on hardware. I track the correlation coefficient over time on a dashboard. If it drops below 0.85, I halt sim-gated deployments and schedule a sim audit. The audit checks: physics parameters (friction, inertia), camera model (distortion, noise), and environment setup (lighting, object placement tolerance). I also version-control the sim environment config alongside the model — so if hardware changes (robot arm wear, new camera mount), the sim config gets a corresponding update."

Hardware & Safety

Q6. Design a safety validation plan for a 25kg-payload robot operating autonomously for 3+ hours.

Model answer: "Three layers. Layer 1, pre-deployment static validation: risk assessment per ISO 10218, workspace hazard analysis, safety function SIL determination per IEC 62061, e-stop response time test (<200ms), force limit test against a load cell. Layer 2, continuous runtime safety monitoring: joint torque, velocity, and temperature monitored at 1kHz with hardware safety limits independent of the DVA software stack. Layer 3, endurance validation: run the full task loop for 4 hours (33% headroom over the 3-hour requirement) with a test engineer monitoring remotely. Log joint temperature trends, motor current, task success rate, and error frequency across the session. The robot passes only if all metrics are stable in the final hour; no degradation trend is acceptable."

Q7. How do you test that the e-stop actually works?

Model answer: "E-stop testing is automated and runs at the start of every HIL test session. The procedure: command the robot to execute a known slow motion, trigger the e-stop via software API, and record time-to-zero-velocity from joint encoder data. Threshold: <200ms for Category 0 (de-energize), <500ms for Category 1 (controlled stop). I also test the hardware e-stop button separately — a human physically pushes the button during a controlled motion. I log response times over weeks to detect degradation trends. Any response time over threshold blocks the test session and pages the hardware team."
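The time-to-zero-velocity measurement in that answer can be sketched as follows, assuming the encoder log has already been reduced to time-sorted (timestamp in ms, joint speed) pairs. The function name, the speed threshold, and the sample data are hypothetical, and the result can be no finer than the logging period.

```python
def time_to_stop_ms(samples, trigger_ms, speed_eps=1e-3):
    """Milliseconds from the e-stop trigger until speed stays below speed_eps.

    samples: list of (timestamp_ms, speed) sorted by time.
    Returns None if the log ends while the joint is still moving
    or contains no samples at or after the trigger.
    """
    after = [(t, v) for t, v in samples if t >= trigger_ms]
    if not after:
        return None
    last_moving = -1  # index of the last sample still above the threshold
    for i, (_, speed) in enumerate(after):
        if abs(speed) >= speed_eps:
            last_moving = i
    if last_moving == len(after) - 1:
        return None  # never observed at rest
    return after[last_moving + 1][0] - trigger_ms


# Illustrative log: e-stop triggered at t = 100 ms, arm at rest by t = 250 ms.
log = [(0, 0.5), (50, 0.5), (100, 0.5), (150, 0.3), (200, 0.1), (250, 0.0), (300, 0.0)]
stop_ms = time_to_stop_ms(log, trigger_ms=100)
print(stop_ms, stop_ms is not None and stop_ms < 200)
```

Requiring the speed to stay below the threshold for the rest of the log, not just touch zero once, avoids passing a joint that bounces or creeps after the first stop.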

Q8. The robot runs fine in the first hour but degrades in hour 2. What do you investigate?

Model answer: "Thermal effects are the primary suspect — I check joint temperature telemetry first. If joints are heating up, the transmission efficiency changes, and the inverse dynamics model's action predictions become increasingly wrong because they were calibrated at cold-start parameters. Secondary: GPU thermal throttling — if the inference GPU hits its thermal limit, inference latency spikes, and the leapfrog timing budget gets violated. Third: memory pressure — if the long-context memory grows unbounded over the session, eventually it causes memory pressure and slower inference. I'd log all three throughout the endurance run and correlate degradation onset with changes in any of these signals."

CI/CD & Metrics

Q9. What does your nightly test report include?

Model answer: "Five sections. (1) Build health: did all stages complete, were there timeouts? (2) Unit/integration metrics: pass rate, new failures vs. known failures, flaky test quarantine count. (3) Simulation summary: per-task success rate with delta from previous night, any tasks that regressed >5%. (4) Performance snapshot: P50/P99 inference latency, GPU memory peak, throughput under sustained load — with trend lines over the past two weeks. (5) Action items: any test that needs investigation, with a recommended owner. The report is auto-generated, posted to Slack, and the CI dashboard retains 90 days of history."

Q10. How do you prevent a flaky test from blocking your team?

Model answer: "I never let flaky tests live in the blocking pipeline. Any test that fails without a consistent repro gets immediately moved to a quarantine pipeline. It still runs daily — I want the data — but it doesn't gate merges. A weekly flaky-test review meeting goes through the quarantine list: each flaky test gets assigned an owner and a deadline. If it's been in quarantine for more than two weeks without investigation, the test gets deleted and we file a ticket to rewrite it properly. This is non-negotiable — accumulated flaky tests are a death spiral where engineers stop trusting CI."

Q11. How do you measure if your test suite is actually providing value?

Model answer: "I track two metrics. First, defect escape rate: how many bugs found in production (customer site) vs. bugs caught in CI. This should trend toward zero escapes. Second, mean time to detection: when a regression is introduced, how many hours until CI flags it? I target under 24 hours for critical paths. I also do periodic fault injection: intentionally introduce a known bug and verify CI catches it within the expected time window. If CI misses the injected fault, that's a coverage gap to fix."

Leadership & Process

Q12. You're the first QE hire at a robotics startup. What do you build in week one?

Model answer: "Week one is purely observational and inventory-taking. I shadow engineers running the robot, attend every test session, and ask about every failure they've seen. I collect: what tests already exist (even informal ones), what the current release process looks like, where engineers spend the most debugging time, and what failures have actually reached customers. By end of week one I have a prioritized list of the highest-value tests to build first — which is always different from what intuition suggests. Then I build the CI skeleton before writing any tests, so all future tests have somewhere to live."

Q13. How do you build a testing culture in a team that's primarily ML and robotics engineers?

Model answer: "Don't call it 'testing' — call it 'understanding the system.' ML engineers are already rigorous about eval metrics; QE just formalizes that. My approach: I start by adding tests that immediately help them — tests that catch the bugs they've been debugging manually. When tests save them time, buy-in follows. I also make the test matrix visible: a shared doc showing coverage gaps. Engineers naturally want to fill gaps once they can see them. I never position QE as a gate — I position it as infrastructure that makes everyone's work faster."

Q14. A critical safety test is failing intermittently — e-stop response time is occasionally 220ms vs. the 200ms threshold. How do you handle this?

Model answer: "This is a hard stop on hardware deployment until resolved. A 220ms e-stop is not a flaky test; it's a safety system that doesn't meet specification. I immediately raise it to the hardware team and schedule a dedicated investigation session. I capture all available telemetry from the failing instances: system load at time of failure, CPU utilization, network latency to the safety controller, any concurrent processes. My hypothesis is that software interrupt latency under load is adding 20ms. If confirmed, the fix is either a dedicated real-time OS thread for safety functions, or a hardware-level e-stop that bypasses software entirely. We don't ship until this is resolved and re-verified against the 200ms threshold with margin: I want the worst case under 150ms, not just under 200ms."
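
A first pass over the captured telemetry might look like the sketch below. It compares CPU utilization for the stops that violated the limit against the ones that passed, which is a quick way to test the "latency under load" hypothesis. The record fields (`latency_ms`, `cpu_util`) and the sample values are hypothetical.

```python
from statistics import mean

def load_summary(samples, limit_ms=200.0):
    """Summarise e-stop samples: count violations and compare CPU load for slow vs. passing stops."""
    slow = [s for s in samples if s["latency_ms"] >= limit_ms]
    ok = [s for s in samples if s["latency_ms"] < limit_ms]
    return {
        "violations": len(slow),
        "worst_ms": max(s["latency_ms"] for s in samples),
        "mean_cpu_slow": mean(s["cpu_util"] for s in slow) if slow else None,
        "mean_cpu_ok": mean(s["cpu_util"] for s in ok) if ok else None,
    }

# Hypothetical telemetry: one stop out of three exceeded the limit.
samples = [
    {"latency_ms": 180, "cpu_util": 0.40},
    {"latency_ms": 220, "cpu_util": 0.95},
    {"latency_ms": 175, "cpu_util": 0.50},
]
summary = load_summary(samples)
print(summary)
```

The verdict is driven by the worst case and the violation count, never by the average: a single stop over the limit fails the safety function.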

Q15. How do you decide when the system is ready to deploy to a customer site?

Model answer: "I use a deployment readiness checklist with hard gates. Functional gates: success rate ≥85% on all customer tasks in HIL testing, intervention frequency <0.5/hr over a 4-hour endurance run, all canonical regression episodes pass. Safety gates: all safety function tests pass at required SIL/PL, endurance thermal test shows no degradation, risk assessment signed off by a qualified person. Process gates: customer site survey complete (environment matches tested conditions), trained operator on-site for first week. Any hard gate not met = no deployment. Soft factors like 'we're close' don't override hard gates."
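
The "any hard gate not met = no deployment" rule can be encoded so that nobody has to argue about it at release time. This is a minimal sketch: the gate names and the shape of the results dictionary are assumptions, and the thresholds are the ones quoted in the answer above.

```python
# Each hard gate maps a (hypothetical) result key to a pass/fail check.
HARD_GATES = {
    "hil_success_rate": lambda v: v >= 0.85,
    "interventions_per_hour": lambda v: v < 0.5,
    "canonical_episodes_pass": lambda v: v is True,
    "safety_functions_pass": lambda v: v is True,
    "thermal_no_degradation": lambda v: v is True,
    "risk_assessment_signed": lambda v: v is True,
    "site_survey_complete": lambda v: v is True,
    "operator_on_site": lambda v: v is True,
}

def readiness(results):
    """Return (ready, failed_gates). A missing result counts as a failed gate."""
    failed = [name for name, check in HARD_GATES.items()
              if name not in results or not check(results[name])]
    return (not failed, failed)

# Hypothetical release candidate: strong functional numbers, no signed risk assessment.
candidate = {
    "hil_success_rate": 0.91,
    "interventions_per_hour": 0.25,
    "canonical_episodes_pass": True,
    "safety_functions_pass": True,
    "thermal_no_degradation": True,
    "risk_assessment_signed": False,
    "site_survey_complete": True,
    "operator_on_site": True,
}
ready, failed = readiness(candidate)
print(ready, failed)
```

Treating a missing result as a failure is deliberate: a gate that was never measured is not a gate that passed.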

Q16. DVA uses long context (hundreds of frames). How does this affect your memory testing?

Model answer: "Long context means memory usage is task-duration-dependent, not just input-size-dependent. A 10-minute task with 30fps cameras = 18,000 context frames — you need to test memory at that scale, not at a 16-frame window. My approach: instrument the inference loop to log peak GPU memory every 30 seconds. Run a full 3-hour session. Plot memory over time — it should plateau as older context gets pruned, not grow unboundedly. Set a regression threshold: if any update causes peak memory to grow more than 10% compared to the previous baseline, block deployment. Also test the edge case: what happens when the context buffer is full? The pruning logic must be tested explicitly."
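
The arithmetic and the two checks described above can be sketched as follows. The plateau heuristic (compare the last quarter of the run with the quarter before it, allow 2% growth) is an assumption chosen for illustration, as are the function names and sample values.

```python
from statistics import mean

def context_frames(minutes, fps):
    """Number of frames produced over a task of the given duration."""
    return minutes * 60 * fps

def memory_regressed(baseline_peak_mb, new_peak_mb, tolerance=0.10):
    """True if peak memory grew more than `tolerance` over the previous baseline."""
    return new_peak_mb > baseline_peak_mb * (1 + tolerance)

def has_plateaued(samples_mb, tail_fraction=0.25, max_growth=0.02):
    """Heuristic: the last slice of the run is not meaningfully higher than the slice before it."""
    k = max(1, int(len(samples_mb) * tail_fraction))
    last, prev = samples_mb[-k:], samples_mb[-2 * k:-k]
    return mean(last) <= mean(prev) * (1 + max_growth)

print(context_frames(minutes=10, fps=30))

# Hypothetical peak-memory logs (MB): one run that levels off, one that leaks.
healthy = [1000, 2000, 3000, 3500, 3600, 3600, 3610, 3600]
leaking = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
print(has_plateaued(healthy), has_plateaued(leaking))
```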

Q17. How do you validate one-shot learning from a human demo?

Model answer: "One-shot learning is evaluated with a held-out set of novel scenarios: objects and environments that weren't in the training data, with a single human demonstration provided for each. The eval procedure: provide the demo, run 10 trials of the robot attempting the task, measure success rate. 'Novel' must be verified: I check the embedding distance between the eval scenario and all training episodes. If the cosine distance to the nearest training episode is below a threshold, the scenario isn't novel enough and I replace it. Target success rate for one-shot eval: ≥60% (lower than the standard task threshold, since the robot has had no practice of its own on the task, only the single human demo)."
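
Two parts of this procedure are easy to get wrong and worth sketching: the novelty check and how much 10 trials can actually tell you. The code below is an illustration under assumptions: embeddings are plain lists of floats, the novelty threshold is a placeholder, and the Wilson score interval is one standard choice for a binomial confidence interval rather than a method this guide prescribes.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_novel(eval_emb, training_embs, min_distance):
    """Novel only if even the nearest training episode is at least min_distance away."""
    nearest = min(cosine_distance(eval_emb, t) for t in training_embs)
    return nearest >= min_distance

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% confidence interval for a success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Hypothetical 2-D embeddings: one eval scenario near the training set, one far from it.
training = [[1.0, 0.0], [0.9, 0.1]]
print(is_novel([1.0, 0.05], training, min_distance=0.2))
print(is_novel([0.0, 1.0], training, min_distance=0.2))

low, high = wilson_interval(successes=6, trials=10)
print(round(low, 3), round(high, 3))
```

With 6 successes out of 10, the interval runs from roughly 0.31 to 0.83, so a single 10-trial run cannot by itself distinguish a model that meets the 60% target from one that falls well short. Pooling trials across scenarios narrows the interval.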

Q18. The inverse dynamics model performs well in the lab but fails at the customer site. What's your hypothesis?

Model answer: "The inverse dynamics model maps visual frame transitions to robot actions — it's sensitive to camera appearance. My first hypothesis: the customer site has different lighting or camera angle than the training environment, making the frame transitions look different even for identical motions. The model correctly maps lab-frame transitions to actions, but customer-site transitions look unfamiliar. Diagnostic: compare the embeddings of frame pairs from the customer site to the training distribution. If they're far out of distribution, the fix is fine-tuning the inverse dynamics model on a small amount of customer-site data — typically 1-2 hours of the robot operating there."
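
The diagnostic can be prototyped before any fine-tuning is scheduled. The sketch below uses Euclidean distance to the training centroid as a deliberately crude stand-in for "far out of distribution"; the distance measure, the percentile cutoff, and the toy embeddings are all assumptions, and a production check would likely use something more robust.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ood_fraction(train_embs, site_embs, percentile=0.99):
    """Fraction of site embeddings farther from the training centroid than
    the chosen percentile of the training embeddings themselves."""
    c = centroid(train_embs)
    train_d = sorted(euclidean(v, c) for v in train_embs)
    cutoff = train_d[min(len(train_d) - 1, int(percentile * len(train_d)))]
    return sum(euclidean(v, c) > cutoff for v in site_embs) / len(site_embs)

# Toy frame-pair embeddings: lab data clustered near the origin, one site sample far away.
lab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
site = [[0.5, 0.5], [5.0, 5.0]]
fraction = ood_fraction(lab, site)
print(fraction)
```

A high fraction supports the lighting or camera-angle hypothesis; a low one points the investigation back toward the controller or hardware layers.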

Q19. How do you test that the leapfrog inference timing is correct?

Model answer: "Leapfrog works when the prediction horizon covers the model's own inference time — meaning the robot always has valid commands to execute, never waiting for the next prediction. I test this by measuring the ratio of prediction horizon to inference time under load. If the inference time grows (GPU load, memory pressure), the prediction horizon must cover it or the robot pauses. I run a stress test: artificially slow down inference by 20%, 40%, 60% and verify the control loop remains smooth (no pauses, no velocity spikes). I also log command timestamps vs. inference completion timestamps in production — if any inference completes after its corresponding command window, that's a leapfrog violation to investigate."
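
Both halves of this test, the slowdown stress check and the production log check, fit in a short sketch. The timing values, the record fields (`inference_done_s`, `window_start_s`), and the function names are hypothetical.

```python
def horizon_covers(inference_s, horizon_s, slowdown=0.0):
    """True if the prediction horizon still covers inference time after a fractional slowdown."""
    return horizon_s >= inference_s * (1 + slowdown)

def leapfrog_violations(records):
    """Records where inference finished after its command window had already started."""
    return [r for r in records if r["inference_done_s"] > r["window_start_s"]]

# Hypothetical nominal timing: 0.30 s inference, 0.45 s prediction horizon.
for slowdown in (0.2, 0.4, 0.6):
    print(slowdown, horizon_covers(0.30, 0.45, slowdown))

# Hypothetical production log: the second cycle finished 20 ms late.
log = [
    {"cycle": 1, "inference_done_s": 10.40, "window_start_s": 10.45},
    {"cycle": 2, "inference_done_s": 10.92, "window_start_s": 10.90},
]
late = leapfrog_violations(log)
print([r["cycle"] for r in late])
```

With these numbers the horizon survives a 20% and a 40% slowdown but not 60%, which gives a concrete headroom figure to track across releases.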

Q20. Questions to ask the interviewer.

Model answers: "What's the current ratio of sim testing to hardware testing in your release process, and is that ratio where you want it to be?" / "How do you currently detect when a model update has regressed on a failure mode that was previously fixed?" / "What does a failed deployment look like — has a DVA system ever been pulled from a customer site, and what was the root cause?" / "How close is the engineering team to having a 24-hour fully automated test cycle, and what's blocking it?" / "What's the one quality failure mode that keeps you up at night for an autonomous 25kg robot?"

Self-Assessment Quiz

Q: Your DVA model achieves 88% success in sim. What's the minimum information you need before claiming "it's ready for hardware testing"?

The 88% figure alone is sufficient: it's above the 85% threshold
Pass all unit tests and the sim result
The sim-to-real correlation coefficient from previous releases, so you know what 88% in sim actually predicts on hardware

Q: A safety-critical test (e-stop timing) fails once in 50 runs. You're two days from a planned customer deployment. What do you do?

It passed 49/50 times: that's 98%, which is acceptable
Delay deployment by one day and run 50 more tests; if none fail, ship
Block deployment: safety-critical functions have zero tolerance for intermittent failure. Investigate and fix the root cause before any customer deployment regardless of timeline pressure
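
A short calculation shows why "run 50 more and ship if clean" is not a fix. If the true per-run failure probability is p, the chance of n clean runs in a row is (1 - p)^n, so after n clean runs the largest p still consistent with the data at a given confidence level is 1 - (1 - confidence)^(1/n). The sketch below is standard binomial reasoning, not a procedure taken from any safety standard, and the function name is illustrative.

```python
def failure_rate_upper_bound(clean_runs, confidence=0.95):
    """Largest per-run failure probability still consistent with zero failures in clean_runs runs."""
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)

bound_50 = failure_rate_upper_bound(50)
bound_500 = failure_rate_upper_bound(500)
print(round(bound_50, 4), round(bound_500, 4))
```

Fifty clean runs still leave room for a failure rate of nearly 6% at 95% confidence, which is higher than the 1-in-50 rate that triggered the question. Repetition alone cannot clear a safety function; only a root-cause fix followed by verification can.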

Chapter 11: Cheat Sheet

Everything you need on one page. Read this the morning of the interview.

Key Metrics Reference

Metric	Definition	Target	Measured how
Task success rate	% trials completed without intervention	≥85%	HIL / sim, 30+ trials
Intervention freq.	Human interventions per hour	<0.5/hr	Endurance run log
MTBF	Mean time between failures	≥120 min	Production telemetry
E-stop latency	Time to zero velocity after stop signal	<200ms (Cat. 0)	Encoder vs. trigger timestamp
Inference P99	99th percentile end-to-end latency	< leapfrog horizon	Profiling under load
Sim-to-real corr.	Correlation of sim vs. real success rates	≥0.85	Matched A/B sessions
Embedding drift	Cosine distance, old vs. new model	Below threshold	Automated on held-out frames
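
The sim-to-real correlation row can be computed from matched release data. The sketch below uses the Pearson coefficient plus a least-squares line to translate a sim figure into an expected hardware figure. The per-release success rates are made up for illustration, and Pearson is an assumed choice since the table does not say which correlation measure is intended.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical matched results from four previous releases.
sim_rates = [0.80, 0.84, 0.90, 0.86]
real_rates = [0.70, 0.75, 0.82, 0.77]

r = pearson(sim_rates, real_rates)
slope, intercept = fit_line(sim_rates, real_rates)
predicted_real = slope * 0.88 + intercept  # what 88% in sim suggests on hardware
print(round(r, 3), round(predicted_real, 3))
```

Four points is far too few for a trustworthy coefficient; the sketch only shows the mechanics. The number becomes useful once it is maintained across many releases.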

Safety Standards Quick Reference

Standard	What it governs	Key number
ISO 10218-1/2	Industrial robot safety, manufacturer + integrator	Risk assessment mandatory
ISO/TS 15066	Collaborative robot contact forces	≤140N quasi-static, ≤280N transient (chest)
IEC 62061	Safety function integrity (machinery)	SIL 2 for e-stop in most industrial contexts
ISO 13849	Safety control system performance	Performance Level (PL) d or e for high-risk

Test Pyramid for Robotics

[Figure: Robotics Test Pyramid]

DVA Architecture with Test Points

[Figure: Where Tests Live in the DVA Pipeline]

What to Say in the Interview

Opening (when asked about your approach to QE): "My job is to make the system's confidence in itself measurable. DVA is non-deterministic, physically embodied, and safety-critical — which means three things: perceptual oracle tests instead of exact-match assertions, a maintained sim-to-real correlation coefficient, and zero tolerance on safety function intermittency."

When asked about your biggest QE challenge: "Non-determinism plus physical safety. Those two things pull in opposite directions — safety testing wants guarantees, and ML systems make probabilistic promises. The resolution is separating the layers: the safety functions (e-stop, force limits, workspace boundary) are deterministic hardware circuits and must have hard guarantees. The ML layers (video model, inverse dynamics) are evaluated statistically. You never confuse the two."

When asked about a failure you've debugged: Use the four-layer framework — video prediction, inverse dynamics, controller, hardware. Show that you can instrument each layer independently and attribute failures precisely.

Red Flags to Watch For / Questions to Ask

Red flags in the role:

No existing sim-to-real correlation tracking
"Safety testing" means just running the robot and watching
QE treated as release gate, not engineering partner
No canonical failure episode library
Flaky tests tolerated in blocking CI
No automated HIL test execution

Good signs:

Engineers can articulate their sim-to-real gap
Safety system tests run automatically, not manually
There's already a test matrix, even if incomplete
QE is in design reviews, not just release reviews
Someone is tracking intervention frequency over time
The team has debugged and documented past failures