Staff-level interview prep for testing robot AI systems: DVA models, sim-to-real, safety, CI/CD, and debugging.
The team builds the DVA (Direct Video Action) system: a causal video model that predicts future frames, paired with an inverse dynamics model that converts those frames into robot joint commands. The robot sees the world through cameras, sends actuator commands to 25kg-payload arms, and must run autonomously for 160+ minutes in a live warehouse.
A Lead QE and Validation Engineer here owns the entire confidence chain between "code merged" and "robot ships to customer." That means writing tests, building infrastructure, running hardware, defining safety acceptance criteria, and debugging failures when they happen at 2am in a Decathlon warehouse.
The role sits at the intersection of four disciplines: ML engineering (model validation), robotics (hardware testing), software QE (CI/CD, test automation), and safety engineering (ISO compliance, risk assessment). You don't need to be expert-level in all four — but you need to speak fluently with the people who are.
Testing a video generation model that controls a robot is qualitatively different from testing traditional software. Traditional software: given input X, expect output Y. DVA: given camera frame X, the model generates a distribution over plausible futures — any of which might be reasonable. "Correct" output isn't a single value.
This creates four categories of novel testing challenges that you'll face daily on the team.
Run the same input through the video model twice: you get two different future-frame sequences, both plausible. Traditional test assertions (assert output == expected) are meaningless. You need distributional correctness — does the output fall within the plausible distribution? Is the object in the right location? Is the predicted motion physically consistent?
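A minimal sketch of a distributional assertion, assuming each sampled rollout can be reduced to one feature such as the predicted object position. The feature extraction step, the tolerance, and the 70% pass fraction are illustrative assumptions, not values from the DVA system.

```python
import math

def fraction_within(samples, target, tolerance):
    """Fraction of sampled predictions whose position lies within `tolerance` of `target`."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(
        1 for x, y in samples
        if math.hypot(x - target[0], y - target[1]) <= tolerance
    )
    return hits / len(samples)

# Hypothetical object positions (metres) extracted from four sampled rollouts of one input.
predicted_positions = [(0.50, 0.20), (0.52, 0.21), (0.49, 0.19), (0.80, 0.20)]
fraction = fraction_within(predicted_positions, target=(0.50, 0.20), tolerance=0.05)

# Pass if most samples are plausible, instead of demanding one exact output.
test_passes = fraction >= 0.70
```

Three of the four samples land inside the tolerance, so the check passes even though no two rollouts are identical.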
You need automated metrics for "does this predicted video look right?" Common metrics: SSIM (Structural Similarity Index) for frame quality, FVD (Fréchet Video Distance) for video distribution quality, optical flow consistency for temporal coherence. But these have limits: a model can score well on SSIM while still predicting the robot arm going the wrong direction.
A predicted video frame at t+1 must be consistent with the frame at t. Objects can't teleport. The robot arm's predicted trajectory must be smooth — acceleration limits apply. Test this with inter-frame difference analysis: flag any prediction where pixel deltas between frames exceed what's physically achievable given the robot's speed limits.
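One way to sketch this check is to track a single point on the arm across predicted frames and compare each per-frame jump with the largest displacement the speed limit allows. The tracked point, the speed limit, the frame rate, and the pixels-per-metre scale below are all hypothetical values chosen for illustration.

```python
import math

def max_pixel_step(max_speed_m_s, fps, pixels_per_metre):
    """Largest per-frame displacement (pixels) a point on the arm can physically make."""
    return max_speed_m_s / fps * pixels_per_metre

def find_teleports(track, max_step_px):
    """Return frame indices i where the jump from track[i-1] to track[i] exceeds the limit."""
    violations = []
    for i in range(1, len(track)):
        (x0, y0), (x1, y1) = track[i - 1], track[i]
        if math.hypot(x1 - x0, y1 - y0) > max_step_px:
            violations.append(i)
    return violations

# Assumed limits: 1.5 m/s tool speed, 30 fps camera, 400 px per metre at the work plane.
limit_px = max_pixel_step(max_speed_m_s=1.5, fps=30, pixels_per_metre=400)

# Hypothetical tracked gripper position (pixels) in five consecutive predicted frames.
track = [(100, 100), (110, 100), (118, 106), (190, 106), (195, 110)]
bad_frames = find_teleports(track, limit_px)
```

Here the limit works out to about 20 pixels per frame, and only the 72-pixel jump into frame 3 is flagged.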
The video model must predict a world that obeys physics. An item shouldn't float after the robot releases it. Containers don't compress. Cloth doesn't phase through the table. Build a suite of physics constraint checkers that run on predicted frames: bounding box continuity, gravity consistency, contact plausibility.
Why does assert predicted_frame == expected_frame fail as a test for DVA?
Running every test on the physical robot is slow, expensive, and dangerous during early development. Simulation lets you run thousands of tests per day at low cost. But simulation is always an approximation, and the gap between sim and real is one of the most insidious sources of bugs in robotics.
Simulation environments used in industrial robotics QE: MuJoCo (physics-accurate, widely used for manipulation research), Isaac Sim (NVIDIA, GPU-accelerated, photorealistic rendering), Gazebo (ROS-native, popular for integration testing). Each has different fidelity trade-offs.
Step 1: Domain randomization in simulation. Randomly vary lighting intensity, camera noise level, object mass, surface friction, and joint damping during test runs. If the policy succeeds across all variants, it's more likely to handle real-world variation.
Step 2: Correlation tracking. For every metric you measure in sim, measure the same metric on hardware. Build a chart: sim success rate vs. real success rate across many policy versions. If the correlation breaks — sim says 90% but real says 40% — your sim model has drifted from reality. This requires constant maintenance.
Step 3: Failure mode audit. When the robot fails on real hardware, replay the failure scenario in sim. If sim doesn't reproduce the failure, the domain gap is causing the discrepancy. Add the missing physics (contact model, sensor noise model) until sim matches real failure patterns.
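The correlation tracking in Step 2 can be sketched with a plain Pearson coefficient over matched success rates. The per-version numbers are made up for illustration; the 0.85 cutoff is the sim-to-real target this guide quotes elsewhere.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical success rates for five policy versions, measured in sim and on hardware.
sim_rates = [0.70, 0.75, 0.80, 0.85, 0.90]
real_rates = [0.62, 0.66, 0.73, 0.76, 0.82]

r = pearson(sim_rates, real_rates)
mean_gap = sum(s - h for s, h in zip(sim_rates, real_rates)) / len(sim_rates)
sim_tracks_real = r >= 0.85
```

Correlation alone does not detect a uniform offset: sim could read 50 points higher than hardware on every version and still correlate perfectly. That is why the sketch also reports `mean_gap`.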
Hardware-in-the-Loop (HIL) testing means running automated test suites on real physical hardware — the actual robot, actual cameras, actual sensors — not a simulation. It's slow (each test run takes minutes), expensive (arm wear, power cost), and harder to automate. But it's the only way to catch the failures that sim misses.
HIL testing at a robotics company happens in a test rig: a controlled physical setup with fixed camera positions, a standardized set of test objects (boxes, bottles, garments in known conditions), and automated test orchestration that resets the environment between runs. The reset is often the hardest engineering problem — you need a way to automatically return the scene to a known initial state after each test.
Camera calibration drifts. Extrinsic parameters (camera pose relative to robot base) shift when anyone bumps the rig. Run a calibration health check before every test session: project known 3D points (ArUco markers at fixed positions) and check that the reprojection error is within threshold. If it isn't, recalibrate before trusting any test results.
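A rough sketch of the calibration health check, assuming an ideal pinhole camera with no lens distortion and marker positions already expressed in the camera frame. The intrinsics, marker coordinates, detections, and the 1-pixel threshold are hypothetical.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def rms_reprojection_error(points_cam, detected_px, intrinsics):
    """Root-mean-square pixel distance between projected and detected marker positions."""
    squared = []
    for point, (u, v) in zip(points_cam, detected_px):
        pu, pv = project(point, *intrinsics)
        squared.append((pu - u) ** 2 + (pv - v) ** 2)
    return math.sqrt(sum(squared) / len(squared))

intrinsics = (600.0, 600.0, 320.0, 240.0)  # assumed fx, fy, cx, cy in pixels
marker_points_cam = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 2.0)]  # metres
detected_px = [(320.5, 240.0), (380.0, 240.5), (320.0, 270.0)]  # from a marker detector

error_px = rms_reprojection_error(marker_points_cam, detected_px, intrinsics)
calibration_ok = error_px <= 1.0  # assumed threshold; tune per rig
```

In practice the marker positions come from the rig's surveyed layout transformed by the current extrinsic estimate, so a rising error points at extrinsic drift.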
Joint encoders can drift. Gear wear changes the effective transmission ratio. Test actuator health with a known-trajectory test: command the robot to execute a precise geometric path (a square, a circle), capture joint encoder data, and verify that actual positions match commanded positions within tolerance. Run this nightly.
Every HIL test must start with safety interlock validation. The emergency stop must stop motion within 200ms. The payload limit enforcer must reject commands that would exceed the arm's rated capacity. The workspace boundary enforcer must prevent motion outside the safe zone. Test these before running any task-level tests: if safety systems are broken, task tests are dangerous.
The team updates the DVA model regularly: new pre-training data, new robot demos, architectural tweaks. Each update must be validated before deployment. But how do you test a foundation model that controls a robot?
The core challenge: the model is not a function with a single correct output. It's a policy — a mapping from observations to action distributions. Validating it requires running it on representative tasks and measuring outcomes, not just checking outputs against a golden dataset.
| Metric | Definition | Target |
|---|---|---|
| Task success rate | % of trials where the robot completes the full task without intervention | ≥ 85% for production |
| Task completion time | Wall-clock time per successful trial | Within 20% of human operator |
| Error recovery rate | % of recoverable failures where robot self-corrects | ≥ 70% |
| Intervention frequency | Human interventions per hour of operation | < 0.5 per hour for production |
| MTBF | Mean time between failures requiring intervention | ≥ 120 min for industrial deployment |
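As an illustration of how the last two rows are computed from a session log, here is a small sketch. It assumes every human intervention counts as one failure, which may not match how a given team defines failures; the session numbers are hypothetical.

```python
import math

def interventions_per_hour(num_interventions, session_minutes):
    """Human interventions per hour of operation."""
    if session_minutes <= 0:
        raise ValueError("session length must be positive")
    return num_interventions / (session_minutes / 60)

def mtbf_minutes(num_failures, session_minutes):
    """Operating time divided by failure count; infinite if no failure was observed."""
    if num_failures == 0:
        return math.inf
    return session_minutes / num_failures

# Hypothetical 4-hour endurance run with two interventions.
frequency = interventions_per_hour(2, 240)
mtbf = mtbf_minutes(2, 240)

meets_frequency_target = frequency < 0.5   # strict bound, as in the table
meets_mtbf_target = mtbf >= 120            # inclusive bound, as in the table
```

This run sits exactly on both boundaries: it meets the MTBF target but misses the intervention target, because one bound is inclusive and the other is strict. It is worth deciding up front which way boundary cases fall.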
When a new model version is ready, run a parallel test: the old model and the new model each execute the same task suite on the same hardware setup, alternating trials. Measure success rate, completion time, and intervention frequency for both. Compute the difference and test for statistical significance.
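One common way to test the difference in success rates is a pooled two-proportion z-test, sketched below with only the standard library. The trial counts are hypothetical, and the choice of this particular test is an assumption; for small samples an exact test may be more appropriate.

```python
import math

def two_proportion_z_test(success_old, n_old, success_new, n_new):
    """Return (rate difference, z statistic, two-sided p-value) for new minus old."""
    p_old, p_new = success_old / n_old, success_new / n_new
    pooled = (success_old + success_new) / (n_old + n_new)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if std_err == 0:
        return p_new - p_old, 0.0, 1.0  # both models succeeded (or failed) on every trial
    z = (p_new - p_old) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_new - p_old, z, p_value

# Hypothetical result: old model 24/30 successes, new model 27/30.
difference, z, p_value = two_proportion_z_test(24, 30, 27, 30)
significant = p_value < 0.05
```

With these numbers a 10-point improvement is not statistically significant at 30 trials per model, which shows why trial counts should be planned before committing hardware time.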
Regression test suite structure: keep a library of canonical test episodes — recordings of past failures and edge cases that were fixed. Every model update must pass all canonical episodes. This prevents regressions on known failure modes even when aggregate metrics improve.
The video model produces internal embeddings. If a new model version produces embeddings that differ drastically from the previous version (measured by cosine distance on held-out frames), something fundamental changed — investigate before hardware testing. This is a fast, cheap smoke test that catches major architectural regressions before they waste robot time.
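A minimal sketch of the drift check, assuming both model versions emit embeddings of the same dimension in a comparable space (if the embedding space is free to rotate between training runs, raw cosine distance is not meaningful). The vectors and the 0.1 threshold are illustrative.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_a * norm_b)

def mean_drift(old_embeddings, new_embeddings):
    """Average cosine distance between matched embeddings of the same held-out frames."""
    distances = [cosine_distance(o, n) for o, n in zip(old_embeddings, new_embeddings)]
    return sum(distances) / len(distances)

# Toy 2-D embeddings for two held-out frames under the old and new model.
old_embeddings = [[1.0, 0.0], [0.0, 1.0]]
new_embeddings = [[1.0, 0.0], [1.0, 1.0]]

drift = mean_drift(old_embeddings, new_embeddings)
needs_investigation = drift > 0.1  # assumed threshold; calibrate on known-good updates
```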
A 25kg-payload robot arm can deliver lethal force. Industrial robot safety is not optional — it is a legal and moral prerequisite for deployment. As lead QE, you own the safety validation plan and are the person who signs off that the system is safe to operate.
| Standard | Scope | Key requirement for the team |
|---|---|---|
| ISO 10218-1/2 | Industrial robot safety — robot manufacturer and integrator | Risk assessment, safety-rated control system, separation distances |
| ISO/TS 15066 | Collaborative robots (human-robot co-existence) | Contact force limits for chest: <140N quasi-static, <280N transient |
| IEC 62061 | Safety of machinery — functional safety | Safety function integrity levels (SIL) for e-stop and speed monitoring |
| ISO 13849 | Safety-related control system performance | Performance Level (PL) for safety-rated monitoring functions |
Every new task deployment requires a formal risk assessment. Steps: (1) enumerate hazardous scenarios (arm collision, dropped payload, unexpected motion, tool ejection), (2) assess likelihood and severity for each, (3) define risk reduction measures (guards, speed limits, force monitoring), (4) verify measures with tests, (5) document and sign off.
E-stop response time: Command stop, measure time to zero velocity. Requirement: <200ms for category 0 stop (immediate de-energize). Test at start of every test session.
Force/torque limit testing: Command approach to a force plate. Verify that contact force never exceeds rated limits before the safety monitor intervenes. Test at multiple approach speeds and angles.
Workspace boundary enforcement: Command positions outside the safe zone. Verify the controller rejects the command without executing any unsafe motion.
Endurance testing: Run the full task loop continuously for 3+ hours. Monitor joint temperatures, motor current draw, error rates, and task success rate over time. Flag any degradation trend.
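The e-stop response time measurement above can be sketched as a pass over logged velocity samples. The log format (timestamp in ms, largest absolute joint velocity), the velocity epsilon, and the sample values are assumptions for illustration.

```python
def stop_time_ms(samples, trigger_ms, velocity_eps=1e-3):
    """Time from trigger until velocity drops below eps and stays there; None if it never does.

    samples: list of (timestamp_ms, max_abs_joint_velocity), sorted by timestamp.
    """
    after_trigger = [(t, v) for t, v in samples if t >= trigger_ms]
    settled_at = None
    # Walk backwards so a brief dip to zero followed by more motion is not counted as stopped.
    for t, v in reversed(after_trigger):
        if abs(v) >= velocity_eps:
            break
        settled_at = t
    if settled_at is None:
        return None
    return settled_at - trigger_ms

# Hypothetical encoder-derived velocities (rad/s); e-stop triggered at t = 10 ms.
log = [(0, 0.5), (10, 0.5), (50, 0.3), (100, 0.1), (150, 0.02),
       (180, 0.0), (190, 0.0), (200, 0.0)]

measured_ms = stop_time_ms(log, trigger_ms=10)
estop_ok = measured_ms is not None and measured_ms < 200
```

The measurement resolution is limited by the logging rate, so the sample period should be much shorter than the margin you care about.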
Every code merge should trigger automated tests. But "automated tests" for robotics means something more complex than a standard software pipeline — some tests run on GPUs in cloud VMs, some run in simulation on cloud instances, and some run on physical hardware in a test lab. Orchestrating all three without manual intervention is the infrastructure challenge of this role.
Non-deterministic ML systems produce non-deterministic test results. A test that passes 19 times out of 20 is not a passing test; it's a flaky test. Track test pass rates over time. Any test with a pass rate below 98% over 30 runs is either flaky (an infrastructure issue) or exposing a real intermittent bug. Never let flaky tests accumulate, because they erode trust in the entire CI system.
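The pass-rate rule can be sketched in a few lines. The history format (a list of booleans, oldest first) is an assumption; the 98% and 30-run figures come from the paragraph above.

```python
def pass_rate(results, window=30):
    """Pass rate over the most recent `window` runs (results are booleans, oldest first)."""
    recent = results[-window:]
    if not recent:
        raise ValueError("no results recorded")
    return sum(recent) / len(recent)

def needs_investigation(results, window=30, min_rate=0.98):
    """True when a full window of history exists and its pass rate is below min_rate."""
    if len(results) < window:
        return False  # too little history to judge flakiness; treat failures individually
    return pass_rate(results, window) < min_rate

one_failure_in_thirty = [True] * 29 + [False]
all_green = [True] * 30

flagged = needs_investigation(one_failure_in_thirty)
clean = needs_investigation(all_green)
```

At a 30-run window, a 98% threshold means a single failure (29/30, about 96.7%) is enough to flag the test, so the rule is effectively "zero failures in the last 30 runs".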
Docker containers for reproducible test environments — every simulation test runs in the same container image, with pinned versions of MuJoCo, PyTorch, and robot controller firmware. Kubernetes for cloud simulation test orchestration — scale up 20 parallel sim instances for a big policy evaluation, scale to zero overnight.
Test result dashboards: Every test run writes structured metrics to a time-series database (InfluxDB or similar). Dashboard (Grafana) shows: per-task success rate trend, inference latency percentiles, simulation vs. hardware correlation, flaky test rate. This is what you present to engineering leadership every week.
DVA must run in real time. The leapfrog inference scheme means the video model has a budget — its predictions must extend far enough into the future to cover its own compute time. If inference is too slow, the robot waits, control becomes jerky, and task success rate drops. Latency is a functional requirement, not a nice-to-have.
Measure end-to-end inference time: from camera frame arriving to robot joint command being sent. This includes image preprocessing, video model forward pass, inverse dynamics model forward pass, and command serialization. Target: the total must be less than the leapfrog prediction horizon. Measure at P50, P90, and P99 — tail latency matters because P99 spikes cause visible robot hesitation.
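A small sketch of the percentile check using the nearest-rank method. The latency samples and the 100 ms horizon are hypothetical; the real horizon depends on the leapfrog configuration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical end-to-end latencies in ms: mostly fast, with a small slow tail.
latencies_ms = [40] * 90 + [60] * 8 + [150] * 2

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)

leapfrog_horizon_ms = 100  # assumed prediction horizon
fits_budget = p99 < leapfrog_horizon_ms
```

The median looks healthy at 40 ms, yet the P99 of 150 ms exceeds the assumed horizon, so gating on P50 alone would have passed a build that stalls the robot on 1 to 2% of cycles.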
Long-context visual memory (hundreds of frames) is DVA's differentiator — but it comes with a memory cost. Profile peak GPU memory usage during a full task episode. Measure how memory grows as context length grows. Define the maximum context length that fits within GPU memory budget at deployment, and add a regression test that fails if memory usage exceeds that threshold.
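A sketch of the two checks described above: an absolute budget and a regression guard against the previous baseline. The 10% growth allowance matches the figure used in the Q16 answer later in this guide; the memory numbers and the budget are hypothetical.

```python
def context_frames(task_minutes, fps):
    """Number of frames generated over a task if every frame is kept in context."""
    return task_minutes * 60 * fps

def exceeds_budget(peak_mb, budget_mb):
    """True when measured peak GPU memory is above the deployment budget."""
    return peak_mb > budget_mb

def memory_regressed(baseline_peak_mb, new_peak_mb, max_growth=0.10):
    """True when peak memory grew by more than max_growth relative to the baseline."""
    return new_peak_mb > baseline_peak_mb * (1 + max_growth)

frames_for_ten_minutes = context_frames(task_minutes=10, fps=30)

# Hypothetical peaks: baseline 20,000 MB, candidate builds at 21,000 MB and 22,500 MB.
small_growth_blocked = memory_regressed(20_000, 21_000)   # 5% growth
large_growth_blocked = memory_regressed(20_000, 22_500)   # 12.5% growth
over_budget = exceeds_budget(22_500, budget_mb=24_000)    # assumed 24 GB budget
```

Keeping both checks matters: a series of updates that each grow memory by 9% would pass the relative guard every time but eventually hit the absolute budget.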
Frames processed per second under sustained load. The robot generates frames continuously — test that the model can keep up at the camera's frame rate without building a queue backlog. Simulate 30 minutes of operation: does the inference queue grow over time (bad) or stay bounded (good)?
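Before running the real 30-minute soak, a simple fluid model shows what a bounded versus growing queue looks like. This is a deliberately simplified sketch with a constant per-frame service time; real inference times vary, and the numbers are hypothetical.

```python
def simulate_backlog(fps, service_ms, duration_s):
    """Queue depth at the end of each second for constant arrival and service rates."""
    capacity_per_s = 1000.0 / service_ms  # frames the model can process per second
    backlog = 0.0
    depths = []
    for _ in range(duration_s):
        backlog = max(0.0, backlog + fps - capacity_per_s)
        depths.append(backlog)
    return depths

def backlog_is_growing(depths):
    """Crude trend check: is the queue deeper at the end than at the midpoint?"""
    return depths[-1] > depths[len(depths) // 2]

thirty_minutes_s = 30 * 60
keeping_up = simulate_backlog(fps=30, service_ms=30, duration_s=thirty_minutes_s)
falling_behind = simulate_backlog(fps=30, service_ms=40, duration_s=thirty_minutes_s)
```

At 40 ms per frame the model handles 25 frames per second against 30 arriving, so the queue gains 5 frames every second and reaches 9,000 frames after 30 minutes.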
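The robot fails a task. You have a recording: camera frames, joint positions, model inputs and outputs, predicted video frames. Your job is to determine which of the four system layers failed: the video model (Layer 1), the inverse dynamics model (Layer 2), the controller (Layer 3), or the hardware (Layer 4).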
Step 1: Watch the predicted video frames. Do they show the robot successfully completing the task? If YES → the video model did its job. If NO → Layer 1 failure. Investigate: was the scene out-of-distribution? Did the model hallucinate? Check embedding similarity to training data.
Step 2: If predicted video is correct, extract the inverse dynamics outputs. Do the actions it computed correspond to the motion shown in the predicted frames? Simulate the actions in MuJoCo: does simulated motion match the predicted video? If NO → Layer 2 failure. Check inverse dynamics model on similar frame pairs from training data.
Step 3: If actions are correct, compare commanded joint positions to executed joint positions from encoders. Match? → Layer 4 (hardware). Don't match? → Layer 3 (controller).
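The three steps above form a decision tree, which can be written down directly. The boolean inputs stand for the outcome of each manual or automated check; how each check is computed is outside this sketch.

```python
def triage(video_shows_success, actions_match_video, encoders_match_commands):
    """Map the three diagnostic answers to the layer most likely at fault."""
    if not video_shows_success:
        return "Layer 1: video model"
    if not actions_match_video:
        return "Layer 2: inverse dynamics model"
    if not encoders_match_commands:
        return "Layer 3: controller"
    # Prediction, actions, and tracking all check out, so the fault is physical.
    return "Layer 4: hardware"

# Example: predicted video looks right and actions match it,
# but the encoders show the arm did not follow the commands.
verdict = triage(
    video_shows_success=True,
    actions_match_video=True,
    encoders_match_commands=False,
)
```

The ordering matters: each check is only meaningful once the layers upstream of it have been cleared.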
These are the questions a sharp robotics engineering team will ask a Lead QE candidate. Each answer is calibrated for a staff-level role — concrete, specific, and showing ownership.
Q1. How would you test a video generation model that controls a robot?
Model answer: "I'd build a three-tier oracle system. Tier 1: perceptual metrics — SSIM and FVD on held-out video clips measure frame quality, but aren't sufficient alone. Tier 2: physics constraint checkers — automated tests that verify predicted frames don't violate object permanence, gravity, or kinematic limits. These are deterministic and fast. Tier 3: downstream task success — run the full DVA pipeline in simulation and measure task success rate. The first two tiers catch regressions cheaply; the third catches emergent failures the earlier tiers miss. I'd gate on all three in CI."
Q2. How do you handle non-determinism in your test suite?
Model answer: "I separate tests by determinism class. Preprocessing, config loading, calibration computation — these are deterministic, so exact-match assertions work. Model inference — non-deterministic, so I use statistical bounds with multiple runs, or fix the random seed for reproducibility checks. End-to-end task success — I track pass rates over N trials, not individual pass/fail. I also maintain a flaky test quarantine: any test with pass rate below 98% over 30 consecutive runs gets moved out of the blocking pipeline and assigned for investigation."
Q3. How do you write regression tests for a foundation model updated weekly?
Model answer: "I maintain a canonical episode library: recordings of past failures and edge cases that were resolved. Every model update must pass all canonical episodes. I also track an embedding drift metric: if the new model's embeddings on held-out frames are more than a configurable cosine distance from the previous model, it triggers an investigation flag before hardware testing. Finally, I run A/B task trials on hardware: 30 trials per model version, alternating, and compute the success rate difference with a 95% confidence interval. Deployment is blocked if the interval includes a drop of more than 5 percentage points."
Q4. Simulation: 90% success, hardware: 60%. Walk me through your debugging process.
Model answer: "First, I check the sim-to-real correlation history. Is this a new gap or has it been drifting? If new, something changed — either a model update or the hardware environment changed. I replay the 20 hardware failures in sim: if sim reproduces them, the model has a real bug. If sim doesn't reproduce them, the domain gap is the issue. I then enumerate sim-to-real gaps: lighting (run a brightness sweep in sim), contact model (run different friction values), sensor noise (add Gaussian noise to camera input). I identify which of these, when added to sim, reproduces the hardware failure pattern. That tells me what to fix."
Q5. How do you maintain your simulation's fidelity over time?
Model answer: "I treat sim-to-real correlation as a living metric. After every 10 hardware test sessions, I run a matched comparison: the same 20 episodes in sim and on hardware. I track the correlation coefficient over time on a dashboard. If it drops below 0.85, I halt sim-gated deployments and schedule a sim audit. The audit checks: physics parameters (friction, inertia), camera model (distortion, noise), and environment setup (lighting, object placement tolerance). I also version-control the sim environment config alongside the model — so if hardware changes (robot arm wear, new camera mount), the sim config gets a corresponding update."
Q6. Design a safety validation plan for a 25kg-payload robot operating autonomously for 3+ hours.
Model answer: "Three layers. Layer 1, pre-deployment static validation: risk assessment per ISO 10218, workspace hazard analysis, safety function SIL determination per IEC 62061, e-stop response time test (<200ms), force limit test against a load cell. Layer 2, continuous runtime safety monitoring: joint torque, velocity, and temperature monitored at 1kHz with hardware safety limits independent of the DVA software stack. Layer 3, endurance validation: run the full task loop for 4 hours (33% headroom over the 3-hour requirement) with a test engineer monitoring remotely. Log joint temperature trends, motor current, task success rate, and error frequency across the session. The robot passes only if all metrics are stable in the final hour; no degradation trend is acceptable."
Q7. How do you test that the e-stop actually works?
Model answer: "E-stop testing is automated and runs at the start of every HIL test session. The procedure: command the robot to execute a known slow motion, trigger the e-stop via software API, and record time-to-zero-velocity from joint encoder data. Threshold: <200ms for Category 0 (de-energize), <500ms for Category 1 (controlled stop). I also test the hardware e-stop button separately — a human physically pushes the button during a controlled motion. I log response times over weeks to detect degradation trends. Any response time over threshold blocks the test session and pages the hardware team."
Q8. The robot runs fine in the first hour but degrades in hour 2. What do you investigate?
Model answer: "Thermal effects are the primary suspect — I check joint temperature telemetry first. If joints are heating up, the transmission efficiency changes, and the inverse dynamics model's action predictions become increasingly wrong because they were calibrated at cold-start parameters. Secondary: GPU thermal throttling — if the inference GPU hits its thermal limit, inference latency spikes, and the leapfrog timing budget gets violated. Third: memory pressure — if the long-context memory grows unbounded over the session, eventually it causes memory pressure and slower inference. I'd log all three throughout the endurance run and correlate degradation onset with changes in any of these signals."
Q9. What does your nightly test report include?
Model answer: "Five sections. (1) Build health: did all stages complete, were there timeouts? (2) Unit/integration metrics: pass rate, new failures vs. known failures, flaky test quarantine count. (3) Simulation summary: per-task success rate with delta from previous night, any tasks that regressed >5%. (4) Performance snapshot: P50/P99 inference latency, GPU memory peak, throughput under sustained load — with trend lines over the past two weeks. (5) Action items: any test that needs investigation, with a recommended owner. The report is auto-generated, posted to Slack, and the CI dashboard retains 90 days of history."
Q10. How do you prevent a flaky test from blocking your team?
Model answer: "I never let flaky tests live in the blocking pipeline. Any test that fails without a consistent repro gets immediately moved to a quarantine pipeline. It still runs daily — I want the data — but it doesn't gate merges. A weekly flaky-test review meeting goes through the quarantine list: each flaky test gets assigned an owner and a deadline. If it's been in quarantine for more than two weeks without investigation, the test gets deleted and we file a ticket to rewrite it properly. This is non-negotiable — accumulated flaky tests are a death spiral where engineers stop trusting CI."
Q11. How do you measure if your test suite is actually providing value?
Model answer: "I track two metrics. First, defect escape rate: how many bugs found in production (customer site) vs. bugs caught in CI. This should trend toward zero escapes. Second, mean time to detection: when a regression is introduced, how many hours until CI flags it? I target under 24 hours for critical paths. I also do periodic fault injection: intentionally introduce a known bug and verify CI catches it within the expected time window. If CI misses the injected fault, that's a coverage gap to fix."
Q12. You're the first QE hire at a robotics startup. What do you build in week one?
Model answer: "Week one is purely observational and inventory-taking. I shadow engineers running the robot, attend every test session, and ask about every failure they've seen. I collect: what tests already exist (even informal ones), what the current release process looks like, where engineers spend the most debugging time, and what failures have actually reached customers. By end of week one I have a prioritized list of the highest-value tests to build first — which is always different from what intuition suggests. Then I build the CI skeleton before writing any tests, so all future tests have somewhere to live."
Q13. How do you build a testing culture in a team that's primarily ML and robotics engineers?
Model answer: "Don't call it 'testing' — call it 'understanding the system.' ML engineers are already rigorous about eval metrics; QE just formalizes that. My approach: I start by adding tests that immediately help them — tests that catch the bugs they've been debugging manually. When tests save them time, buy-in follows. I also make the test matrix visible: a shared doc showing coverage gaps. Engineers naturally want to fill gaps once they can see them. I never position QE as a gate — I position it as infrastructure that makes everyone's work faster."
Q14. A critical safety test is failing intermittently — e-stop response time is occasionally 220ms vs. the 200ms threshold. How do you handle this?
Model answer: "This is a hard stop on hardware deployment until resolved. 220ms e-stop is not a flaky test — it's a safety system that doesn't meet specification. I immediately raise it to the hardware team and schedule a dedicated investigation session. I capture all available telemetry from the failing instances: system load at time of failure, CPU utilization, network latency to the safety controller, any concurrent processes. My hypothesis is that software interrupt latency under load is adding 20ms. If confirmed, the fix is either a dedicated real-time OS thread for safety functions, or hardware-level e-stop that bypasses software entirely. We don't ship until this is resolved and verified at <150ms with margin."
Q15. How do you decide when the system is ready to deploy to a customer site?
Model answer: "I use a deployment readiness checklist with hard gates. Functional gates: success rate ≥85% on all customer tasks in HIL testing, intervention frequency <0.5/hr over a 4-hour endurance run, all canonical regression episodes pass. Safety gates: all safety function tests pass at required SIL/PL, endurance thermal test shows no degradation, risk assessment signed off by a qualified person. Process gates: customer site survey complete (environment matches tested conditions), trained operator on-site for first week. Any hard gate not met = no deployment. Soft factors like 'we're close' don't override hard gates."
Q16. DVA uses long context (hundreds of frames). How does this affect your memory testing?
Model answer: "Long context means memory usage is task-duration-dependent, not just input-size-dependent. A 10-minute task with 30fps cameras = 18,000 context frames — you need to test memory at that scale, not at a 16-frame window. My approach: instrument the inference loop to log peak GPU memory every 30 seconds. Run a full 3-hour session. Plot memory over time — it should plateau as older context gets pruned, not grow unboundedly. Set a regression threshold: if any update causes peak memory to grow more than 10% compared to the previous baseline, block deployment. Also test the edge case: what happens when the context buffer is full? The pruning logic must be tested explicitly."
Q17. How do you validate one-shot learning from a human demo?
Model answer: "One-shot learning is evaluated with a held-out set of novel scenarios — objects and environments that weren't in the training data, with a single human demonstration provided for each. The eval procedure: provide the demo, run 10 trials of the robot attempting the task, measure success rate. 'Novel' must be verified: I check embedding distance between the eval scenario and all training episodes. If the cosine distance is below a threshold, the scenario isn't novel enough and I replace it. Target success rate for one-shot eval: ≥60% (lower than the standard task threshold, since it's genuinely zero-shot on the robot side)."
Q18. The inverse dynamics model performs well in the lab but fails at the customer site. What's your hypothesis?
Model answer: "The inverse dynamics model maps visual frame transitions to robot actions — it's sensitive to camera appearance. My first hypothesis: the customer site has different lighting or camera angle than the training environment, making the frame transitions look different even for identical motions. The model correctly maps lab-frame transitions to actions, but customer-site transitions look unfamiliar. Diagnostic: compare the embeddings of frame pairs from the customer site to the training distribution. If they're far out of distribution, the fix is fine-tuning the inverse dynamics model on a small amount of customer-site data — typically 1-2 hours of the robot operating there."
Q19. How do you test that the leapfrog inference timing is correct?
Model answer: "Leapfrog works when the prediction horizon covers the model's own inference time — meaning the robot always has valid commands to execute, never waiting for the next prediction. I test this by measuring the ratio of prediction horizon to inference time under load. If the inference time grows (GPU load, memory pressure), the prediction horizon must cover it or the robot pauses. I run a stress test: artificially slow down inference by 20%, 40%, 60% and verify the control loop remains smooth (no pauses, no velocity spikes). I also log command timestamps vs. inference completion timestamps in production — if any inference completes after its corresponding command window, that's a leapfrog violation to investigate."
Q20. Questions to ask the interviewer at the team.
Model answers: "What's the current ratio of sim testing to hardware testing in your release process, and is that ratio where you want it to be?" / "How do you currently detect when a model update has regressed on a failure mode that was previously fixed?" / "What does a failed deployment look like — has a DVA system ever been pulled from a customer site, and what was the root cause?" / "How close is the engineering team to having a 24-hour fully automated test cycle, and what's blocking it?" / "What's the one quality failure mode that keeps you up at night for an autonomous 25kg robot?"
Everything you need on one page. Read this the morning of the interview.
| Metric | Definition | Target | Measured how |
|---|---|---|---|
| Task success rate | % trials completed without intervention | ≥85% | HIL / sim, 30+ trials |
| Intervention freq. | Human interventions per hour | <0.5/hr | Endurance run log |
| MTBF | Mean time between failures | ≥120 min | Production telemetry |
| E-stop latency | Time to zero velocity after stop signal | <200ms (Cat. 0) | Encoder vs. trigger timestamp |
| Inference P99 | 99th percentile end-to-end latency | < leapfrog horizon | Profiling under load |
| Sim-to-real corr. | Correlation of sim vs. real success rates | ≥0.85 | Matched A/B sessions |
| Embedding drift | Cosine distance, old vs. new model | Below threshold | Automated on held-out frames |
| Standard | What it governs | Key number |
|---|---|---|
| ISO 10218-1/2 | Industrial robot safety, manufacturer + integrator | Risk assessment mandatory |
| ISO/TS 15066 | Collaborative robot contact forces | <140N quasi-static, <280N transient (chest) |
| IEC 62061 | Safety function integrity (machinery) | SIL 2 for e-stop in most industrial contexts |
| ISO 13849 | Safety control system performance | Performance Level (PL) d or e for high-risk |
Red flags in the role:
Good signs: