Every calculation a robotics engineer needs to derive from scratch. State vectors, Jacobians, data association, loop closure, graph optimization, visual odometry, and map representations — all solvable in-browser with instant feedback.
You're building a robot that needs to map an unknown environment while simultaneously figuring out where it is in that map. This is the Simultaneous Localization and Mapping (SLAM) problem. The robot's belief about the world is encoded in a single state vector — its own pose plus the positions of every landmark it has ever seen.
In 2D EKF-SLAM, the state vector stacks the robot pose and all landmark positions:
A 2D robot has observed 3 landmarks. What is the size of the state vector?
State = [robot pose (3)] + [landmarks (2 each)].
x = [xr, yr, θr, x1, y1, x2, y2, x3, y3]T. Nine numbers fully describe the robot's belief about the world.
For 50 landmarks in 2D, how many elements does the full covariance matrix Σ contain?
This is why EKF-SLAM doesn't scale well. The covariance matrix grows quadratically with the number of landmarks. At 1000 landmarks, you'd have a 2003×2003 matrix with ~4 million elements — and you need to invert a portion of it at every update step.
The covariance matrix becomes fully dense over time. Here's why: when the robot observes landmark A, the update step creates correlations between the robot pose and A. Later, when it observes landmark B, the update creates correlations between the pose and B. But since the robot pose is correlated with A, B now becomes correlated with A — even though they were never observed at the same time.
This "correlation contagion" through the shared robot pose is both the power and the curse of EKF-SLAM. Power: loop closure corrections propagate everywhere. Curse: O(N²) memory and O(N²) updates.
In 3D SLAM, the robot pose has 6 DOF (x, y, z, roll, pitch, yaw) and each landmark has 3 coordinates (x, y, z). For a robot that has mapped 200 landmarks, what is the covariance matrix size?
In practice, 3D EKF-SLAM is rarely used beyond ~100 landmarks due to the O(N²) cost. Modern 3D SLAM systems use graph-based or factor graph approaches that exploit the sparsity of the information matrix (inverse covariance) instead.
Write a function that returns the state vector dimension and covariance element count for a SLAM problem.
javascript function stateSize(poseDim, landmarkDim, numLandmarks) { const n = poseDim + landmarkDim * numLandmarks; return [n, n * n]; }
The covariance matrix is symmetric (Σ = ΣT). For a 2D SLAM problem with 10 landmarks, how many unique entries does Σ have? (Include the diagonal.)
Exploiting symmetry halves the storage. In practice, implementations store only the upper triangle. For n=23 this saves 253 entries; for n=1003 (500 landmarks in 2D) it saves ~500,000 entries.
Your robot sees a landmark with its range-bearing sensor. The sensor reports: "I see something 5 meters away at 30 degrees to my left." But the state vector stores landmark positions in global coordinates. How do you connect the raw sensor measurement to the state? That's the observation model — the function h(x) that predicts what the sensor should see given the current state estimate.
Robot at (2, 3, π/4). Landmark at (5, 7). What is the predicted range r?
Same setup: robot at (2, 3, π/4), landmark at (5, 7). What is the predicted bearing φ in radians? (Bearing = atan2(Δy, Δx) - θr.)
The landmark is slightly to the left of the robot's heading. If the robot were facing directly at the landmark, φ would be zero.
The sensor measures z = [5.3, 0.05]T (range 5.3m, bearing 0.05 rad). Using the predicted h(x) = [5.0, 0.14]T from above, what is the innovation ν?
The range innovation is +0.3m (landmark is farther than expected), and the bearing innovation is -0.09 rad (landmark is more to the right than expected). These residuals will drive the EKF update to shift the state estimate.
The predicted measurement h(x) depends on the robot pose (xr, yr, θr) and the observed landmark's position (x2, y2). It does not depend on any other landmark's position. So H has nonzero entries only in columns 1-3 (pose) and columns 6-7 (landmark 2). The other 4 columns are all zeros.
This sparse structure is critical: it means each observation only directly updates the robot pose and the observed landmark. However, because the covariance Σ has off-diagonal correlations, the correction indirectly propagates to all other landmarks through the Kalman gain K = ΣHTS-1.
For the range measurement r = √(Δx² + Δy²), compute ∂r/∂xr. (Let Δx = xj - xr, Δy = yj - yr, q = Δx² + Δy².)
The negative sign makes sense: increasing xr (moving the robot right) decreases Δx when the landmark is to the right, which decreases the range. Note that ∂r/∂xj = +Δx/√q — same magnitude, opposite sign.
Write a function that computes the predicted measurement [range, bearing] given a robot pose and landmark position.
javascript function observationModel(xr, yr, thr, xj, yj) { const dx = xj - xr; const dy = yj - yr; const range = Math.sqrt(dx * dx + dy * dy); const bearing = Math.atan2(dy, dx) - thr; return [range, bearing]; }
Your robot sees a blip on its sensor. Is it landmark #3 that you mapped two minutes ago, or a completely new landmark? Get it wrong and your map breaks catastrophically — matching to the wrong landmark introduces phantom correlations that can never be corrected. This is the data association problem, and it's often harder than the estimation itself.
The standard approach uses Mahalanobis distance — the innovation scaled by its expected uncertainty:
Innovation ν = 0.5m. Innovation variance S = 0.25 m². What is d²?
d² = 1.0 means the innovation is exactly 1 standard deviation. For a χ² distribution with 1 DOF, this corresponds to the 68th percentile — quite plausible. The typical gate threshold at 95% confidence is d² < 3.84.
Innovation ν = [0.3, -0.09]T. Innovation covariance S = [[0.25, 0], [0, 0.01]]. Compute d² = νTS-1ν.
Since S is diagonal, S-1 = diag(1/0.25, 1/0.01) = diag(4, 100).
Even though the bearing innovation (-0.09 rad) looks small, its contribution (0.81) is larger than the range innovation's contribution (0.36) because bearing has much lower variance (0.01 vs 0.25). Mahalanobis distance respects this — small innovations in precise dimensions count more.
Candidate A: d² = 1.17 < 5.99 — passes the gate. This is a statistically plausible match.
Candidate B: d² = 8.5 > 5.99 — fails the gate. At 95% confidence, a true match would produce d² this large only 1.4% of the time. Reject it.
If no candidates pass the gate, the observation is treated as a new landmark and initialized. Here, A passes, so we match to A.
A single wrong association is catastrophic in EKF-SLAM. When you match observation z to the wrong landmark j, the update step "corrects" the state as if j generated z — this shifts both the robot pose and landmark j to explain a measurement that came from somewhere else entirely. Worse, these false corrections propagate through the covariance to every other landmark.
This is why robust data association methods like Joint Compatibility Branch and Bound (JCBB) exist — they evaluate the consistency of the entire set of associations jointly, not each one independently.
Write a function that computes the Mahalanobis distance squared for a 2D innovation with a diagonal covariance.
javascript function mahalanobis(nu0, nu1, s00, s11) { return (nu0 * nu0) / s00 + (nu1 * nu1) / s11; }
This code finds the nearest neighbor match but has a critical bug. The robot keeps matching to wrong landmarks. Click the buggy line.
function nearestNeighbor(z, landmarks, S, gate) { let bestIdx = -1; let bestDist = Infinity; for (let j = 0; j < landmarks.length; j++) { const nu = innovate(z, landmarks[j]); const d2 = nu[0]*nu[0] + nu[1]*nu[1]; if (d2 < bestDist && d2 < gate) { bestDist = d2; bestIdx = j; } } return bestIdx; }
Line 6 is the bug. It computes the Euclidean distance (nu[0]² + nu[1]²) instead of the Mahalanobis distance (nuT S-1 nu). Without scaling by S-1, the function treats range and bearing innovations as equally weighted — but range is in meters and bearing is in radians. A 0.5m range error and a 0.5 rad bearing error are very different levels of surprise!
The fix: const d2 = nu[0]*nu[0]/S[0][0] + nu[1]*nu[1]/S[1][1]; (assuming diagonal S).
Your robot has been driving for 10 minutes, mapping as it goes. Odometry drift has been accumulating — small errors in each motion step add up to meters of error. Then the robot recognizes a place it visited at the start. This loop closure event is the single most powerful source of information in SLAM: it constrains the accumulated drift to be consistent with the re-observation.
A robot moves with 0.1 m/step drift (systematic error). After 100 steps, what is the accumulated position error?
10 meters of drift after 100 steps! For a warehouse robot moving 0.5 m/step, that's only 50 meters of travel. Without loop closure or absolute positioning, the map is unusable after a few minutes.
If each step has independent random error with σ = 0.05 m, what is the expected position error (1σ) after 400 steps?
Random walk drift grows as √k, not k — much slower than systematic drift. This is why wheel encoders with zero-mean noise accumulate error slower than sensors with a constant bias. But even √k diverges eventually; you always need loop closure or external corrections for long-duration navigation.
A robot traverses a square loop (100 steps per side, 400 total). Systematic drift is 0.02 m/step in the forward direction only. When it returns to the start, what is the position error? (Hint: drift accumulates in the direction of travel, which changes at each corner.)
Forward drift per side: 0.02 × 100 = 2.0 m. But on a square loop, the drift directions are:
On a closed loop with constant forward bias, the drift cancels if the robot turns enough. This is a special case. In practice, heading drift (which rotates the direction of subsequent drift) prevents this cancellation and is usually the dominant error source. A 1° heading bias causes the entire trajectory to curve, producing meters of closure error.
Now consider heading drift: each step adds 0.01 rad of heading error (~0.57°). After 100 straight-line steps of 0.5 m each, by how much has the lateral (perpendicular) position drifted? (Approximate: lateral error ≈ distance × sin(accumulated heading error).)
Each step k has accumulated heading error δθ = 0.01k rad. The lateral displacement at step k is approximately 0.5 × sin(0.01k) ≈ 0.5 × 0.01k = 0.005k (small angle).
Using the exact sin: ∑ 0.5 sin(0.01k) for k=1..100 ≈ 24.5 m. The heading drift causes massive lateral error — nearly 50% of the total distance traveled (50m). This is why gyroscope calibration and heading correction are critical for odometry-based navigation.
The correct approach distributes the error across all poses proportionally to their uncertainty. In EKF-SLAM, this happens automatically through the covariance — the Kalman gain allocates corrections based on how uncertain each state element is. In graph SLAM (Chapter 4), this is done by pose graph optimization, which minimizes the total weighted error across all constraints.
Correcting only the final pose would create a discontinuity — landmarks mapped near the end would suddenly jump. Distributing smoothly preserves the local consistency of the map.
EKF-SLAM maintains a single giant state vector and covariance. Graph SLAM takes a different approach: it stores each robot pose as a node and each constraint (odometry, observation, loop closure) as an edge. The full map is recovered by optimizing the entire graph at once. This is more flexible, more scalable, and handles nonlinearity better.
3 poses (1D for simplicity). Odometry constraints: x0→x1 and x1→x2. Loop closure: x0→x2. Each constraint has information ω = 1. The prior on x0 adds ω0 = 10 to Ω[0,0]. How many nonzero entries does Ω (3×3) have?
Each constraint between i and j adds entries at (i,i), (i,j), (j,i), (j,j):
Every entry in the 3×3 matrix gets at least one contribution. Ω[0,0] gets contributions from three sources (odo01 + loop02 + prior), Ω[1,1] gets two (odo01 + odo12), etc. All 9 entries are nonzero because the loop closure connects x0 to x2, filling in the corners. Without the loop closure, Ω[0,2] and Ω[2,0] would be zero (tridiagonal structure).
Same setup: 3 poses, ω = 1 for each constraint, prior ω0 = 10. Each edge (i,j) with information ω adds: +ω to Ω[i,i], +ω to Ω[j,j], -ω to Ω[i,j] and Ω[j,i]. What is Ω[0,0]?
The full matrix:
Ω = [[12, -1, -1], [-1, 2, -1], [-1, -1, 2]]
The diagonal entries are the sum of all information connected to that pose. The off-diagonal entries are the negated information of the direct constraint. This structure makes Ω diagonally dominant — a property that ensures numerical stability when solving the linear system.
Given Ω from above and ξ = [10, 1, 3]T (information vector), solve μ = Ω-1ξ for μ[0] (use the fact that x0 is well-constrained with prior=10, so μ[0] ≈ ξ[0]/Ω[0,0]).
Solving Ωμ = ξ for [[12,-1,-1],[-1,2,-1],[-1,-1,2]]μ = [10,1,3] via Cramer's rule:
The exact solution is μ = [1.4, 2.0, 2.6]. The strong prior (ω=10) on x0 pulls it toward ξ[0]/Ω[0,0] = 10/12 ≈ 0.83, but the odometry and loop closure constraints push it higher. The full solution balances all constraints simultaneously.
A robot traverses 1000 poses in a straight line (chain graph). Each pose is 3D (3 elements). Ω is block-tridiagonal. What fraction of the 3000×3000 matrix is nonzero? (Count nonzero blocks, not individual entries.)
A chain of 1000 nodes has 999 edges. Each edge creates 2 off-diagonal blocks, plus diagonal blocks. Count total nonzero 3×3 blocks.
Only 0.3% of the matrix is nonzero! This extreme sparsity is what makes graph SLAM tractable at large scale. Sparse Cholesky solvers exploit this structure to solve the optimization in O(N) time for chain graphs, compared to O(N³) for dense systems.
Each loop closure adds exactly one edge to the graph, which means 2 additional nonzero off-diagonal blocks in Ω. Five loop closures add only 10 blocks to the existing 2998 — the matrix stays 99.7% sparse.
However, these long-range connections create fill-in during Cholesky factorization. A good variable ordering (like COLAMD or nested dissection) minimizes this fill-in, keeping the solve fast. This is why graph SLAM implementations spend significant effort on variable ordering — it determines whether the solve takes milliseconds or seconds.
Build a 1D information matrix for a chain of poses with optional loop closures.
javascript function buildOmega(n, edges) { const O = new Array(n * n).fill(0); for (const e of edges) { O[e.i * n + e.i] += e.omega; O[e.j * n + e.j] += e.omega; O[e.i * n + e.j] -= e.omega; O[e.j * n + e.i] -= e.omega; } return O; }
Instead of wheel encoders, you estimate motion from camera images. Two consecutive frames share visible features — by tracking how those features moved in the image, you can recover the camera's 3D rotation and translation. The key mathematical object is the essential matrix E, which encodes the geometric relationship between two views.
E = [t]×R has 6 underlying parameters (3 rotation + 3 translation). But scaling E by any constant c gives cE, which satisfies the same epipolar constraints (x'T(cE)x = 0 iff x'TEx = 0). This removes 1 DOF, leaving 5.
Algebraically, E has two equal nonzero singular values and one zero singular value. This imposes constraints on the 9 entries: the essential matrix manifold is a 5-dimensional surface in ℝ9.
Each point correspondence x ↔ x' gives one equation (x'TEx = 0). Since E has 5 DOF and we solve up to scale (so effectively need to determine 5 ratios), what is the minimum number of point correspondences needed to solve for E?
This is the 5-point algorithm (Nister, 2004). Each correspondence provides one linear equation in the entries of E. Five correspondences give 5 equations for 5 unknowns (after fixing the scale). In practice, the 5-point algorithm uses polynomial solvers and can return up to 10 solutions, which are disambiguated using additional correspondences or geometric checks.
Given E = [[0, -1, 0], [1, 0, -0.5], [0, 0.3, 0]] and correspondences x = [0.5, 0.3, 1]T, x' = [0.6, 0.2, 1]T. Compute x'TEx. Is the epipolar constraint satisfied?
The result is -0.09, not zero. This means the correspondence either has noise or doesn't perfectly satisfy the epipolar constraint for this E. In RANSAC, you'd compare |x'TEx| against a threshold (typically derived from the Sampson distance). A residual of 0.09 is moderate — whether it's an inlier depends on the noise level and the threshold you set.
Normalized coordinates: x = K-1p. Substituting into x'TEx = 0: (K'-1p')T E (K-1p) = p'T K'-T E K-1 p = 0. So F = K'-TEK-1, or equivalently E = K'TFK.
When both cameras have the same intrinsics (K' = K), this simplifies to E = KTFK. The essential matrix removes the camera-specific calibration, leaving only the pure geometric relationship. This is why calibrated cameras should use E (5 DOF, more constrained) while uncalibrated cameras must use F (7 DOF).
A monocular VO system estimates translation direction t̂ = [0.8, 0, 0.6]T (unit vector). A stereo system with 10cm baseline resolves the scale: actual translation = 2.5 meters. What is the full translation vector?
The monocular system gives the direction [0.8, 0, 0.6] but cannot determine the scale factor 2.5. The stereo baseline (or IMU, or known object size) provides the missing scale. Without it, all distances are relative — the VO trajectory could be scaled by any positive constant and still be consistent with the observations.
You've extracted 500 feature matches between two images. But 40% are wrong — mismatches from repeated textures, occlusions, or detector noise. If you feed these to the 5-point algorithm, the outliers will corrupt the essential matrix. You need RANSAC (Random Sample Consensus) — sample minimal subsets, fit a model, count inliers, and keep the best model.
50% inliers (ε = 0.5), 5-point model (s = 5), 99% confidence (p = 0.99). How many RANSAC iterations needed?
Each iteration: sample 5 random correspondences, compute E, count inliers (points with |x'TEx| < threshold). After 146 iterations, there's a 99% chance at least one sample was all-inlier. With 500 correspondences, each iteration takes microseconds — 146 iterations is fast.
Same setup but with better matching: 80% inliers (ε = 0.2), s = 5, p = 0.99. How many iterations?
With 80% inliers instead of 50%, the iteration count drops from 146 to just 12 — a 12× reduction. This is why feature matching quality matters more than RANSAC sophistication. Better descriptors (SuperPoint, DISK) with better matchers (SuperGlue, LightGlue) can push inlier rates above 90%, making RANSAC nearly instantaneous.
The 8-point algorithm needs s=8 samples. At ε=0.5 and p=0.99, how many iterations does it need? Compare to the 5-point result.
1177 vs 146 — the 8-point algorithm needs 8× more RANSAC iterations at 50% inliers. Each additional sample point in the minimal set multiplies the iteration count because it's exponentially harder to draw a clean sample. This is why the 5-point algorithm (despite being algebraically harder) is preferred in practice: fewer samples = fewer RANSAC iterations = faster runtime.
Write a function that computes the number of RANSAC iterations needed.
javascript function ransacIterations(inlierRatio, sampleSize, confidence) { const w = Math.pow(inlierRatio, sampleSize); return Math.ceil(Math.log(1 - confidence) / Math.log(1 - w)); }
This RANSAC loop has a subtle bug that causes it to sometimes return a model fitted to outliers. Click the buggy line.
function ransac(points, fitModel, countInliers, n) { let bestModel = null; let bestCount = 0; for (let i = 0; i < n; i++) { const sample = randomSample(points, 5); const model = fitModel(sample); const count = countInliers(model, sample); if (count > bestCount) { bestCount = count; bestModel = model; } } return bestModel; }
Line 7 is the bug. It counts inliers against the sample (the 5 sampled points) instead of against ALL points. The model is fitted to exactly the sample, so it will always have 5 "inliers" from the sample — even if the model is complete nonsense for the rest of the data. The fix: countInliers(model, points). RANSAC's power comes from evaluating each hypothesis against the full dataset, not just the minimal sample used to generate it.
The inlier threshold is a critical hyperparameter. Too large (τ = 10 pixels): gross outliers sneak in and the "best" model may fit a mixture of good and bad correspondences. Too small (τ = 0.1 pixels): even well-matched points get rejected due to noise, making the effective inlier ratio drop and requiring many more iterations.
A good rule of thumb: set τ = 2-3× the measurement noise standard deviation. For feature matching with sub-pixel accuracy, this is typically 1-3 pixels. For the Sampson distance (a better geometric error than x'TEx), τ² follows a χ²(1) distribution — set τ² = 3.84 for 95% confidence.
A factor graph is a bipartite graph with two types of nodes: variable nodes (the unknowns: poses, landmarks) and factor nodes (the constraints: odometry, observations, priors). It's a more general representation than a pose graph — it can represent any probabilistic inference problem, and modern SLAM backends (GTSAM, Ceres, g2o) are all factor graph solvers.
A robot moves through 5 poses, observing 3 landmarks. It has: 1 prior factor on pose 0, 4 odometry factors (between consecutive poses), and 6 observation factors (each landmark seen from 2 poses). How many total factors and variables?
The system is over-determined: 11 factors constraining 8 variables (each with multiple DOF). This redundancy is what makes the estimate robust — if one factor is noisy, the others compensate.
When you eliminate x2, all its neighbors must become connected (this is the fill-in caused by elimination). x2's neighbors are x1 and x3, so a new edge x1—x3 is created. The chain becomes x0-x1-x3-x4 with the extra x1—x3 edge.
For a chain graph, the optimal elimination order is from one end to the other (e.g., x0, x1, x2, x3, x4). This produces zero fill-in because each node has only one remaining neighbor when eliminated. Eliminating from the middle causes fill-in, making the computation more expensive.
In a star graph (central node c connected to leaves a, b, d, e — no edges between leaves), how many fill-in edges are created by eliminating c first?
Eliminating the central node first is the worst possible choice — it creates maximum fill-in, turning the sparse star into a dense clique. The optimal order eliminates leaves first (zero fill-in per leaf), then the center last. This principle generalizes: always eliminate nodes with few connections first. Variable ordering algorithms like COLAMD automate this heuristic.
A 2D SLAM factor graph has 100 poses (each 3 DOF: x, y, θ), 50 landmarks (each 2 DOF), 99 odometry factors (each constrains 3 DOF), 1 prior factor (3 DOF), and 200 observation factors (each constrains 2 DOF). How many total constraint DOF does the system have?
The system is over-determined by 300 DOF — there are 1.75× more constraints than unknowns. This redundancy is what makes the least-squares solution robust to noise. Each additional observation factor adds 2 constraints, further improving the estimate. An under-determined system (fewer constraints than unknowns) would have infinitely many solutions — you'd need more observations or stronger priors.
Put these factor graph SLAM steps in the correct order.
The correct order: Prior → Odometry → Data association → Observation factors → Optimize
First, anchor the graph with a prior on the initial pose. Then add the odometry constraint for the new motion. Before adding observations, perform data association to determine which landmark each measurement corresponds to (or if it's a new landmark). Then add the observation factors with the correct landmark IDs. Finally, optimize the entire graph to get the MAP estimate.
SLAM tells you where landmarks are. But for navigation, you need to know where the robot can and cannot go. An occupancy grid divides the world into cells, each storing the probability that it's occupied. The key trick is using log-odds to make updates additive instead of multiplicative.
Convert p = 0.9 to log-odds l.
Reference values: p=0.5 → l=0, p=0.73 → l=1, p=0.88 → l=2, p=0.95 → l=3, p=0.99 → l=4.6.
Starting from l = 0 (p = 0.5 prior), observe the cell as occupied (psensor = 0.9), then observe it as free (psensor = 0.3). What is the final log-odds and probability?
One occupied and one free observation leaves the cell more likely occupied (p ≈ 0.79) because the occupied sensor (p=0.9, contributing +2.20 log-odds) was more confident than the free sensor (p=0.3, contributing -0.85 log-odds). The occupied observation "won" by a margin of 1.35 log-odds. To bring the cell back to 50/50, you'd need about 2.6 more free observations at p=0.3 each.
Starting from l=0, how many consecutive "occupied" observations (each with psensor = 0.7) are needed to reach p ≥ 0.95?
Each observation adds lobs = log(0.7/0.3) ≈ 0.847 to the log-odds. Need l ≥ log(0.95/0.05) = log(19) ≈ 2.944.
With a relatively weak sensor (70% accuracy), you need 4 consistent observations to reach 95% confidence. A stronger sensor (p=0.9) would need only 2 observations (2 × 2.197 = 4.394, p = 0.988). This is why LiDAR-based mapping (very confident per-ray) converges faster than camera-based mapping (less certain per pixel).
A 100m × 100m environment mapped at 5cm resolution. How many grid cells? How much memory at 1 byte per cell?
4 MB is manageable. But at 1cm resolution: 10,000 × 10,000 = 100M cells = 100 MB. And in 3D at 1cm: 10,000³ = 1012 cells = 1 TB — impossible. This is why 3D maps use octrees (only allocate cells near surfaces), point clouds, or signed distance fields instead of dense grids.
Write a function that performs a log-odds occupancy update and returns the new probability.
javascript function logOddsUpdate(currentLogOdds, sensorProb) { const update = Math.log(sensorProb / (1 - sensorProb)); const newL = currentLogOdds + update; const newP = 1 / (1 + Math.exp(-newL)); return [newL, newP]; }
In a static world, unclamped log-odds are fine — cells converge to their true occupancy. But in the real world, objects move. A chair observed 50 times at location A has l=50×2.2 = 110 (p≈1). When the chair moves, clearing it requires 50+ free observations to bring l back below 0. With clamping at l=5, you need at most ~6 free observations to flip the cell — the map responds to changes in seconds instead of minutes.
You're building a mobile robot with a 2D LiDAR that needs to explore and map an indoor office (50m × 50m). Design the complete SLAM system: state representation, motion model, measurement model, data association, loop closure, and map output. Each exercise in this capstone integrates concepts from the previous chapters.
Your robot uses a graph-based SLAM approach. After a 5-minute exploration at 10 Hz (10 poses/second), with keyframe selection keeping every 10th pose, and the LiDAR detecting an average of 2 new landmarks per keyframe: how many variables (poses + landmarks) are in the factor graph?
In 2D, each pose has 3 DOF and each landmark 2 DOF, so the total DOF is 300×3 + 600×2 = 900 + 1200 = 2100. The information matrix is 2100×2100 but extremely sparse — easily solvable in real-time with modern sparse solvers like GTSAM's iSAM2.
Your robot uses differential-drive odometry. The wheels have 1% distance error and 0.5°/m heading error. For a 0.5m straight-line motion: what is the position uncertainty σpos and heading uncertainty σθ?
These per-step errors seem tiny, but they accumulate. After 100 steps (50m): σpos ≈ 0.005 × √100 = 0.05m (random walk). The heading error is more dangerous: σθ ≈ 0.00436 × √100 = 0.0436 rad ≈ 2.5°. A 2.5° heading error over the next 50m of travel causes ~2.2m of lateral drift. This is why heading estimation is the bottleneck in odometry-based SLAM.
Your LiDAR has 360° FOV, 1° angular resolution, 12m max range. In a typical office, about 30% of rays hit a wall/obstacle within range. How many valid range readings per scan? If each landmark is a distinct corner detected from clustered range discontinuities, and corners are found every ~20 valid rays on average, how many landmarks per scan?
About 5 landmarks per scan is typical for indoor environments. This means your factor graph gets ~5 observation factors per keyframe. Over 300 keyframes, that's ~1500 observation factors — providing dense constraints that make the optimization well-conditioned.
The 50m×50m office is mapped as an occupancy grid at 2cm resolution. Each cell stores an int8 log-odds value. How much memory for the grid? If you also store the factor graph with 900 variables (2100 DOF total) as double-precision floats for the mean vector and sparse information matrix (~3 nonzero blocks per variable, each 3×3 or 2×2), approximately how much total memory for the SLAM state?
The occupancy grid dominates memory by 50×. The factor graph itself is tiny. This is why SLAM research focuses on efficient map representations — the map is the bottleneck, not the pose graph. At 1cm resolution, the grid would be 25 MB; at 0.5cm, 100 MB.
Loop closure verification requires multiple independent checks because a false positive is catastrophic (Ch. 2):
(1) Place recognition: Scan descriptors (scan context, BoW) identify candidate locations that "look like" the current scan. Fast but noisy — gives candidates, not confirmations.
(2) Geometric verification: ICP or scan matching computes the precise relative transform. If the alignment error is high, reject.
(3) Mahalanobis gate: Is the relative transform consistent with what the current pose graph predicts? If the graph says we should be 10m away but the closure says 0m, something is wrong.
(4) Consistency check: Does adding this constraint make the overall graph residual increase dramatically? If so, either this is a false closure or the graph has other errors that need resolution first.
Put these components of a complete SLAM system in the correct processing order for each new sensor frame.
The correct order: Motion prediction → Feature extraction → Data association → Loop closure detection → Graph optimization → Update occupancy grid
(1) Predict the new pose from odometry. (2) Extract features/landmarks from the new scan. (3) Match features to known landmarks (data association). (4) Check if this location has been visited before (loop closure — uses the current scan and map). (5) Add all new factors and optimize the graph. (6) With the optimized poses, ray-trace through the scans to update the occupancy grid.
The map update comes LAST because it depends on the optimized poses — updating the map with pre-optimization poses would project scans to wrong locations.
Your SLAM system must run at 10 Hz (100ms budget per frame). Typical timings: odometry prediction 0.5ms, feature extraction 15ms, data association 5ms, loop closure check 10ms (amortized), graph optimization 30ms (incremental), map update 20ms. Total? Does it fit?
It fits with 19.5ms of headroom — about 20% margin. In practice, graph optimization can spike during loop closures (full relinearization instead of incremental), and feature extraction varies with scene complexity. The 20% margin is tight but workable. If it overruns, common optimizations: (1) run loop closure in a separate thread, (2) use iSAM2 for incremental optimization, (3) update the map at a lower rate (every 5th frame).
| Topic | Lesson |
|---|---|
| Classical SLAM fundamentals | Classical SLAM — From Absolute Zero |
| Visual-Inertial Odometry | Classical VIO — From Absolute Zero |
| EKF foundations | Extended Kalman Filter — From Absolute Zero |
| Bayesian estimation | Bayesian Estimation — From Absolute Zero |
| Modern SLAM (learning-based) | Modern SLAM — From Absolute Zero |