The algorithm that lets robots explore unknown environments — building a map while simultaneously figuring out where they are within it.
Imagine you wake up in a dark building with a flashlight. You want to draw a floor plan (a map). But to mark anything on the map, you need to know where you are. And to know where you are, you need... a map. It's a chicken-and-egg problem.
SLAM (Simultaneous Localization And Mapping) solves both at the same time. As the robot moves, it observes landmarks, estimates its own pose, and builds a map — all in one continuous loop. It's one of the fundamental problems in robotics.
A robot (green) moves through unknown space, sensing landmarks (yellow). It must estimate both its own path and the landmark positions.
The deep connection: SLAM is just a Kalman filter where you keep appending new states every time you discover something. The predict/update structure is identical — only the dimensionality grows.
If a Kalman filter tracks a single ball with a 4D state [x, y, vx, vy], what would the state look like if you tracked 3 balls simultaneously? How is that analogous to SLAM?
A "map" in robotics isn't necessarily a picture. It's a representation of the environment that the robot can use for navigation and planning. There are several common types, each with tradeoffs.
Toggle between three representations of the same room. Each shows the same space differently.
A robot's pose in 2D is three numbers: position (x, y) and heading angle θ. Together, they describe where the robot is and which direction it's facing. This is a rigid body transform — it tells us how to convert from the robot's frame to the world frame.
When the robot moves, its new pose depends on the old pose plus a motion command (drive forward, turn). But real motors are imperfect, so every motion adds noise. Over time, this odometry drift accumulates and the robot becomes lost.
Watch how odometry (red dashed) drifts away from the true path (teal). Without corrections, errors compound.
Before the robot can localize or map, it needs to find distinctive features in its sensor data. These are things like corners, edges, blobs — anything that looks unique and can be reliably re-detected later. This is called the frontend of a SLAM system.
Each detected feature gets a descriptor — a compact numerical summary that captures what it looks like. Think of it as a fingerprint. When the robot sees a feature again from a different viewpoint, the descriptor helps match it to the stored version.
Features are detected at corners and distinctive points (yellow). Move the threshold slider to see how detection sensitivity changes.
The robot sees a corner. Is it the same corner it saw 30 seconds ago, or a completely different one? This matching problem is called data association, and getting it wrong is catastrophic — if you match to the wrong landmark, your map becomes inconsistent.
The simplest approach is nearest neighbor: match each observation to the closest known landmark. But this fails when landmarks are close together. More robust methods include JCBB (Joint Compatibility Branch and Bound), which considers all matches jointly.
Current observations (teal) must match to known landmarks (yellow). Lines show matches. Red lines are wrong matches. Click "Scramble" to see how ambiguity arises.
| Method | Approach | Robustness |
|---|---|---|
| Nearest Neighbor | Match to closest landmark | Low |
| Mahalanobis Gate | NN but weighted by uncertainty | Medium |
| JCBB | Check all matches jointly | High |
| RANSAC | Random sample consensus | High |
1. Why it's dangerous: If you match to the wrong landmark, the EKF update will "pull" both the robot pose and the landmark estimate toward an incorrect configuration. Worse, because EKF-SLAM maintains correlations between ALL landmarks, this one bad match can corrupt estimates for landmarks you aren't even currently observing. The covariance becomes inconsistent — the filter thinks it's certain about wrong positions.
2. Mahalanobis gate: Instead of raw Euclidean distance, the Mahalanobis distance weights by the innovation covariance S = H P HT + R. A landmark with high uncertainty (large P block) will have a larger "gate" around it, while a well-localized landmark has a tight gate. This prevents matching to a precise landmark that happens to be nearby in space but would require an implausible robot pose correction.
3. Beyond distance: Feature descriptors! Two landmarks may be spatially close but look completely different (a corner vs. an edge). JCBB goes further — it checks whether the JOINT set of matches is mutually consistent, rejecting individual matches that would require contradictory robot poses.
The classic approach. We augment the robot's state vector with every landmark position. If we have N landmarks, the state is [x, y, θ, l1x, l1y, ..., lNx, lNy] — a vector of size 3+2N. The covariance matrix is (3+2N) × (3+2N), tracking correlations between everything.
The EKF (Extended Kalman Filter) handles the nonlinear motion and observation models by linearizing them at each step. Predict: move robot, grow uncertainty. Update: observe a landmark, shrink uncertainty — for the robot and all correlated landmarks.
Watch the state vector grow as new landmarks are discovered. Uncertainty ellipses (cyan) show correlations. Click "Step" or toggle auto-play.
In EKF-SLAM, the state vector is x = [r, l1, l2, ..., lN] where r is the 3D robot pose and each li is a 2D landmark position. The covariance matrix P tracks uncertainty and correlations between all state elements.
Your task: Show that (1) the covariance matrix is (3+2N) × (3+2N), (2) explain why the off-diagonal blocks are non-zero and what they physically mean, and (3) derive why a single EKF update step costs O(N²) operations.
Step 1: State and covariance dimensions.
State x = [xr, yr, θr, x1, y1, x2, y2, ..., xN, yN]T has dimension d = 3 + 2N. The covariance P is d×d = (3+2N)² entries.
Step 2: Why off-diagonals are non-zero.
Consider observing landmark j. The observation model is z = h(xr, lj) + noise. The Jacobian H has non-zero blocks only for xr and lj. BUT the Kalman gain K = PHTS-1 multiplies the FULL covariance P by HT. Since Prli blocks are non-zero (for previously observed landmarks), K has non-zero entries for ALL landmarks, not just the observed one.
Physically: Plilj ≠ 0 means "landmarks i and j were both observed from the same uncertain robot pose, so their errors are correlated." If we learn landmark i is 1m east of where we thought, we gain information about landmark j too.
Step 3: O(N²) per update.
The covariance update P ← P - KSKT (Joseph form) requires forming KSKT which is (3+2N)×(3+2N). Even with sparse H, K is a (3+2N)×2 matrix (2 because observation is [range, bearing]), so KSKT is a rank-2 update to the full P. Computing this outer product is O((3+2N)²) = O(N²).
The key insight: Correlations are both EKF-SLAM's strength (observing one landmark improves ALL estimates) and its curse (you can never drop the off-diagonal blocks without losing optimality, and they grow as O(N²)).
python import numpy as np def augment_state(mu, P, z_range, z_bearing, R_obs): n = len(mu) xr, yr, theta = mu[0], mu[1], mu[2] # 1. Landmark position from inverse observation model angle = theta + z_bearing lx = xr + z_range * np.cos(angle) ly = yr + z_range * np.sin(angle) # 2. Augment state vector mu_new = np.append(mu, [lx, ly]) # 3. Jacobian of inverse obs model w.r.t. robot pose Gr = np.array([ [1, 0, -z_range * np.sin(angle)], [0, 1, z_range * np.cos(angle)] ]) # 4. Jacobian w.r.t. observation [range, bearing] Gz = np.array([ [np.cos(angle), -z_range * np.sin(angle)], [np.sin(angle), z_range * np.cos(angle)] ]) # 5. Augment covariance P_new = np.zeros((n + 2, n + 2)) P_new[:n, :n] = P # existing block unchanged # Cross-correlation: new landmark with existing state Prr = P[:3, :3] cross = Gr @ P[:3, :] # (2 x n) cross-correlation row P_new[n:, :n] = cross P_new[:n, n:] = cross.T # New landmark self-covariance P_new[n:, n:] = Gr @ Prr @ Gr.T + Gz @ R_obs @ Gz.T return mu_new, P_new
FastSLAM takes a clever shortcut using Rao-Blackwellization. The insight: if you knew the robot's path exactly, the landmark positions would be conditionally independent. You wouldn't need that massive joint covariance matrix.
So FastSLAM uses particles to represent the robot's pose distribution. Each particle carries its own mini-EKF for each landmark. Since landmarks are independent given the path, each mini-EKF is just 2×2 — no quadratic blowup.
Each green dot is a particle (robot pose hypothesis). After resampling, particles cluster around the true pose. Landmarks shown in yellow.
The modern workhorse. Instead of maintaining a filter, we build a graph. Each node is a robot pose (at some time) or a landmark position. Each edge is a constraint — an odometry measurement between poses, or a landmark observation. This forms a factor graph.
SLAM becomes an optimization problem: find the node positions that best satisfy all constraints simultaneously. This is solved with iterative least-squares methods like Gauss-Newton. The key advantage: we can re-optimize the entire trajectory when new information arrives (like a loop closure).
Blue nodes = robot poses. Yellow nodes = landmarks. Edges = constraints. Click "Optimize" to see least-squares adjustment pull everything into consistency.
| Approach | State | Updates | Scalability |
|---|---|---|---|
| EKF-SLAM | Current only | Recursive | O(N²) |
| FastSLAM | Current + particles | Sequential | O(M·N) |
| Graph-SLAM | All history | Batch optimize | Sparse O(N) |
The EKF maintains covariance P (dense). The information filter maintains the inverse: the information matrix Ω = P-1 and information vector ξ = P-1μ. In SLAM, Ω is naturally sparse — zero entries between landmarks that were never observed together.
Your task: Show that (1) Ωij = 0 when landmarks i and j were never co-observed from the same robot pose, (2) explain why this sparsity makes graph-SLAM efficient, and (3) show the connection between Ω and the graph's adjacency structure.
Step 1: Conditional independence → sparse precision.
In a Gaussian N(μ, P), the precision matrix Ω = P-1 has a beautiful property: Ωij = 0 ⇔ xi ⊥ xj | x\{i,j\} (conditional independence). Landmark i and landmark j are conditionally independent given the robot poses they were observed from. If they were NEVER observed from the same pose, they have no direct constraint — only indirect ones through shared poses.
Step 2: Factor graph = sparsity pattern.
Each factor (measurement) adds to Ω only in the rows/columns of the connected nodes. An odometry factor between poses xk and xk+1 adds to Ωk,k, Ωk,k+1, Ωk+1,k, Ωk+1,k+1. A landmark observation from pose xk to landmark lj adds to their mutual entries. The rest stays zero.
Step 3: Sparsity → efficiency.
The least-squares normal equations give HTΣ-1H Δx* = HTΣ-1e where H is the Jacobian. The matrix HTΣ-1H IS the information matrix Ω. With N poses and M landmarks, Ω is (3N+2M)×(3N+2M) but each row has only ~10 non-zeros (degree of the graph). Sparse Cholesky (e.g., CHOLMOD) exploits this for near-linear solve time.
The key insight: EKF-SLAM maintains P (dense, because all landmarks are marginally correlated). Graph-SLAM works with Ω (sparse, because most landmarks are conditionally independent). Same information, dual representation — but sparsity changes the computational game entirely.
The robot is at pose xk (current) and recognizes it's revisiting the location of pose x0 (from the past). The relative transform between xk and x0 is measured as zloop with covariance Σloop.
Your task: Write the loop closure constraint as a factor in the least-squares objective, and show why it causes ALL intermediate poses x1...xk-1 to shift when you re-optimize.
Full objective function:
x* = argmin Σi=0..k-1 ||zodom,i − (xi+1 ⊖ xi)||²Σodom + ||zloop − (xk ⊖ x0)||²Σloop
Why ALL poses shift: The Gauss-Newton system HTΩH Δx = HTΩe has H as a banded matrix (tridiagonal-ish) for the odometry chain, PLUS an entry connecting x0 and xk. Solving this linear system propagates the loop closure correction through the chain. The correction at pose xi is proportional to i/k — poses near x0 barely move, poses near xk move the most. This is because odometry between consecutive poses is "stiff" (low noise), so bending the chain is distributed uniformly.
The key insight: A loop closure with covariance Σloop much smaller than accumulated odometry uncertainty (Σodom × k) effectively "snaps" the trajectory back. The tighter the loop constraint, the more dramatic the correction.
The robot has been exploring for 10 minutes. Odometry has drifted badly. Then it returns to a place it visited at the start — and recognizes it. This is loop closure, and it's the most impactful event in SLAM. A single loop closure can correct accumulated drift across the entire trajectory.
Detection uses place recognition: comparing the current scene's features against a database of all previously visited places. Bag-of-words (DBoW) and learned descriptors are common approaches. Once a loop is detected, a new constraint is added to the graph, and re-optimization snaps everything into alignment.
The red path shows drifted odometry forming a loop. Click "Close Loop" to see the constraint propagate corrections backward through the entire trajectory.
Time to see it all come together. Below is a 2D room with landmarks. A robot explores the space, observing landmarks within its sensor range. Watch the map build in real-time. Toggle between the EKF view (uncertainty ellipses) and the graph view (nodes and edges). Use arrow keys or click "Auto-explore" to drive the robot.
Arrow keys to drive manually. Sensor range shown as a circle around the robot.
1. Graph-SLAM (specifically Cartographer/LOAM variants). EKF-SLAM is O(N²) and dies at 50k features. Particle filters need too many particles for 10cm accuracy in 10,000m². Graph-SLAM with sliding window (keep last 500 poses, marginalize old ones) stays real-time on Orin. Use scan matching (ICP/NDT) instead of individual features to handle repetitive geometry.
2. Multi-modal constraints + aisle numbering. Pure lidar geometry is ambiguous in repetitive aisles. Solutions: (a) downward camera reads floor markings/barcodes for absolute localization (like waypoints), (b) count aisle transitions as topological constraints, (c) use multi-hypothesis tracking — maintain 2-3 loop closure candidates until one can be disambiguated.
3. Dynamic object filtering. (a) Compare current scan to static map — points that don't match the map are classified as dynamic and excluded from SLAM. (b) Use occupancy grid with decay: cells revert to "unknown" after not being observed as occupied for N scans. (c) Temporal consistency check: landmarks must persist for >5 observations before becoming part of the permanent map.
4. Hybrid: persistent static layer + ephemeral dynamic layer. Walls and shelving racks form the static map (updated monthly). Pallet positions form a dynamic layer (updated per-flight). The static map provides localization; the dynamic layer provides obstacle avoidance and inventory tracking. This is essentially "lifelong SLAM with semantic labels."
5. Centralized map server with delta updates. Each drone sends only its new pose graph nodes + scan descriptors to a central server. Server performs multi-robot graph optimization and broadcasts compressed map deltas. Bandwidth: ~50KB/s per drone (pose + descriptor), not full scans. The drones download "regions of interest" when entering a new area.
Classical SLAM is the foundation, but the field has exploded in the last decade. Modern systems combine visual features, inertial measurements, and even neural networks. Here's where SLAM connects to everything else.
| System | What It Adds | When To Use |
|---|---|---|
| Visual SLAM (ORB-SLAM) | Camera features + bundle adjustment | Camera-only systems |
| VIO (Visual-Inertial Odometry) | Tightly coupled camera + IMU | Drones, AR/VR headsets |
| LiDAR SLAM (LOAM) | Point cloud matching | Self-driving cars, surveying |
| NeRF/3DGS SLAM | Neural scene representation | Dense 3D reconstruction |
The deep structure: both are message passing on graphs. In SLAM, messages are corrections weighted by measurement precision. In attention, messages are value vectors weighted by query-key similarity. Both iteratively refine node estimates by aggregating neighbor information. Graph neural networks, belief propagation, and loopy BP all share this backbone.
Modern SLAM systems (like those using learned features) sometimes use a GNN to predict loop closures. Why is this a natural fit given the message-passing structure above?
You now understand the core concepts behind every SLAM system. From vacuum robots to self-driving cars, the chicken-and-egg problem is the same — only the sensors and scale change.