The EM algorithm for mapping: alternating localization and mapping, grid-based EM, and layered representations.
All SLAM algorithms in Chapters 10-12 assume known correspondences, or use maximum likelihood to determine them on-the-fly. But what happens when landmarks are genuinely ambiguous? When the robot returns to a place and can't tell which previously seen feature matches which current observation?
This chapter introduces Expectation-Maximization (EM) for mapping — an algorithm that can handle ambiguous correspondences by iterating between localization and mapping. Unlike EKF/EIF SLAM, which commit to a single correspondence hypothesis, EM maintains a probability distribution over correspondences.
The trade-off: EM is a batch algorithm (requires multiple passes over the data) and only finds the most likely map (no uncertainty estimate). But it can handle raw sensor data and ambiguous correspondences that would break Kalman-based approaches.
EM mapping finds the most likely map by alternating two steps:
where d is all data (measurements and controls). Since searching the space of all maps is intractable, EM performs hill climbing:
E-Step (Expectation / Localization): Given the current map estimate m[i], localize the robot at every point in time. Compute a probability distribution over poses for each time step.
M-Step (Maximization / Mapping): Given the pose distributions from the E-step, compute a new map m[i+1] that maximizes the expected log-likelihood.
The E-step runs a full forwards-backwards localization pass over the data. For each time step τ, it computes:
The first factor is the backward pass (future data constraining past poses). The second is the forward pass (past data). Combining them gives the best possible pose estimate using ALL data.
The result is a probability distribution Isτ over poses for every time step. This distribution serves as a "soft assignment" for the M-step — each pose is weighted by its probability rather than committed to a single value.
The M-step builds a new map that maximizes the expected log-likelihood under the pose distributions from the E-step:
For occupancy grid maps, this decomposes per-cell: each cell's occupancy is determined by the weighted evidence from all times when the robot might have observed it, weighted by the pose probabilities.
EM is guaranteed to improve the map likelihood at each iteration (or leave it unchanged at a fixed point). It converges to a local maximum of p(m | d).
In practice, 5-20 iterations typically suffice. The initial map (from raw odometry) is usually terrible, but each iteration produces a noticeably better map. The improvement is often dramatic after just 2-3 iterations.
| Iteration | Map Quality | Description |
|---|---|---|
| 0 | Terrible | Raw odometry; drift makes everything misaligned |
| 1 | Rough | Major structures visible; local alignment improved |
| 2-3 | Good | Walls are straight; rooms recognizable |
| 5-10 | Excellent | Near-optimal; convergence nearly complete |
The E-step requires representing the pose distribution over a 3D space (x, y, θ). Grid-based EM discretizes this on a grid, similar to grid localization (Chapter 8).
The computational cost is dominated by the E-step: for each time step, the localization runs over the full grid. For T time steps and K grid cells, the E-step is O(T · K). The M-step is O(T · K · M) where M is the number of map cells.
Standard EM doesn't scale to large environments because the grid localization becomes too expensive. Layered EM addresses this by decomposing the problem into layers:
1. Build small local maps from consecutive data segments. Each local map covers a small area and is internally consistent.
2. Run EM on the relative positions of local maps, not individual sensor readings. This is a much smaller optimization problem.
The perceptual model for layered EM compares overlapping local maps: if local map A and local map B are placed at the correct relative positions, their overlapping regions should agree. The measurement likelihood measures this agreement.
Each local map is built from a short segment of data (e.g., 50-200 scans) using standard occupancy grid mapping (Chapter 9). The segment is short enough that odometry errors are small, so the local map is accurate.
The local map serves as the "measurement" for the higher-level EM. Instead of matching raw sensor readings to a global map, EM matches local maps to each other. This is both faster and more robust.
EM mapping has produced some of the largest and most accurate maps in robotics research:
| Environment | Challenge | Result |
|---|---|---|
| Rectangular corridor | Four identical corners; high symmetry | EM correctly identifies all four corners |
| Large office building | Multiple loops; long corridors | Consistent metric map with aligned loops |
| Multi-floor building | Repeated structure across floors | Layered EM aligns floor plans correctly |
| Concept | Key Idea |
|---|---|
| EM mapping | Alternate localization (E) and mapping (M) to find the most likely map |
| E-step | Forwards-backwards localization over all data, producing pose distributions |
| M-step | Build map from data weighted by pose distributions |
| Convergence | Monotonically improves map likelihood; converges to local maximum |
| Layered EM | Build local maps, then align them globally |