Modalities & Methods

Point Clouds

LiDAR and depth sensors see the world as an unordered cloud of 3D points: no grid, no order, no fixed size. This lesson covers how PointNet, PointNet++, and point transformers learn from sets of points, where convolutions cannot go, and how ICP aligns one cloud to another.

Prerequisites: An MLP maps a vector to a vector + Max pooling takes the maximum. That’s it.
10 chapters · 9+ simulations · 0 assumed knowledge

Chapter 0: No Grid

An image is a tidy grid: pixels in neat rows and columns, every one with a fixed neighbor. That regular structure is exactly what a convolution exploits — slide a small kernel over the grid. But a point cloud — what a LiDAR or depth camera produces — is nothing like that. It’s a set of 3D points: a list of (x, y, z) coordinates (sometimes with extra features like color or intensity), scattered through space with no grid, no order, and no fixed count. A self-driving car’s LiDAR might return 100,000 such points per sweep.

You cannot run a convolution on this. There’s no grid to slide over, no notion of “the pixel to the right.” And you cannot just flatten the points into a vector and feed an MLP, because the points have no canonical order — the same shape can be listed in any of N-factorial orderings, and the network must give the same answer for all of them. Learning on point clouds needs architectures built for unordered sets. That’s what PointNet and its descendants provide, and what this lesson builds.
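The ordering problem is easy to see in a few lines. The sketch below is an illustration only: the sizes, the random weights `W`, and the helper `flat_layer` are made up for this example, and a single dense layer stands in for "an MLP on the flattened cloud".

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5                                     # hypothetical toy cloud size
points = rng.normal(size=(N, 3))          # one (x, y, z) row per point
W = 0.1 * rng.normal(size=(3 * N, 4))     # weights of one dense layer (random, not trained)

def flat_layer(cloud):
    """Flatten the cloud to a 3N-vector and apply one dense layer."""
    return cloud.reshape(-1) @ W

# Draw a permutation that really changes the order.
perm = rng.permutation(N)
while np.array_equal(perm, np.arange(N)):
    perm = rng.permutation(N)

out_original = flat_layer(points)
out_shuffled = flat_layer(points[perm])

# Same set of points, different listing order: the outputs disagree.
print(np.allclose(out_original, out_shuffled))
```

Each weight row is tied to a slot in the list rather than to a point, so reordering the list changes which weights each point meets.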

The trap: “just voxelize the points into a 3D grid and use a 3D CNN.” You can, and some methods do, but most cells of a dense 3D grid are empty (point clouds are sparse), so it wastes memory and compute, and snapping points to voxels loses fine detail. Operating directly on the points, respecting their set structure, avoids both costs once you solve the ordering problem.
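To get a feel for how empty a dense grid is, here is a rough sketch that voxelizes a toy cloud at several resolutions. The cloud (2,000 points on a unit sphere) and the helper `occupancy_fraction` are assumptions made for this illustration, not part of any particular method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy cloud: 2,000 points on the surface of a unit sphere.
directions = rng.normal(size=(2000, 3))
points = directions / np.linalg.norm(directions, axis=1, keepdims=True)

def occupancy_fraction(cloud, resolution):
    """Fraction of cells in a resolution^3 grid over [-1, 1]^3 that hold a point."""
    # Map coordinates from [-1, 1] to integer cell indices in [0, resolution - 1].
    idx = np.floor((cloud + 1.0) / 2.0 * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    occupied = np.unique(idx, axis=0).shape[0]
    return occupied / resolution**3

for res in (8, 16, 32, 64):
    print(res, f"{occupancy_fraction(points, res):.4f}")
```

The number of cells grows with the cube of the resolution while the number of occupied cells can never exceed the number of points, so the occupied fraction shrinks quickly as the grid gets finer.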
Image grid vs. point cloud

Left: an image — a regular grid a convolution can slide over. Right: a point cloud — scattered 3D points, no grid, no order. Drag to rotate the cloud and feel its irregularity.

Why can’t a standard convolution process a point cloud directly?

Chapter 1: Three Challenges

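A point-cloud architecture must respect three properties of set-structured 3D data, or it will be wrong: the points are unordered (so the output must be permutation invariant), irregular (there is no grid to convolve over), and variable in number (the same network must handle clouds of any size).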

There’s a fourth, subtler property: transformation invariance. The same object rotated or translated is still the same object — ideally the model’s understanding shouldn’t change. The genius of PointNet is satisfying the first three (especially permutation invariance) with a strikingly simple idea, and approximating the fourth with a learned alignment. We’ll tackle permutation invariance first, because everything hinges on it.

Which is the defining challenge of point clouds that the architecture must satisfy?

Chapter 2: Permutation Invariance via Symmetric Functions

How do you build a function whose output doesn’t change when you shuffle its inputs? You use a symmetric function — one that ignores order by construction. Sum is symmetric: a+b+c = c+a+b. Max is symmetric: max(a,b,c) = max(c,a,b). Average too. These aggregate a set into one value regardless of order. That’s the key that unlocks point clouds.

PointNet’s recipe: first transform each point independently with a shared MLP (the same network applied to every point), turning each 3D point into a high-dimensional feature vector. Then aggregate all those per-point features with a symmetric function — specifically max pooling, taking the maximum across points for each feature dimension. The per-point step is order-agnostic (same MLP for all); the max-pool is symmetric. So the whole thing is permutation invariant: shuffle the points and the output is identical.
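The recipe can be sketched in NumPy as follows. This is a minimal illustration with random, untrained weights; the layer sizes and the helper names `shared_mlp` and `global_feature` are assumptions for this example, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_HIDDEN, D_OUT = 128, 16, 32          # hypothetical sizes
W1, b1 = rng.normal(size=(3, D_HIDDEN)), np.zeros(D_HIDDEN)
W2, b2 = rng.normal(size=(D_HIDDEN, D_OUT)), np.zeros(D_OUT)

def shared_mlp(points):
    """Apply the same two-layer MLP to every row: [N, 3] -> [N, D_OUT]."""
    hidden = np.maximum(points @ W1 + b1, 0.0)    # ReLU
    return hidden @ W2 + b2

def global_feature(points):
    """Max over the point axis, one maximum per feature dimension: [N, 3] -> [D_OUT]."""
    return shared_mlp(points).max(axis=0)

cloud = rng.normal(size=(N, 3))
shuffled = cloud[rng.permutation(N)]

# Shuffling the rows leaves the pooled feature unchanged.
print(np.allclose(global_feature(cloud), global_feature(shuffled)))

# The same weights also accept a cloud with a different number of points.
print(global_feature(cloud[:50]).shape)
```

Because the weights act on one point at a time and the max runs over the point axis, neither the order nor the number of rows is baked into the network.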

Why max, specifically: max pooling makes each output feature a “detector” that fires for whichever point most strongly exhibits that feature and ignores the rest. The network learns features such that the max over points captures the cloud’s shape (the points that “win” each feature act like a learned skeleton of critical points). It’s symmetric, simple, and surprisingly expressive: the PointNet paper shows that a max over per-point MLP features, followed by a further network, can approximate any continuous set function arbitrarily well, provided the feature dimension is large enough.
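In symbols, the approximation claim can be written roughly as below. Treat it as a sketch of the statement rather than the exact theorem: f is the target set function (assumed continuous), h is the shared per-point network, γ is a network applied after pooling, and ε is any chosen tolerance.

```latex
\left| \, f(\{x_1, \dots, x_n\}) \;-\; \gamma\!\left( \max_{i = 1, \dots, n} h(x_i) \right) \right| < \varepsilon
```

The max is taken separately in each feature dimension of h, and the guarantee relies on the output dimension of h being large enough.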
Shuffle the points — same answer

Each point gets a feature vector (via the shared MLP), then max-pool across points gives the global feature. Press shuffle: the point order scrambles, but the max-pooled global feature is unchanged. That’s permutation invariance.

How does PointNet achieve permutation invariance?

Chapter 3: PointNet, Assembled

Now the full architecture (Qi et al., 2017). Each point (x, y, z) goes through a shared MLP → a per-point feature. Max pool across all points → a single global feature vector summarizing the whole cloud. For classification, feed that global vector to a small classifier (“chair” vs “table”). For segmentation (labeling each point), concatenate the global feature back onto each per-point feature — so every point sees both itself and the whole-cloud context — then an MLP labels each point.
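The difference between the two heads comes down to shapes. In the sketch below the per-point features are random stand-ins for the shared MLP's output, and the sizes are arbitrary toy values chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 6, 4                                    # hypothetical toy sizes
per_point = rng.normal(size=(N, D))            # stand-in for the shared MLP's output
global_feat = per_point.max(axis=0)            # [D], the max-pooled summary

# Classification head input: just the global vector.
cls_input = global_feat                        # [D]

# Segmentation head input: repeat the global vector and append it to every point.
tiled = np.broadcast_to(global_feat, (N, D))   # [N, D], the same row N times
seg_input = np.concatenate([per_point, tiled], axis=1)   # [N, 2D]

print(cls_input.shape, seg_input.shape)        # (4,) (6, 8)
```

Every row of `seg_input` carries the point's own feature in its first D columns and the whole-cloud summary in its last D columns, which is what lets a per-point MLP label each point with global context.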

PointNet also adds a small T-Net: a mini-network that predicts a transformation matrix to align the input cloud (and features) to a canonical pose before processing — an attempt at the transformation-invariance property. The whole thing is elegant: a shared MLP, a max-pool, and a head. It works directly on raw points, handles any number of them, and is permutation invariant by construction.

N points (x, y, z), an unordered set
↓ shared MLP (same for every point)
N per-point features [N × D]
↓ max pool (symmetric)
global feature [D]
→ classify, or concat back per-point → segment

For per-point segmentation, what does PointNet do with the global feature?

Chapter 4: Local Structure — PointNet++

PointNet has a weakness: it goes straight from per-point features to a single global max-pool, so it captures global shape but misses local geometry — the fine arrangement of nearby points (an edge, a corner, a surface patch). A CNN succeeds partly because it builds up local features hierarchically; PointNet has no such locality. PointNet++ (2017) fixes this by applying PointNet locally and repeatedly.

The recipe: sample a subset of points as local centers (PointNet++ uses farthest point sampling, which spreads the centers evenly over the cloud); group the neighbors around each center (points within a radius); run a small PointNet on each local group to get a feature for that neighborhood; then repeat on the reduced set of centers. This is the same hierarchical, growing-receptive-field idea as a CNN, where local patterns combine into larger ones, but for points. Early levels capture fine local geometry; later levels capture coarse global structure. It handles varying density too (with multi-scale grouping). By giving point networks the local awareness they lacked, PointNet++ improved substantially on PointNet’s accuracy.
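One level of sample, group, and summarize might look like the sketch below. It is deliberately simplified and rests on several assumptions: centers are picked by greedy farthest point sampling, neighbors are found by brute force, group sizes are not capped, and a single random linear layer plus ReLU stands in for the mini-PointNet. The function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def farthest_point_sampling(points, n_centers):
    """Greedily pick centers, each one the point farthest from those already chosen."""
    chosen = [0]                                         # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)    # distance to nearest chosen center
    for _ in range(n_centers - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, centers, radius):
    """For each center, return the indices of all points within `radius` of it."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)   # [M, N]
    return [np.flatnonzero(row <= radius) for row in d]

def set_abstraction(points, feats, n_centers, radius, W):
    """One level: sample, group, then a shared layer + max over each group."""
    centers = points[farthest_point_sampling(points, n_centers)]
    out = []
    for c, idx in zip(centers, ball_query(points, centers, radius)):
        local_xyz = points[idx] - c                      # coordinates relative to the center
        local_in = np.concatenate([local_xyz, feats[idx]], axis=1)
        out.append(np.maximum(local_in @ W, 0.0).max(axis=0))
    return centers, np.stack(out)

points = rng.uniform(-1, 1, size=(256, 3))
feats = np.zeros((256, 0))                               # no input features beyond xyz
W1 = rng.normal(size=(3, 8))                             # level 1: xyz -> 8 features
W2 = rng.normal(size=(3 + 8, 16))                        # level 2: xyz + 8 -> 16 features

c1, f1 = set_abstraction(points, feats, n_centers=32, radius=0.5, W=W1)
c2, f2 = set_abstraction(c1, f1, n_centers=8, radius=1.0, W=W2)
print(c1.shape, f1.shape, c2.shape, f2.shape)            # (32, 3) (32, 8) (8, 3) (8, 16)
```

The second call treats the 32 centers from the first level as its input cloud, so its groups cover a larger region of the original shape with fewer, richer features.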

Sample & group: local PointNets

Centers are sampled (large dots); neighbors within a radius are grouped (rings); a mini-PointNet summarizes each group, then it repeats on the centers. Drag the radius — small captures fine local detail, large captures coarser structure.

What does PointNet++ add over PointNet?

Chapter 5: Point Transformers

The natural next step: replace the local max-pool with attention. A point transformer applies self-attention within each point’s local neighborhood — each point attends to its nearby points, weighting them by learned relevance and relative position. This is more expressive than max-pooling a group: instead of just taking the strongest feature, the point gathers a learned, position-aware combination of its neighbors.

Attention fits point clouds naturally because it is a weighted set operation: the weighted sum a point computes over its neighbors does not depend on the order in which they are listed, and shuffling the whole cloud simply shuffles the outputs the same way (permutation equivariance). Incorporating relative position (the offset between a point and its neighbor) gives the geometric awareness that pure attention lacks. Point transformers and related sparse 3D attention methods are among the strongest approaches to 3D understanding. They follow the same “attention over a set with positional info” pattern you saw in the Perceiver and in standard transformers, specialized to local 3D neighborhoods.
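A simplified version of local attention with relative position can be sketched as follows. This is not the exact formulation of any published point transformer (some, for example, use per-channel attention weights rather than one scalar per neighbor). It uses plain scaled dot-product attention over k nearest neighbors, with random matrices `Wq`, `Wk`, `Wv`, and `Wpos` standing in for learned parameters; all names and sizes are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 64, 8, 6                                   # hypothetical sizes
xyz = rng.uniform(-1, 1, size=(N, 3))
feats = rng.normal(size=(N, D))

# Stand-ins for learned parameters (random here).
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
Wpos = rng.normal(size=(3, D))                       # maps a relative offset to a D-vector

def knn(xyz, k):
    """Indices of the k nearest neighbors of every point (the point itself included)."""
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]              # [N, k]

def local_attention(xyz, feats, k):
    nbr = knn(xyz, k)                                # [N, k]
    rel = xyz[nbr] - xyz[:, None, :]                 # [N, k, 3] offset from query to neighbor
    pos = rel @ Wpos                                 # [N, k, D] relative-position encoding
    q = feats @ Wq                                   # [N, D]
    key = feats[nbr] @ Wk + pos                      # [N, k, D]
    val = feats[nbr] @ Wv + pos                      # [N, k, D]
    scores = np.einsum("nd,nkd->nk", q, key) / np.sqrt(D)
    scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over the k neighbors
    return np.einsum("nk,nkd->nd", weights, val), weights

out, weights = local_attention(xyz, feats, K)

# Shuffling the cloud shuffles the outputs in the same way.
perm = rng.permutation(N)
out_perm, _ = local_attention(xyz[perm], feats[perm], K)
print(np.allclose(out_perm, out[perm]))
```

Only offsets between points enter the computation, never absolute list positions, which is why reordering the input merely reorders the output rows.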

The throughline: PointNet’s max-pool, PointNet++’s local grouping, and point transformers’ local attention are increasingly expressive ways to do the same thing — aggregate a neighborhood of points in a permutation-invariant, position-aware way. Max is the simplest symmetric aggregator; attention is the richest.
Local attention on a neighborhood

Click a point: it attends to its neighbors, weighting each by learned relevance and relative position (line thickness = weight). Richer than just taking the max — a position-aware blend of the neighborhood.

Why is attention a natural fit for point clouds?

Chapter 6: Registration — aligning two clouds (ICP)

A different but essential point-cloud task: registration — finding the rigid transformation (rotation + translation) that aligns one cloud to another. This is how a robot stitches consecutive LiDAR scans into a map, or matches a scan to a known model. The classic algorithm is Iterative Closest Point (ICP), and it’s beautifully simple.

ICP iterates two steps until convergence:

1. Correspondence: for each point in cloud A, find its closest point in cloud B (a nearest-neighbor lookup). This is a guess at which points match.
2. Alignment: compute the single rigid transform that best maps A’s points onto their guessed matches in B (a closed-form least-squares solve), and apply it to A.

Now A is closer to B, so the closest-point guesses improve, and the loop repeats. Each iteration tightens the fit. It’s a chicken-and-egg loop (correspondences need alignment, alignment needs correspondences) solved by alternating, like a geometric EM algorithm.
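The two steps fit in a short NumPy sketch. Some choices here are assumptions rather than anything the text prescribes: the closed-form solve uses the SVD-based (Kabsch) method, nearest neighbors are found by brute force, the loop runs a fixed number of iterations instead of testing convergence, and the toy data is a random cloud with a small known misalignment.

```python
import numpy as np

def best_rigid_transform(A, B):
    """Least-squares R, t such that R @ a + t is close to b for paired rows (SVD/Kabsch)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cb - R @ ca

def icp(source, target, n_iters=30):
    """Alternate closest-point matching and rigid alignment; returns the moved source."""
    src = source.copy()
    for _ in range(n_iters):
        # 1. Correspondence: nearest target point for each source point (brute force).
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        matches = target[d.argmin(axis=1)]
        # 2. Alignment: closed-form rigid transform onto the guessed matches.
        R, t = best_rigid_transform(src, matches)
        src = src @ R.T + t
    return src

def rms(X, Y):
    """Root-mean-square distance between paired rows of X and Y."""
    return np.sqrt(((X - Y) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, size=(300, 3))

angle = np.deg2rad(5)                                # small initial misalignment
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
shift = np.array([0.02, -0.01, 0.01])
source = target @ Rz.T + shift                       # row i of source matches row i of target

aligned = icp(source, target)
print(f"RMS error before: {rms(source, target):.4f}, after: {rms(aligned, target):.6f}")
```

The error is measured against the true pairing, which is known here only because the source was built from the target. Real scans have no such ground truth, and a practical implementation would use a spatial index such as a k-d tree for the nearest-neighbor step.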

Common pitfall: ICP only converges to the local minimum nearest its starting pose, so it needs a decent initial guess. Start the clouds too far apart or too rotated and it converges to the wrong answer. In practice it’s seeded by a coarse global method (or odometry), then ICP refines. It’s a workhorse of LiDAR SLAM and 3D scanning, and learned variants aim to improve robustness.
ICP: closest point → transform → repeat

Two misaligned clouds (teal = target, orange = source). Press Step: each iteration finds closest-point matches (gray lines), solves the best rigid transform, and moves the source closer. Watch them converge.

What two steps does ICP alternate?

Chapter 7: Aligning Scans, Live (showcase)

The interactive payoff: a full ICP registration. Set how far apart and rotated the two clouds start, add sensor noise, then run ICP and watch it converge — or fail, if you start it too far off. This is the core of LiDAR odometry, the front-end of how robots build maps from point clouds.

ICP registration sandbox

Set the initial misalignment and noise, then press Run ICP. The orange source cloud iteratively aligns to the teal target; the readout shows the alignment error falling. Push the misalignment too far and watch ICP converge to a wrong local minimum — why a good initial guess matters.


When the start is reasonable, ICP snaps the clouds together in a handful of iterations and the error plunges. When it’s too far off, it locks onto wrong correspondences and settles into a bad alignment — the local-minimum failure that motivates global initialization and learned registration.

Chapter 8: The 3D Landscape

Point clouds are one of several 3D representations, each with trade-offs:

| Representation | What it is | Pros / Cons |
|---|---|---|
| Point cloud | set of 3D points | direct from sensors, sparse-efficient; needs set-based nets |
| Voxel grid | 3D pixels | CNN-friendly; memory-heavy, mostly empty |
| Mesh | vertices + faces | surfaces, graphics-ready; irregular connectivity |
| Implicit / NeRF / 3DGS | a function / splats | continuous, photorealistic; different math |

Point-based deep learning powers 3D object detection for autonomous driving (finding cars and pedestrians in LiDAR), semantic segmentation of scenes, robotic grasping, and registration/SLAM. Much recent work on 3D understanding uses sparse, attention-based point/voxel hybrids, while NeRF and 3D Gaussian Splatting (separate lessons) handle photorealistic 3D reconstruction. Knowing which representation fits the task (points for sensor data and understanding, meshes for graphics, implicit for rendering) is half the battle in 3D.

Choosing a 3D representation

The same object as a point cloud, a voxel grid, and a mesh. Drag to morph between them and see the trade-off: points are sparse/efficient, voxels are grid-regular but bloated, meshes are surface-accurate.

When is a point cloud the natural representation to use?

Chapter 9: Cheat Sheet & Connections

problem: point clouds are unordered, irregular, variable-size sets, with no grid for convolutions
↓ symmetric aggregation
PointNet: shared per-point MLP + max-pool → permutation-invariant global feature
↓ add locality
PointNet++ / point transformers: hierarchical local grouping / local attention → local geometry
↓ align two clouds
ICP registration: closest-point correspondences ↔ rigid transform, iterated

Keep exploring

Pooling & Aggregation — symmetric functions in depth
Perceiver — attention over unordered sets, generalized
Classical SLAM / Modern SLAM — where ICP registration lives
NeRF & 3D Gaussian Splatting — the photorealistic 3D side

“What I cannot create, I do not understand.” You just rebuilt deep learning on 3D sets: respect that points are unordered by aggregating them with a symmetric function (max), add locality with hierarchical grouping or attention, and align two clouds by iterating closest-point matches and rigid transforms. No grid required.