LiDAR and depth sensors see the world as an unordered cloud of 3D points: no grid, no order, no fixed size. How PointNet, PointNet++, and point transformers learn from sets of points, where convolutions cannot go, and how ICP aligns one cloud to another.
An image is a tidy grid: pixels in neat rows and columns, every one with a fixed neighbor. That regular structure is exactly what a convolution exploits — slide a small kernel over the grid. But a point cloud — what a LiDAR or depth camera produces — is nothing like that. It’s a set of 3D points: a list of (x, y, z) coordinates (sometimes with extra features like color or intensity), scattered through space with no grid, no order, and no fixed count. A self-driving car’s LiDAR might return 100,000 such points per sweep.
You cannot run a convolution on this. There’s no grid to slide over, no notion of “the pixel to the right.” And you cannot just flatten the points into a vector and feed an MLP, because the points have no canonical order — the same shape can be listed in any of N-factorial orderings, and the network must give the same answer for all of them. Learning on point clouds needs architectures built for unordered sets. That’s what PointNet and its descendants provide, and what this lesson builds.
Left: an image — a regular grid a convolution can slide over. Right: a point cloud — scattered 3D points, no grid, no order. Drag to rotate the cloud and feel its irregularity.
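A point-cloud architecture must respect three properties of set-structured 3D data, or it will be wrong: the points sit on no grid, so there is no fixed neighbor structure to convolve over; they have no canonical order, so the output must be permutation invariant; and their count is not fixed, so the model must accept any number of points.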
There’s a fourth, subtler property: transformation invariance. The same object rotated or translated is still the same object — ideally the model’s understanding shouldn’t change. The genius of PointNet is satisfying the first three (especially permutation invariance) with a strikingly simple idea, and approximating the fourth with a learned alignment. We’ll tackle permutation invariance first, because everything hinges on it.
How do you build a function whose output doesn’t change when you shuffle its inputs? You use a symmetric function — one that ignores order by construction. Sum is symmetric: a+b+c = c+a+b. Max is symmetric: max(a,b,c) = max(c,a,b). Average too. These aggregate a set into one value regardless of order. That’s the key that unlocks point clouds.
PointNet’s recipe: first transform each point independently with a shared MLP (the same network applied to every point), turning each 3D point into a high-dimensional feature vector. Then aggregate all those per-point features with a symmetric function — specifically max pooling, taking the maximum across points for each feature dimension. The per-point step is order-agnostic (same MLP for all); the max-pool is symmetric. So the whole thing is permutation invariant: shuffle the points and the output is identical.
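The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the published architecture: the layer sizes (3 → 16 → 32), the random untrained weights, and the names `shared_mlp` and `global_feature` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes (3 -> 16 -> 32) with random, untrained weights.
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 32)), np.zeros(32)

def shared_mlp(points):
    """Apply the same two-layer MLP to every point: (N, 3) -> (N, 32)."""
    hidden = np.maximum(points @ W1 + b1, 0.0)  # ReLU
    return np.maximum(hidden @ W2 + b2, 0.0)

def global_feature(points):
    """Max-pool each feature dimension across points: (N, 3) -> (32,)."""
    return shared_mlp(points).max(axis=0)

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]

# Same set of points, different order: the global feature matches
# (allclose tolerates floating-point rounding).
print(np.allclose(global_feature(cloud), global_feature(shuffled)))

# Contrast: flattening the cloud into one long vector ties each weight
# to a position in the list, so shuffling changes the result.
w_flat = rng.normal(size=cloud.size)
print(np.isclose(cloud.ravel() @ w_flat, shuffled.ravel() @ w_flat))
```

Because the pooling runs over the point axis, `global_feature` also accepts clouds of any size; `global_feature(cloud[:40])` returns a vector of the same length.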
Each point gets a feature vector (via the shared MLP), then max-pool across points gives the global feature. Press shuffle: the point order scrambles, but the max-pooled global feature is unchanged. That’s permutation invariance.
Now the full architecture (Qi et al., 2017). Each point (x, y, z) goes through a shared MLP → a per-point feature. Max pool across all points → a single global feature vector summarizing the whole cloud. For classification, feed that global vector to a small classifier (“chair” vs “table”). For segmentation (labeling each point), concatenate the global feature back onto each per-point feature — so every point sees both itself and the whole-cloud context — then an MLP labels each point.
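To make the two heads concrete, here is a rough shape-level sketch. The per-point features are random stand-ins for the shared MLP's output, each head is reduced to a single untrained linear layer, and the sizes (`N`, `F`, `NUM_CLASSES`, `NUM_PARTS`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, NUM_CLASSES, NUM_PARTS = 128, 32, 4, 3   # hypothetical sizes

per_point = rng.normal(size=(N, F))   # stand-in for shared-MLP output
global_feat = per_point.max(axis=0)   # (F,) summary of the whole cloud

# Classification head: one set of logits for the entire cloud.
W_cls = rng.normal(size=(F, NUM_CLASSES))
class_logits = global_feat @ W_cls                       # (NUM_CLASSES,)

# Segmentation head: copy the global feature onto every point,
# so each row holds [own feature, whole-cloud feature].
tiled = np.broadcast_to(global_feat, (N, F))
seg_input = np.concatenate([per_point, tiled], axis=1)   # (N, 2F)
W_seg = rng.normal(size=(2 * F, NUM_PARTS))
point_logits = seg_input @ W_seg                         # (N, NUM_PARTS)
point_labels = point_logits.argmax(axis=1)               # one label per point

print(class_logits.shape, seg_input.shape, point_labels.shape)
```

The concatenation is what lets a per-point classifier use context: without it, two points with identical local features would always receive the same label, whatever object they belong to.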
PointNet also adds a small T-Net: a mini-network that predicts a transformation matrix to align the input cloud (and features) to a canonical pose before processing — an attempt at the transformation-invariance property. The whole thing is elegant: a shared MLP, a max-pool, and a head. It works directly on raw points, handles any number of them, and is permutation invariant by construction.
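As a rough sketch of the input T-Net idea: a mini-PointNet pools the cloud into one vector, a final layer maps it to nine numbers, and those are reshaped into a 3×3 matrix applied to every point. The layer sizes, the random weights, and the "identity plus a small predicted offset" parameterization are assumptions of this example rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(3, 64))          # shared per-point layer (untrained)
W_out = rng.normal(size=(64, 9)) * 0.01    # small weights: output stays near identity

def tnet(points):
    """Predict a 3x3 alignment matrix from the whole cloud: (N, 3) -> (3, 3)."""
    pooled = np.maximum(points @ W_feat, 0.0).max(axis=0)   # mini-PointNet
    return np.eye(3) + (pooled @ W_out).reshape(3, 3)       # identity + offset

cloud = rng.normal(size=(256, 3))
T = tnet(cloud)
aligned = cloud @ T    # the same matrix is applied to every point

print(T.shape, aligned.shape)
```

Since the matrix is computed from a max-pooled feature, it is itself independent of point order, so adding the T-Net does not break permutation invariance.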
PointNet has a weakness: it goes straight from per-point features to a single global max-pool, so it captures global shape but misses local geometry — the fine arrangement of nearby points (an edge, a corner, a surface patch). A CNN succeeds partly because it builds up local features hierarchically; PointNet has no such locality. PointNet++ (2017) fixes this by applying PointNet locally and repeatedly.
The recipe: sample a subset of points as local centers (PointNet++ uses farthest point sampling, so the centers cover the cloud evenly); group the neighbors around each center (points within a radius); run a small PointNet on each local group to get a feature for that neighborhood; then repeat on the reduced set of centers. This is the point-cloud analogue of a CNN's hierarchical, growing receptive field: local patterns combine into larger ones. Early levels capture fine local geometry; later levels capture coarse global structure. Multi-scale grouping, which pools each center's neighborhood at several radii, lets it cope with varying point density. By giving point networks the local awareness they lacked, PointNet++ improved on PointNet's accuracy in both classification and segmentation.
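One sample-group-pool level might look like the sketch below. It uses farthest point sampling for the centers (the strategy used in the PointNet++ paper) and a brute-force radius search for grouping. The mini-PointNet is reduced to a single untrained linear layer plus ReLU, and the function names, sizes, and radii are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, num_centers):
    """Greedily pick indices of points far from those already chosen."""
    chosen = [0]  # arbitrary starting point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(num_centers - 1):
        nxt = int(dist.argmax())          # farthest from every chosen center
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, centers, radius):
    """For each center, indices of the points within `radius` of it."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)  # (M, N)
    return [np.flatnonzero(row <= radius) for row in d]

def set_abstraction(points, feats, W, num_centers, radius):
    """One level: sample centers, group neighbors, pool each group."""
    centers = points[farthest_point_sampling(points, num_centers)]
    groups = ball_query(points, centers, radius)
    pooled = []
    for idx, c in zip(groups, centers):
        # Coordinates relative to the center, joined with incoming features.
        local = np.concatenate([points[idx] - c, feats[idx]], axis=1)
        pooled.append(np.maximum(local @ W, 0.0).max(axis=0))  # mini-PointNet
    return centers, np.stack(pooled)

rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 3))
f0 = rng.normal(size=(200, 8))
W_a = rng.normal(size=(3 + 8, 16))    # untrained weights, level 1
W_b = rng.normal(size=(3 + 16, 32))   # untrained weights, level 2

c1, f1 = set_abstraction(pts, f0, W_a, num_centers=16, radius=0.3)
c2, f2 = set_abstraction(c1, f1, W_b, num_centers=4, radius=0.6)
print(f1.shape, f2.shape)
```

The second call takes the first level's centers and features as its input, which is the "repeat on the reduced set" step: fewer points, larger radius, wider features.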
Centers are sampled (large dots); neighbors within a radius are grouped (rings); a mini-PointNet summarizes each group, then it repeats on the centers. Drag the radius — small captures fine local detail, large captures coarser structure.
The natural next step: replace the local max-pool with attention. A point transformer applies self-attention within each point’s local neighborhood — each point attends to its nearby points, weighting them by learned relevance and relative position. This is more expressive than max-pooling a group: instead of just taking the strongest feature, the point gathers a learned, position-aware combination of its neighbors.
Attention fits point clouds naturally because it is a weighted set operation: a point's output does not depend on the order in which its neighbors are listed, and reordering the points simply reorders the outputs in the same way (permutation equivariance). Incorporating relative position (the offset between a point and its neighbor) supplies the geometric awareness that pure attention lacks. Point transformers and related sparse 3D attention methods are among the strongest current approaches to 3D understanding. They follow the same “attention over a set with positional info” pattern you saw in the Perceiver and in standard transformers, specialized to local 3D neighborhoods.
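A simplified sketch of neighborhood attention with relative positions follows. It uses ordinary scalar dot-product attention over each point's k nearest neighbors and a single linear map of the offset as the positional term. Published point transformer layers differ in detail, so treat the structure, the weight names, and the sizes here as illustrative assumptions.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of each point's k nearest points (itself included): (N, k)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]

def local_attention(points, feats, Wq, Wk, Wv, Wpos, k=8):
    nbrs = knn_indices(points, k)                 # (N, k)
    q = feats @ Wq                                # (N, D)
    key = (feats @ Wk)[nbrs]                      # (N, k, D)
    val = (feats @ Wv)[nbrs]                      # (N, k, D)
    rel = points[nbrs] - points[:, None, :]       # (N, k, 3) neighbor offsets
    pos = rel @ Wpos                              # (N, k, D) position encoding
    scores = np.einsum("nd,nkd->nk", q, key + pos) / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)          # weights over each neighborhood
    out = np.einsum("nk,nkd->nd", w, val + pos)   # position-aware blend
    return out, w

rng = np.random.default_rng(0)
N, F, D = 64, 16, 16                              # hypothetical sizes
points = rng.normal(size=(N, 3))
feats = rng.normal(size=(N, F))
Wq, Wk, Wv = (rng.normal(size=(F, D)) * 0.1 for _ in range(3))
Wpos = rng.normal(size=(3, D)) * 0.1

out, w = local_attention(points, feats, Wq, Wk, Wv, Wpos)
print(out.shape, w.shape)
```

Only offsets between points enter the computation, never absolute coordinates, so translating the whole cloud leaves the output unchanged; shuffling the input rows shuffles the output rows in the same way.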
Click a point: it attends to its neighbors, weighting each by learned relevance and relative position (line thickness = weight). Richer than just taking the max — a position-aware blend of the neighborhood.
A different but essential point-cloud task: registration — finding the rigid transformation (rotation + translation) that aligns one cloud to another. This is how a robot stitches consecutive LiDAR scans into a map, or matches a scan to a known model. The classic algorithm is Iterative Closest Point (ICP), and it’s beautifully simple.
ICP iterates two steps until convergence:
1. Correspondence: for each point in cloud A, find its closest point in cloud B (a nearest-neighbor lookup). This is a guess at which points match.
2. Alignment: compute the single rigid transform that best maps A’s points onto their guessed matches in B (a closed-form least-squares solve), and apply it to A.

Now A is closer to B, so the closest-point guesses improve, and the loop repeats. Each iteration tightens the fit. It’s a chicken-and-egg loop (correspondences need alignment, alignment needs correspondences) solved by alternating, like a geometric EM algorithm.
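The two steps can be sketched as follows. The closed-form solve shown is the SVD-based (Kabsch) solution, one standard way to do the least-squares step. The nearest-neighbor search is brute force for clarity; practical implementations usually use a spatial index such as a k-d tree. The test cloud is deliberately easy: a jittered grid whose points are spaced far apart relative to the misalignment, so the very first closest-point guesses are already correct.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t ~ dst[i] (Kabsch / SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # forbid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def icp(source, target, iterations=10):
    src, errors = source.copy(), []
    for _ in range(iterations):
        # 1. Correspondence: closest target point for every source point.
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        errors.append(float(np.sqrt((d.min(axis=1) ** 2).mean())))  # RMS distance
        # 2. Alignment: best rigid transform onto the guessed matches.
        R, t = best_rigid_transform(src, target[nearest])
        src = src @ R.T + t
    return src, errors

rng = np.random.default_rng(0)
grid = np.mgrid[-1:2, -1:2, -1:2].reshape(3, -1).T.astype(float)   # 27 points
target = grid + rng.uniform(-0.2, 0.2, size=grid.shape)

theta = np.deg2rad(3.0)                        # small, hypothetical misalignment
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.03, -0.02, 0.04])
source = target @ R_true.T + t_true

aligned, errors = icp(source, target)
print("initial RMS error:", errors[0], "final RMS error:", errors[-1])
```

Neither step can increase the RMS closest-point distance, which is why the error never rises from one iteration to the next. That guarantee says nothing about reaching the right answer: with a larger initial rotation the first correspondences can be wrong, and the loop settles into whatever alignment those wrong matches imply.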
Two misaligned clouds (teal = target, orange = source). Press Step: each iteration finds closest-point matches (gray lines), solves the best rigid transform, and moves the source closer. Watch them converge.
The interactive payoff: a full ICP registration. Set how far apart and rotated the two clouds start, add sensor noise, then run ICP and watch it converge — or fail, if you start it too far off. This is the core of LiDAR odometry, the front-end of how robots build maps from point clouds.
Set the initial misalignment and noise, then press Run ICP. The orange source cloud iteratively aligns to the teal target; the readout shows the alignment error falling. Push the misalignment too far and watch ICP converge to a wrong local minimum — why a good initial guess matters.
When the start is reasonable, ICP snaps the clouds together in a handful of iterations and the error plunges. When it’s too far off, it locks onto wrong correspondences and settles into a bad alignment — the local-minimum failure that motivates global initialization and learned registration.
Point clouds are one of several 3D representations, each with trade-offs:
| Representation | What it is | Pros / Cons |
|---|---|---|
| Point cloud | set of 3D points | direct from sensors, sparse-efficient; needs set-based nets |
| Voxel grid | 3D pixels | CNN-friendly; memory-heavy, mostly empty |
| Mesh | vertices + faces | surfaces, graphics-ready; irregular connectivity |
| Implicit / NeRF / 3DGS | a function / splats | continuous, photorealistic; different math |
Point-based deep learning powers 3D object detection for autonomous driving (finding cars and pedestrians in LiDAR), semantic segmentation of scenes, robotic grasping, and registration/SLAM. Much recent work on 3D understanding uses sparse, attention-based point/voxel hybrids, while NeRF and 3D Gaussian Splatting (separate lessons) handle photorealistic 3D reconstruction. Knowing which representation fits the task (points for sensor data and understanding, meshes for graphics, implicit for rendering) is half the battle in 3D.
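The "memory-heavy, mostly empty" cost of voxel grids in the table above is easy to see with a small experiment. This sketch converts a cloud sampled on a sphere's surface (a stand-in for the surfaces a sensor sees) into a dense occupancy grid; the `voxelize` helper, the point count, and the voxel size are hypothetical choices.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Dense boolean occupancy grid over the cloud's bounding box."""
    idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
    grid = np.zeros(tuple(idx.max(axis=0) + 1), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True   # mark cells containing a point
    return grid

rng = np.random.default_rng(0)
surface = rng.normal(size=(2000, 3))
surface /= np.linalg.norm(surface, axis=1, keepdims=True)   # points on a unit sphere

grid = voxelize(surface, voxel_size=0.05)
print("grid shape:", grid.shape)
print("cells:", grid.size, "occupied:", int(grid.sum()))
print("occupied fraction:", grid.mean())
```

The dense grid allocates every cell in the bounding box, but at most one cell per point can be occupied, so the occupied fraction shrinks as the resolution grows. Sparse voxel methods avoid the waste by storing only the occupied cells.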
The same object as a point cloud, a voxel grid, and a mesh. Drag to morph between them and see the trade-off: points are sparse/efficient, voxels are grid-regular but bloated, meshes are surface-accurate.
→ Pooling & Aggregation — symmetric functions in depth
→ Perceiver — attention over unordered sets, generalized
→ Classical SLAM / Modern SLAM — where ICP registration lives
→ NeRF & 3D Gaussian Splatting — the photorealistic 3D side