Kolmogorov–Arnold Networks — flip the MLP inside out. Move the learnable parts from the nodes to the edges: replace every weight with a learnable 1-D function, and let the nodes just add. The result is a network you can actually read.
Train an MLP and you get millions of numbers — the weights. Each one is a multiplier on a connection. Stare at any single weight and it tells you nothing; meaning is smeared across all of them at once. The network works, but you cannot read it. For a chatbot, fine. For discovering the law a physical system obeys, that opacity is a wall.
KANs (Liu et al., 2024) ask: what if a network’s learnable parts were functions you could look at, not opaque scalars? What if you could point at a connection and see “ah, this one learned a sine wave; that one learned a square”? That is the promise — and it comes from one structural change to the MLP, which the next chapters build up carefully.
Left: an MLP’s edges are numbers — meaningless alone. Right: a KAN’s edges are little curves — each one a learned 1-D function you can recognize. Toggle to compare what “looking inside” gives you.
To appreciate the flip, pin down exactly what an MLP neuron does. It takes each incoming value, multiplies it by a learnable weight (the weight lives on the edge), sums all those products, adds a bias, and then passes the sum through a fixed activation function — ReLU, say — that lives on the node.
Two things to notice. The learnable parts are the weights, and they’re on the edges. The nonlinearity is fixed (you chose ReLU; the network can’t change its shape), and it’s on the node. This split, learnable linear weights on edges and fixed nonlinearity on nodes, is the DNA of every MLP. Its theoretical license is the universal approximation theorem: with enough such neurons, an MLP can approximate any continuous function on a closed, bounded domain to any desired accuracy.
Inputs scaled by edge weights (drag them), summed at the node, then squashed by a fixed activation. The weights learn; the activation’s shape never does.
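The neuron described above fits in a few lines. This is a minimal NumPy sketch; the input values, weights, and bias are made up for illustration.

```python
import numpy as np

def relu(z):
    # Fixed activation on the node: its shape never changes during training.
    return np.maximum(z, 0.0)

def mlp_neuron(x, w, b):
    """One MLP neuron: learnable weights on the edges, fixed ReLU on the node."""
    z = np.dot(w, x) + b   # multiply each input by its edge weight, sum, add bias
    return relu(z)

x = np.array([1.0, -2.0, 0.5])   # incoming values (illustrative)
w = np.array([0.7, 0.1, -0.4])   # one learnable number per edge (illustrative)
b = 0.2
y = mlp_neuron(x, w, b)
```

Training would change only `w` and `b`; `relu` stays exactly as written.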
Here is the entire idea of a KAN in one move. Swap what’s learnable and what’s fixed; swap the roles of edges and nodes. Put a learnable 1-D function on every edge, and make every node just sum its inputs — no fixed activation at all. The nonlinearity is now learned and lives on the edges; the nodes are dumb adders.
| | edges | nodes |
|---|---|---|
| MLP | learnable numbers (weights) | fixed nonlinearity (ReLU) |
| KAN | learnable functions (splines) | just sum |
So where an MLP edge does “multiply by 0.7,” a KAN edge does “apply this whole curve to the input.” The curve might be a sine, a parabola, a step — whatever the data demands. The node downstream simply adds up the curve-outputs.
The theoretical license is different too. MLPs lean on universal approximation. KANs lean on the Kolmogorov–Arnold representation theorem, which says any continuous function of several variables on a bounded domain can be written as a finite sum of compositions of continuous single-variable functions. That is the shape of a KAN: 1-D functions on edges, summed at nodes, composed across layers. The theorem itself describes a fixed two-layer form; a KAN takes that form literally and then generalizes it by stacking layers of arbitrary width and depth.
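For reference, the standard statement of the theorem for a continuous function $f$ on $[0,1]^n$ is shown below. The symbols $\Phi_q$ (outer functions) and $\varphi_{q,p}$ (inner functions) are the usual textbook notation, not names used elsewhere in this piece.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right)
```

Read from the inside out: each input $x_p$ passes through a 1-D function $\varphi_{q,p}$, the results are summed, and each sum passes through another 1-D function $\Phi_q$ before a final sum. Every operation is either a single-variable function or an addition.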
Same two-input, one-output unit, drawn both ways. Toggle: in the MLP the edges are weights and the node squashes; in the KAN the edges are curves and the node only sums.
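The two-input, one-output KAN unit can be sketched as follows. Here `np.sin` and `np.square` are stand-ins for curves a trained network might have learned; in a real KAN each edge function would be a trainable spline.

```python
import numpy as np

def kan_unit(x, edge_functions):
    """One KAN node: every edge applies its own 1-D function, the node only sums."""
    return sum(phi(xi) for phi, xi in zip(edge_functions, x))

# Stand-ins for learned edge curves (assumption: fixed functions, not splines).
edge_functions = [np.sin, np.square]
y = kan_unit([np.pi / 2, 3.0], edge_functions)   # sin(pi/2) + 3**2
```

There is no activation after the sum; all the nonlinearity sits in `edge_functions`.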
“A learnable 1-D function” sounds vague — how do you store and train a curve? KANs use B-splines. The trick: lay down a row of fixed, overlapping bump-shaped basis functions across the input range, give each bump a learnable height (a coefficient), and add them up. The sum is a smooth curve, and its shape is controlled entirely by those heights — which are the parameters gradient descent adjusts.
Each bump is local — it only affects a small piece of the curve — so raising one coefficient bends the curve in just that region, leaving the rest alone. That locality is what makes splines so controllable. KANs also add a small fixed base function (a SiLU) to each edge so the curve has a sensible default and trains stably. The number of bumps, set by the grid, controls how wiggly a curve the edge can represent.
Faint curves are the fixed basis bumps; the bold teal curve is their weighted sum — the learned edge function. Drag a coefficient and watch only its local region bend. This is what training a KAN edge actually adjusts.
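To make the bumps concrete, here is a sketch of a B-spline edge function built with the Cox-de Boor recursion. The grid size `G = 5`, the order `k = 3`, the range `[-1, 1]`, and the helper names are all assumptions chosen for illustration, and the base term is simplified to a single `base_weight` multiplying SiLU.

```python
import numpy as np

def bspline_basis(x, knots, k):
    """All order-k B-spline bumps evaluated at x; shape (len(x), len(knots) - k - 1)."""
    x = np.asarray(x, dtype=float)[:, None]
    t = np.asarray(knots, dtype=float)
    # Order 0: indicator of each knot interval [t_i, t_{i+1}).
    B = ((x >= t[:-1]) & (x < t[1:])).astype(float)
    # Cox-de Boor recursion: raise the order one step at a time.
    for d in range(1, k + 1):
        left = (x - t[:-(d + 1)]) / (t[d:-1] - t[:-(d + 1)])
        right = (t[d + 1:] - x) / (t[d + 1:] - t[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def silu(x):
    return x / (1.0 + np.exp(-x))

def edge_function(x, coef, knots, k, base_weight=1.0):
    """Edge curve = fixed-shape SiLU base + weighted sum of bumps."""
    x = np.asarray(x, dtype=float)
    return base_weight * silu(x) + bspline_basis(x, knots, k) @ coef

G, k = 5, 3                                   # grid intervals, spline order
h = 2.0 / G
knots = -1.0 + h * np.arange(-k, G + k + 1)   # uniform grid padded by k knots per side
x = np.linspace(-1.0, 1.0, 201)

coef = np.zeros(G + k)                        # one learnable height per bump
before = edge_function(x, coef, knots, k)
coef[2] += 1.0                                # raise a single coefficient
after = edge_function(x, coef, knots, k)
changed = x[np.abs(after - before) > 1e-12]   # where the curve actually moved
```

With these settings there are `G + k = 8` bumps, and `changed` covers only the left part of the range: raising one coefficient bends the curve inside that bump's support and nowhere else.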
Now assemble a layer. Say it has 3 inputs and 2 outputs. In an MLP that’s a 3×2 weight matrix — 6 numbers. In a KAN it’s 6 learnable functions — one spline on each of the 3×2 edges. Trace one input value through: it hits the edge going to output 1 and gets transformed by that edge’s curve; the same input hits the edge to output 2 and gets a different curve. Each output node then sums the transformed values arriving on its incoming edges. That sum is the output. No activation afterwards — the nonlinearity already happened, on the edges.
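The 3-input, 2-output layer can be sketched as a 2×3 grid of coefficient vectors, one per edge. To keep the example short it uses order-1 "hat" bumps (piecewise-linear splines) rather than the smoother splines a real KAN would use; `hat_basis`, `kan_layer`, and the numbers in `W` are illustrative assumptions.

```python
import numpy as np

def hat_basis(x, centers):
    """Order-1 B-spline bumps on a uniform grid; shape (len(x), len(centers))."""
    h = centers[1] - centers[0]
    x = np.asarray(x, dtype=float)[:, None]
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / h)

def kan_layer(x, coef, centers):
    """x: (n_in,), coef: (n_out, n_in, n_basis). Returns (n_out,)."""
    B = hat_basis(x, centers)                     # bumps evaluated at each input
    edge_out = np.einsum("oib,ib->oi", coef, B)   # each edge applies its own curve
    return edge_out.sum(axis=1)                   # each output node just sums

centers = np.linspace(-1.0, 1.0, 5)
x = np.array([0.3, -0.8, 0.55])

# Sanity check: if every edge curve is a straight line through the origin,
# the KAN layer collapses to an ordinary matrix multiply.
W = np.array([[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]])
coef = W[:, :, None] * centers                    # edge (o, i) stores the line W[o, i] * x
y = kan_layer(x, coef, centers)
```

Here `y` matches `W @ x`, which shows that a linear layer is the special case where every edge curve happens to be a line. Any other choice of `coef` gives each edge its own nonlinear shape.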
Stack these layers and you compose 1-D functions through summing nodes, which is precisely the Kolmogorov–Arnold form. The parameter count per edge is the number of spline coefficients: the number of grid intervals plus the spline order. So a KAN edge holds several numbers where an MLP edge holds one. That is more parameters per edge, but often far fewer edges are needed for the same accuracy on structured problems.
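As a quick piece of arithmetic, the per-layer counts can be compared directly. This sketch counts only MLP weights and KAN spline coefficients; it ignores biases and any extra per-edge scale factors a particular implementation might add, and the grid size of 5 and order of 3 are example values.

```python
def mlp_layer_weights(n_in, n_out):
    # One number per edge.
    return n_in * n_out

def kan_layer_coefficients(n_in, n_out, grid_size, spline_order):
    # grid_size + spline_order spline coefficients per edge.
    return n_in * n_out * (grid_size + spline_order)

mlp_count = mlp_layer_weights(3, 2)
kan_count = kan_layer_coefficients(3, 2, grid_size=5, spline_order=3)
per_edge = kan_count // mlp_count
```

For the 3-input, 2-output layer this gives 6 weights against 48 coefficients, or 8 numbers per KAN edge.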
Watch an input value travel each edge: every edge applies its own learned curve, and the destination node sums the results. Drag the input and see each edge’s output (dot on its curve) update, then the node totals.
Here is a knob MLPs simply don’t have. Because each edge function is a spline on a grid, you can add more grid points (more bumps) to give the curve finer resolution, without changing the architecture. Train a KAN on a coarse grid until it converges, then extend the grid (refit the existing curve onto more control points) and keep training. Accuracy typically improves as resolution rises, up to the level of detail the data supports.
This gives a clean accuracy-versus-compute story you can climb deliberately: start coarse and cheap, refine where you need precision. It also mirrors classical numerical analysis — finer grids, smaller error — which is part of why KANs feel at home on scientific function-fitting.
The teal target is the function to fit; the orange spline is the edge’s best fit at the current grid resolution. Add grid points and watch the spline track the target more tightly — until, past the data’s detail, it starts chasing wiggles.
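Grid extension can be sketched for a single edge as follows. Two simplifications are assumed: the bumps are order-1 "hat" functions, and a least-squares solve stands in for gradient-descent training. The target `sin(pi * x)` and the grid sizes 5 and 17 are arbitrary choices.

```python
import numpy as np

def hat_basis(x, centers):
    """Order-1 B-spline bumps on a uniform grid; shape (len(x), len(centers))."""
    h = centers[1] - centers[0]
    x = np.asarray(x, dtype=float)[:, None]
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / h)

def fit(x, y, centers):
    # Least-squares coefficients (stand-in for "train until converged").
    coef, *_ = np.linalg.lstsq(hat_basis(x, centers), y, rcond=None)
    return coef

def rms_error(x, y, centers, coef):
    return np.sqrt(np.mean((hat_basis(x, centers) @ coef - y) ** 2))

x = np.linspace(-1.0, 1.0, 401)
y = np.sin(np.pi * x)

coarse = np.linspace(-1.0, 1.0, 5)
c_coarse = fit(x, y, coarse)
coarse_curve = hat_basis(x, coarse) @ c_coarse

# Grid extension: re-express the already learned curve on a finer grid,
# so training resumes from the same function instead of from scratch.
fine = np.linspace(-1.0, 1.0, 17)
c_init = fit(x, coarse_curve, fine)
extended_curve = hat_basis(x, fine) @ c_init

# Continue fitting on the fine grid.
c_fine = fit(x, y, fine)
err_coarse = rms_error(x, y, coarse, c_coarse)
err_fine = rms_error(x, y, fine, c_fine)
```

Because the fine grid contains every coarse grid point, `extended_curve` reproduces `coarse_curve`, and the extra coefficients then let the fit get closer to the target.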
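This is the payoff. Because every edge is a visible curve, a trained KAN can be interpreted in ways an MLP cannot: you can visualize each edge’s curve, prune edges that contribute little, and replace a learned curve with a symbolic function.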
For scientific discovery this is transformative: instead of a black box that predicts, you get a candidate law you can inspect, test, and publish. KANs have been used to rediscover physics relations and knot-theory formulas this way.
The orange curve is a messy learned edge. Press “snap” and it locks to the closest clean function (here, a sine) — turning a fuzzy spline into a symbolic term you can write down.
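One simple way to implement a snap is to score a small library of candidate functions against the learned curve and keep the best match. The sketch below fits only an output scale and offset, `y ≈ a * f(x) + b`; the candidate list, the `snap` helper, and the synthetic noisy sine are assumptions for illustration. A fuller version would also fit a scale and shift on the input.

```python
import numpy as np

CANDIDATES = {
    "sin": np.sin,
    "square": np.square,
    "exp": np.exp,
    "identity": lambda v: v,
}

def snap(x, y, candidates=CANDIDATES):
    """Return (name, a, b, r2) for the candidate f that best fits y = a * f(x) + b."""
    best = None
    for name, f in candidates.items():
        A = np.column_stack([f(x), np.ones_like(x)])
        (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # best scale and offset
        resid = y - (a * f(x) + b)
        r2 = 1.0 - resid.var() / y.var()                 # fraction of variance explained
        if best is None or r2 > best[3]:
            best = (name, a, b, r2)
    return best

# A synthetic "messy learned edge": a scaled, shifted sine plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
y = 1.5 * np.sin(x) + 0.2 + 0.05 * rng.normal(size=x.size)

name, a, b, r2 = snap(x, y)
```

The returned `name`, `a`, and `b` give a term you can write down, and `r2` tells you how much to trust the snap before committing to it.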
Let’s fit a target function with a tiny KAN and watch the edge curves form. Pick a target, press train, and the spline coefficients adjust until the network’s output traces the target. Increase the grid for finer detail; the readout shows the fit error and the parameter count — compare it to how many an MLP would need.
Teal = target function, orange = the KAN’s current fit. Press Train and the spline coefficients descend toward the target. Switch targets and grids; watch how few parameters a KAN needs to capture a clean function.
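The training loop behind a demo like this can be sketched for the smallest possible case: a single edge fit by plain gradient descent on mean squared error. Order-1 "hat" bumps are assumed for brevity, and the target, learning rate, step count, and `train_edge` helper are illustrative choices, not settings from any particular implementation.

```python
import numpy as np

def hat_basis(x, centers):
    """Order-1 B-spline bumps on a uniform grid; shape (len(x), len(centers))."""
    h = centers[1] - centers[0]
    x = np.asarray(x, dtype=float)[:, None]
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / h)

def train_edge(target, n_basis=9, steps=2000, lr=1.0):
    """Fit one edge's coefficients to target(x) on [-1, 1] by gradient descent."""
    x = np.linspace(-1.0, 1.0, 200)
    y = target(x)
    B = hat_basis(x, np.linspace(-1.0, 1.0, n_basis))
    coef = np.zeros(n_basis)                 # start from a flat curve
    losses = []
    for _ in range(steps):
        err = B @ coef - y                   # current fit minus target
        losses.append(np.mean(err ** 2))     # fit error (MSE) before this update
        grad = 2.0 * B.T @ err / x.size      # gradient of MSE w.r.t. the coefficients
        coef -= lr * grad                    # descend
    return coef, losses

coef, losses = train_edge(lambda x: np.sin(np.pi * x))
```

The parameter count here is `len(coef)`, nine numbers, and `losses` is the fit-error readout: it starts near the mean square of the target and falls as the bump heights settle.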
On clean, structured targets a handful of spline coefficients capture the function closely, and you can see the answer in the curve. That combination of parameter efficiency and legibility, on the right problems, is the whole pitch.
KANs are exciting but young, and honesty about their costs matters.
For a structured target, the KAN (teal) reaches low error with few parameters; the MLP (orange) needs more to match. Drag the parameter budget. (On messy high-dimensional data the curves often flip, so the kind of problem matters.)
| | MLP | KAN |
|---|---|---|
| learnable part | weights (numbers) | edge functions (splines) |
| nodes | fixed activation | sum |
| theory | universal approximation | Kolmogorov–Arnold |
| interpretable? | hard | yes (visualize/prune/symbolic) |
| scales to billions? | yes | unproven |
| best at | high-dim, raw scale | low-dim, structured, science |
→ Activation Functions — the fixed node nonlinearities KANs replace
→ Initialization — why parameter structure matters for training
→ Neural ODEs — another rethink of what a “layer” is
→ Embedding Layers — more on what networks actually store