AI Architectures

KANs

Kolmogorov–Arnold Networks flip the MLP inside out. Move the nonlinearity from the nodes to the edges and make it learnable: replace every weight with a learnable 1-D function, and let the nodes just add. The result is a network you can actually read.

Prerequisites: (1) an MLP is weighted sums followed by a fixed activation like ReLU; (2) a curve can be built from simple bumps. That’s it.
10 chapters · 9+ simulations · 0 assumed knowledge

Chapter 0: The Black Box

Train an MLP and you get millions of numbers — the weights. Each one is a multiplier on a connection. Stare at any single weight and it tells you nothing; meaning is smeared across all of them at once. The network works, but you cannot read it. For a chatbot, fine. For discovering the law a physical system obeys, that opacity is a wall.

KANs (Liu et al., 2024) ask: what if a network’s learnable parts were functions you could look at, not opaque scalars? What if you could point at a connection and see “ah, this one learned a sine wave; that one learned a square”? That is the promise — and it comes from one structural change to the MLP, which the next chapters build up carefully.

The trap: “A bigger network is a better network.” For raw prediction, often true. But for science — finding the compact formula behind data — you want the smallest, most legible model. KANs trade some of the MLP’s brute scalability for interpretability and parameter efficiency on structured, function-fitting problems.
Opaque weights vs. readable functions

Left: an MLP’s edges are numbers — meaningless alone. Right: a KAN’s edges are little curves — each one a learned 1-D function you can recognize. Toggle to compare what “looking inside” gives you.

What problem with MLPs are KANs primarily designed to address?

Chapter 1: The MLP, precisely

To appreciate the flip, pin down exactly what an MLP neuron does. It takes each incoming value, multiplies it by a learnable weight (the weight lives on the edge), sums all those products, adds a bias, and then passes the sum through a fixed activation function — ReLU, say — that lives on the node.

node output = ReLU(w₁x₁ + w₂x₂ + … + b)

Two things to notice. The learnable parts are the weights, and they’re on the edges. The nonlinearity is fixed (you chose ReLU; the network can’t change its shape), and it’s on the node. This split, learnable linear weights on edges and a fixed nonlinearity on nodes, is the DNA of every MLP. Its theoretical license is the universal approximation theorem: with enough such neurons, an MLP can approximate any continuous function on a bounded domain as closely as you like.
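The neuron described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any library: the names `relu` and `mlp_neuron`, and the example inputs, weights, and bias, are assumptions chosen for the demo.

```python
def relu(z: float) -> float:
    """Fixed activation on the node: its shape never changes during training."""
    return max(0.0, z)


def mlp_neuron(inputs, weights, bias):
    """Weighted sum over the incoming edges, plus a bias, then the fixed activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)


# Two inputs, two learnable edge weights, one learnable bias (all example values).
active = mlp_neuron([2.0, 1.0], weights=[1.0, -0.5], bias=0.1)   # pre-activation is positive
silent = mlp_neuron([-3.0, 0.0], weights=[1.0, -0.5], bias=0.0)  # pre-activation is negative
print(active, silent)
```

Training would change only `weights` and `bias`; `relu` stays exactly as written.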

An MLP neuron

Inputs scaled by edge weights (drag them), summed at the node, then squashed by a fixed activation. The weights learn; the activation’s shape never does.

weight w₁: 1.0
weight w₂: -0.5

In a standard MLP, what is learnable and where does it live?

Chapter 2: The Flip

Here is the entire idea of a KAN in one move. Swap what’s learnable and what’s fixed; swap the roles of edges and nodes. Put a learnable 1-D function on every edge, and make every node just sum its inputs — no fixed activation at all. The nonlinearity is now learned and lives on the edges; the nodes are dumb adders.

      | edges                         | nodes
MLP   | learnable numbers (weights)   | fixed nonlinearity (ReLU)
KAN   | learnable functions (splines) | just sum

So where an MLP edge does “multiply by 0.7,” a KAN edge does “apply this whole curve to the input.” The curve might be a sine, a parabola, a step — whatever the data demands. The node downstream simply adds up the curve-outputs.

The theoretical license is different too. MLPs lean on universal approximation. KANs lean on the Kolmogorov–Arnold representation theorem, which says any continuous function of several variables on a bounded domain can be written using only sums and compositions of continuous single-variable functions. That is the shape of a KAN: 1-D functions on edges, summed at nodes, composed across layers. The theorem itself needs only two such layers; a KAN keeps the same building blocks and allows any width and depth.
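For reference, the representation the theorem guarantees is usually written as below, for a continuous function f on the unit cube [0,1]^n. The symbol names Φ_q (outer functions) and φ_{q,p} (inner functions) follow the common textbook statement and are not taken from the text above; all of them are continuous functions of a single variable.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

Read as a network, the inner sum is one layer of 1-D functions feeding 2n+1 summing nodes, and the outer sum is a second layer of 1-D functions feeding a single summing node.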

MLP vs. KAN, side by side

Same two-input, one-output unit, drawn both ways. Toggle: in the MLP the edges are weights and the node squashes; in the KAN the edges are curves and the node only sums.

The single structural change that turns an MLP into a KAN is:

Chapter 3: What Is a Learnable Function? Splines

“A learnable 1-D function” sounds vague — how do you store and train a curve? KANs use B-splines. The trick: lay down a row of fixed, overlapping bump-shaped basis functions across the input range, give each bump a learnable height (a coefficient), and add them up. The sum is a smooth curve, and its shape is controlled entirely by those heights — which are the parameters gradient descent adjusts.

edge function φ(x) = c₁B₁(x) + c₂B₂(x) + … + c_G B_G(x)  (+ a SiLU base term)

Each bump is local — it only affects a small piece of the curve — so raising one coefficient bends the curve in just that region, leaving the rest alone. That locality is what makes splines so controllable. KANs also add a small fixed base function (a SiLU) to each edge so the curve has a sensible default and trains stably. The number of bumps, set by the grid, controls how wiggly a curve the edge can represent.
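The bump construction can be made concrete with the standard Cox-de Boor recursion for B-spline basis functions. What follows is a pure-Python sketch under stated assumptions, not the implementation of any KAN library: the function names, the uniform grid on [-1, 1], the grid size 5, the spline degree 3, and the unit weight on the SiLU term are all illustrative choices.

```python
import math


def bspline_basis(x, knots, i, k):
    """Value at x of the i-th B-spline bump of degree k (Cox-de Boor recursion)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k] - knots[i]) * bspline_basis(x, knots, i, k - 1)
    right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
             * bspline_basis(x, knots, i + 1, k - 1))
    return left + right


def silu(x):
    """Fixed base function: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def edge_function(x, coeffs, knots, k, base_weight=1.0):
    """phi(x) = base_weight * SiLU(x) + sum_i coeffs[i] * B_i(x)."""
    spline = sum(c * bspline_basis(x, knots, i, k) for i, c in enumerate(coeffs))
    return base_weight * silu(x) + spline


G, k = 5, 3                      # assumed: 5 grid intervals on [-1, 1], cubic bumps
h = 2.0 / G
knots = [-1.0 + (j - k) * h for j in range(G + 2 * k + 1)]  # grid padded by k knots per side
n_basis = len(knots) - k - 1     # number of bumps = G + k

flat = [0.0] * n_basis
bumped = list(flat)
bumped[2] = 1.0                  # raise one bump; its support is [knots[2], knots[6])

# Difference the raised bump makes inside and outside its support.
inside = edge_function(-0.5, bumped, knots, k) - edge_function(-0.5, flat, knots, k)
outside = edge_function(0.9, bumped, knots, k) - edge_function(0.9, flat, knots, k)
print(n_basis, inside, outside)
```

Raising the single coefficient changes the curve at x = -0.5, which lies under that bump, and leaves it untouched at x = 0.9, which does not. That is the locality the text describes.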

Build a curve from bumps

Faint curves are the fixed basis bumps; the bold teal curve is their weighted sum — the learned edge function. Drag a coefficient and watch only its local region bend. This is what training a KAN edge actually adjusts.

bump 2 height: 1.0
bump 4 height: -0.5

How does a KAN store a learnable 1-D function on an edge?

Chapter 4: The KAN Layer (data flow)

Now assemble a layer. Say it has 3 inputs and 2 outputs. In an MLP that’s a 3×2 weight matrix — 6 numbers. In a KAN it’s 6 learnable functions — one spline on each of the 3×2 edges. Trace one input value through: it hits the edge going to output 1 and gets transformed by that edge’s curve; the same input hits the edge to output 2 and gets a different curve. Each output node then sums the transformed values arriving on its incoming edges. That sum is the output. No activation afterwards — the nonlinearity already happened, on the edges.

inputs [3]
x₁, x₂, x₃
↓ each edge applies its own spline φᵢⱼ
edges [3×2]
6 learnable curves transform each input per destination
↓ node = sum of incoming
outputs [2]
yⱼ = φ₁ⱼ(x₁) + φ₂ⱼ(x₂) + φ₃ⱼ(x₃)

Stack these layers and you compose 1-D functions through summing nodes — precisely the Kolmogorov–Arnold form. Parameter count per edge is the number of spline coefficients (the grid size plus the spline order), so a KAN edge holds several numbers where an MLP edge holds one — more parameters per edge, but often far fewer edges needed for the same accuracy on structured problems.
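The data flow of the 3-input, 2-output layer can be sketched as follows. To keep the example short, the six edge functions are hand-picked curves standing in for trained splines, and the parameter count assumes a grid size of 5 and a spline order of 3; the names `kan_layer` and `spline_coeffs_per_edge` and all the numbers are illustrative assumptions.

```python
import math


def kan_layer(x, phi):
    """phi[i][j] is the 1-D function on the edge from input i to output j.

    Each output node only sums what arrives on its incoming edges.
    """
    n_in, n_out = len(phi), len(phi[0])
    return [sum(phi[i][j](x[i]) for i in range(n_in)) for j in range(n_out)]


# 3 inputs x 2 outputs = 6 edge functions (stand-ins for learned splines).
phi = [
    [math.sin,         lambda t: t ** 2],  # edges leaving x1
    [math.exp,         abs],               # edges leaving x2
    [lambda t: 2 * t,  math.tanh],         # edges leaving x3
]

y = kan_layer([0.0, 0.0, 1.0], phi)
# y[0] = sin(0) + exp(0) + 2*1,  y[1] = 0**2 + |0| + tanh(1)


def spline_coeffs_per_edge(grid_size, order):
    """Coefficients per edge as stated in the text: grid size plus spline order."""
    return grid_size + order


n_edges = 3 * 2
kan_params = n_edges * spline_coeffs_per_edge(grid_size=5, order=3)  # spline coefficients only
mlp_params = n_edges                                                 # one weight per edge, biases not counted
print(y, kan_params, mlp_params)
```

With these assumed settings the KAN layer stores 48 spline coefficients where the MLP layer stores 6 weights, which is the per-edge overhead the paragraph above refers to.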

One input flowing through KAN edges

Watch an input value travel each edge: every edge applies its own learned curve, and the destination node sums the results. Drag the input and see each edge’s output (dot on its curve) update, then the node totals.

input value: 0.30

A KAN layer with 3 inputs and 2 outputs has how many learnable functions, and what do the nodes do?

Chapter 5: Grid Refinement

Here is a knob MLPs simply don’t have. Because each edge function is a spline on a grid, you can add more grid points (more bumps) to give the curve finer resolution, without changing the architecture. Train a KAN on a coarse grid until it converges, then extend the grid (interpolate the existing curve onto more control points) and keep training. Accuracy typically improves as resolution rises, as long as the data can support the extra detail.

This gives a clean accuracy-versus-compute story you can climb deliberately: start coarse and cheap, refine where you need precision. It also mirrors classical numerical analysis — finer grids, smaller error — which is part of why KANs feel at home on scientific function-fitting.
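The "extend the grid" step can be illustrated with the simplest possible spline. The sketch below assumes piecewise-linear (degree-1) splines, where each coefficient is just the curve's height at its knot, so sampling the coarse curve at the new knots reproduces it exactly whenever the fine grid contains the coarse knots. Higher-order splines would typically need a least-squares fit instead. All function names here are hypothetical.

```python
def hat_spline(x, knots, coeffs):
    """Piecewise-linear spline: coeffs[i] is the curve's height at knots[i]."""
    if x <= knots[0]:
        return coeffs[0]
    if x >= knots[-1]:
        return coeffs[-1]
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            t = (x - knots[i]) / (knots[i + 1] - knots[i])
            return (1 - t) * coeffs[i] + t * coeffs[i + 1]


def uniform_knots(lo, hi, n_intervals):
    step = (hi - lo) / n_intervals
    return [lo + i * step for i in range(n_intervals + 1)]


def extend_grid(knots, coeffs, n_intervals):
    """Re-express the same curve on a finer uniform grid over the same range."""
    fine = uniform_knots(knots[0], knots[-1], n_intervals)
    return fine, [hat_spline(t, knots, coeffs) for t in fine]


coarse_knots = uniform_knots(0.0, 1.0, 2)   # knots at 0, 0.5, 1
coarse_coeffs = [0.0, 1.0, 0.0]             # a single tent, standing in for a trained edge
fine_knots, fine_coeffs = extend_grid(coarse_knots, coarse_coeffs, 4)
print(fine_coeffs)
```

The finer grid starts out representing the same curve with more coefficients; further training is then free to move the new coefficients independently.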

Common misconception: “just crank the grid to the max.” Too fine a grid with too little data overfits — the spline wiggles to chase noise between data points. The grid resolution must match how much data and how much true detail you have, exactly like choosing model capacity. Refine toward the data, not past it.
Refining the grid fits the target

The teal target is the function to fit; the orange spline is the edge’s best fit at the current grid resolution. Add grid points and watch the spline track the target more tightly — until, past the data’s detail, it starts chasing wiggles.

grid points: 5

Grid refinement in a KAN lets you:

Chapter 6: Reading the Network

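This is the payoff. Because every edge is a visible curve, a trained KAN can be interpreted in ways an MLP cannot: you can visualize each edge’s learned curve, prune the edges that contribute little, and snap the remaining curves to known symbolic functions to get a formula.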

For scientific discovery this is transformative: instead of a black box that predicts, you get a candidate law you can inspect, test, and publish. KANs have been used to rediscover physics relations and knot-theory formulas this way.
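One plausible way to turn a learned edge into a symbolic term is to compare it against a small library of candidate functions and keep the best match. The sketch below is an assumption about how such a step could work, not the procedure of any particular KAN library: the candidate list, the affine fit y ≈ a·f(x) + b, and the synthetic "learned" curve are all made up for illustration.

```python
import math

# Hypothetical library of clean functions an edge might be snapped to.
CANDIDATES = {
    "sin": math.sin,
    "square": lambda x: x * x,
    "exp": math.exp,
    "identity": lambda x: x,
}


def fit_affine(f, xs, ys):
    """Least-squares a, b for ys ~ a*f(x) + b; returns (a, b, mean squared error)."""
    fs = [f(x) for x in xs]
    n = len(xs)
    mean_f, mean_y = sum(fs) / n, sum(ys) / n
    var_f = sum((v - mean_f) ** 2 for v in fs)
    a = sum((v - mean_f) * (y - mean_y) for v, y in zip(fs, ys)) / var_f if var_f > 0 else 0.0
    b = mean_y - a * mean_f
    mse = sum((a * v + b - y) ** 2 for v, y in zip(fs, ys)) / n
    return a, b, mse


def snap(xs, ys):
    """Return the name and fit of the candidate with the lowest error."""
    fits = {name: fit_affine(f, xs, ys) for name, f in CANDIDATES.items()}
    best = min(fits, key=lambda name: fits[name][2])
    return best, fits[best]


# A "messy learned edge": a scaled, shifted sine with a small wiggle on top.
xs = [-3.0 + 6.0 * i / 60 for i in range(61)]
ys = [2.0 * math.sin(x) + 0.5 + 0.02 * math.cos(17 * x) for x in xs]

name, (a, b, mse) = snap(xs, ys)
print(name, round(a, 2), round(b, 2))
```

The matched name together with `a` and `b` gives a term you can write down; the residual error is a useful check on whether the snap was justified.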

Snap a learned edge to a known function

The orange curve is a messy learned edge. Press “snap” and it locks to the closest clean function (here, a sine) — turning a fuzzy spline into a symbolic term you can write down.

Why can a KAN be turned into a closed-form formula while an MLP usually cannot?

Chapter 7: Fitting a Function, Live (showcase)

Let’s fit a target function with a tiny KAN and watch the edge curves form. Pick a target, press train, and the spline coefficients adjust until the network’s output traces the target. Increase the grid for finer detail; the readout shows the fit error and the parameter count — compare it to how many an MLP would need.

Train a KAN edge to match a target

Teal = target function, orange = the KAN’s current fit. Press Train and the spline coefficients descend toward the target. Switch targets and grids; watch how few parameters a KAN needs to capture a clean function.

target: sine
grid points: 7

On clean, structured targets a handful of spline coefficients capture the function to within a small error, and you can see the answer in the curve. That combination of parameter efficiency and legibility, on the right problems, is the whole pitch.
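The training loop behind such a demo can be approximated in plain Python. This sketch makes several simplifying assumptions that are not from the text: the edge uses piecewise-linear bumps (degree-1 B-splines) instead of smoother ones, there is no SiLU base term, the loss is mean squared error on 200 evenly spaced samples of [-π, π], and the learning rate and step count are hand-picked.

```python
import math


def hat_basis(x, knots):
    """Values at x of the piecewise-linear bumps centred on each knot."""
    h = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - t) / h) for t in knots]


def train_edge(target, n_grid, steps=500, lr=2.0, n_samples=200):
    lo, hi = -math.pi, math.pi
    knots = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    xs = [lo + (hi - lo) * n / (n_samples - 1) for n in range(n_samples)]
    ys = [target(x) for x in xs]
    basis = [hat_basis(x, knots) for x in xs]  # fixed bumps, evaluated once
    coeffs = [0.0] * n_grid                    # learnable heights, start flat

    def predict(row):
        return sum(c * b for c, b in zip(coeffs, row))

    def mse():
        return sum((predict(row) - y) ** 2 for row, y in zip(basis, ys)) / n_samples

    initial = mse()
    for _ in range(steps):
        residuals = [predict(row) - y for row, y in zip(basis, ys)]
        # Gradient of the mean squared error with respect to each coefficient.
        grads = [2.0 / n_samples * sum(r * row[i] for r, row in zip(residuals, basis))
                 for i in range(n_grid)]
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs, initial, mse()


coeffs7, initial7, mse7 = train_edge(math.sin, n_grid=7)
coeffs13, _, mse13 = train_edge(math.sin, n_grid=13)
print(f"7 grid points:  error {initial7:.4f} -> {mse7:.5f}")
print(f"13 grid points: final error {mse13:.6f}")
```

Only the coefficients move during training; the bumps themselves stay fixed. Running it with 7 and then 13 grid points shows the finer grid reaching a lower fit error on the same sine target.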

Chapter 8: Trade-offs & Where KANs Fit

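KANs are exciting but young, and honesty about their costs matters. Each edge stores several spline coefficients where an MLP edge stores one number, and whether KANs scale to billions of parameters is still unproven.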

The right niche: reach for a KAN when the goal is understanding — symbolic regression, scientific law discovery, low-dimensional function fitting where a readable, compact model is worth more than raw throughput. Reach for an MLP/transformer when the goal is scale and raw predictive performance on high-dimensional data.
Accuracy per parameter: KAN vs. MLP (on a clean function)

For a structured target, the KAN (teal) reaches low error with few parameters; the MLP (orange) needs more to match. Drag the parameter budget. (On messy high-dim data the curves often flip — niche matters.)

parameter budget: 0.40

For which task is a KAN most clearly the right tool?

Chapter 9: Cheat Sheet & Connections

MLP
learnable weights on edges + fixed activation on nodes (universal approximation)
↓ the flip
KAN
learnable 1-D functions on edges + sum on nodes (Kolmogorov–Arnold theorem)
↓ how the function is stored
B-spline edges
sum of local basis bumps with learnable heights, + SiLU base; grid sets resolution
↓ refine & read
grid extension + symbolic
add grid points for accuracy; prune + snap edges to known functions → a formula
                    | MLP                     | KAN
learnable part      | weights (numbers)       | edge functions (splines)
nodes               | fixed activation        | sum
theory              | universal approximation | Kolmogorov–Arnold
interpretable?      | hard                    | yes (visualize/prune/symbolic)
scales to billions? | yes                     | unproven
best at             | high-dim, raw scale     | low-dim, structured, science

Keep exploring

Activation Functions — the fixed node nonlinearities KANs replace
Initialization — why parameter structure matters for training
Neural ODEs — another rethink of what a “layer” is
Embedding Layers — more on what networks actually store

“What I cannot create, I do not understand.” You just rebuilt the KAN from one move: take the MLP, move the nonlinearity from the nodes to the edges, and make it learnable by replacing each weight with a curve. The reward is a network whose every connection you can read, refine, and turn into an equation.