KANs

Kolmogorov–Arnold Networks flip the MLP inside out. Move the nonlinearity from the nodes to the edges and make it learnable: replace every weight with a learnable 1-D function, and let the nodes just add. The result is a network you can actually read.

Prerequisites: (1) an MLP is weighted sums followed by a fixed activation like ReLU; (2) a curve can be built from simple bumps. That’s it.

Chapter 0: The Black Box

Train an MLP and you get millions of numbers — the weights. Each one is a multiplier on a connection. Stare at any single weight and it tells you nothing; meaning is smeared across all of them at once. The network works, but you cannot read it. For a chatbot, fine. For discovering the law a physical system obeys, that opacity is a wall.

KANs (Liu et al., 2024) ask: what if a network’s learnable parts were functions you could look at, not opaque scalars? What if you could point at a connection and see “ah, this one learned a sine wave; that one learned a square”? That is the promise — and it comes from one structural change to the MLP, which the next chapters build up carefully.

The trap: “A bigger network is a better network.” For raw prediction, often true. But for science — finding the compact formula behind data — you want the smallest, most legible model. KANs trade some of the MLP’s brute scalability for interpretability and parameter efficiency on structured, function-fitting problems.

Opaque weights vs. readable functions

Left: an MLP’s edges are numbers — meaningless alone. Right: a KAN’s edges are little curves — each one a learned 1-D function you can recognize. Toggle to compare what “looking inside” gives you.

What problem with MLPs are KANs primarily designed to address?

They train too slowly
Their learned parameters are opaque scalars, hard to interpret or extract a formula from
They cannot represent nonlinear functions

Chapter 1: The MLP, precisely

To appreciate the flip, pin down exactly what an MLP neuron does. It takes each incoming value, multiplies it by a learnable weight (the weight lives on the edge), sums all those products, adds a bias, and then passes the sum through a fixed activation function — ReLU, say — that lives on the node.

node output = ReLU( w₁x₁ + w₂x₂ + … + b )

Two things to notice. The learnable parts are the weights, and they’re on the edges. The nonlinearity is fixed (you chose ReLU; the network can’t change its shape), and it’s on the node. This split (learnable linear weights on edges, fixed nonlinearity on nodes) is the DNA of every MLP. Its theoretical license is the universal approximation theorem: with enough such neurons, an MLP can approximate any continuous function on a compact (closed and bounded) domain to any desired accuracy.
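Here is a minimal sketch of that neuron in plain Python. The function names and the sample numbers are made up for illustration; only the structure (weights on the edges, one fixed activation on the node) comes from the description above.

```python
def relu(z):
    """Fixed activation: its shape is chosen up front and never trained."""
    return max(0.0, z)

def mlp_neuron(inputs, weights, bias):
    """Weighted sum over the incoming edges, then a fixed nonlinearity on the node."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(pre_activation)

# Illustrative values only: two inputs, weights 1.0 and -0.5, bias 0.1.
# Pre-activation is 2.0*1.0 + 1.0*(-0.5) + 0.1, which is positive, so ReLU passes it through.
output = mlp_neuron([2.0, 1.0], [1.0, -0.5], 0.1)

# A negative pre-activation is clipped to zero by the fixed activation.
clipped = mlp_neuron([1.0], [-1.0], 0.0)
```

Training would only ever change `weights` and `bias`; nothing in this code lets the shape of `relu` adapt to the data.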

An MLP neuron

Inputs scaled by edge weights (drag them), summed at the node, then squashed by a fixed activation. The weights learn; the activation’s shape never does.

weight w₁: 1.0

weight w₂: -0.5

In a standard MLP, what is learnable and where does it live?

The activation shape, on the edges
Linear weights, on the edges; the nonlinearity is fixed and on the nodes
Nothing: MLPs have no learnable parameters

Chapter 2: The Flip

Here is the entire idea of a KAN in one move. Swap what’s learnable and what’s fixed; swap the roles of edges and nodes. Put a learnable 1-D function on every edge, and make every node just sum its inputs — no fixed activation at all. The nonlinearity is now learned and lives on the edges; the nodes are dumb adders.

	edges	nodes
MLP	learnable numbers (weights)	fixed nonlinearity (ReLU)
KAN	learnable functions (splines)	just sum

So where an MLP edge does “multiply by 0.7,” a KAN edge does “apply this whole curve to the input.” The curve might be a sine, a parabola, a step — whatever the data demands. The node downstream simply adds up the curve-outputs.
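The flip is easy to see in code. In this sketch the edge curves are hand-picked functions standing in for learned splines (an assumption for illustration), and `kan_node` is a hypothetical name.

```python
import math

def kan_node(inputs, edge_functions):
    """A KAN node only adds: each edge has already applied its own 1-D function."""
    return sum(phi(x) for phi, x in zip(edge_functions, inputs))

# Stand-ins for trained edges: one edge "learned" a sine, the other a square.
edges = [math.sin, lambda x: x ** 2]

# Node output is sin(0.5) + 3.0**2; no activation is applied after the sum.
y = kan_node([0.5, 3.0], edges)
```

Compare this with an MLP neuron: there the per-edge operation is a single multiplication and the nonlinearity comes after the sum; here the per-edge operation is a whole curve and nothing comes after the sum.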

The theoretical license is different too. MLPs lean on universal approximation. KANs lean on the Kolmogorov–Arnold representation theorem, which says any continuous function of several variables on a bounded domain can be written using only two ingredients: continuous single-variable functions and addition. That is exactly the shape of a KAN: 1-D functions on edges, summed at nodes, composed across layers. The theorem itself describes one fixed two-layer form; a KAN keeps the same building blocks but allows any width and depth, so the architecture is the theorem’s structure generalized.
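For reference, this is how the theorem is commonly stated for a continuous function f on the unit cube [0, 1]^n. The symbol names Φ_q and φ_{q,p} follow a common convention and are not taken from this page.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

Every Φ_q and φ_{q,p} is a continuous function of one variable, so the only multivariable operation is addition. Read as a network, this is two layers of 1-D functions with 2n + 1 summing nodes in the middle.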

MLP vs. KAN, side by side

Same two-input, one-output unit, drawn both ways. Toggle: in the MLP the edges are weights and the node squashes; in the KAN the edges are curves and the node only sums.

The single structural change that turns an MLP into a KAN is:

adding more layers
using a different optimizer
putting learnable 1-D functions on the edges and making nodes just sum (no fixed activation)

Chapter 3: What Is a Learnable Function? Splines

“A learnable 1-D function” sounds vague — how do you store and train a curve? KANs use B-splines. The trick: lay down a row of fixed, overlapping bump-shaped basis functions across the input range, give each bump a learnable height (a coefficient), and add them up. The sum is a smooth curve, and its shape is controlled entirely by those heights — which are the parameters gradient descent adjusts.

edge function φ(x) = c₁B₁(x) + c₂B₂(x) + … + c_n B_n(x) (+ a SiLU base term), where n = G + k for grid size G and spline order k

Each bump is local — it only affects a small piece of the curve — so raising one coefficient bends the curve in just that region, leaving the rest alone. That locality is what makes splines so controllable. KANs also add a small fixed base function (a SiLU) to each edge so the curve has a sensible default and trains stably. The number of bumps, set by the grid, controls how wiggly a curve the edge can represent.
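To make the bumps concrete, here is a sketch of one edge function built with the standard Cox-de Boor recursion for B-splines. The helper names, the uniform knot layout, and the `base_weight` parameter are assumptions of this sketch, not a description of any particular KAN library.

```python
import math

def uniform_knots(lo, hi, grid_size, k):
    """Uniform knots on [lo, hi], padded with k extra intervals per side."""
    h = (hi - lo) / grid_size
    return [lo + (j - k) * h for j in range(grid_size + 2 * k + 1)]

def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion: the i-th degree-k bump evaluated at x."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = (x - knots[i]) / (knots[i + k] - knots[i])
    right = (knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
    return (left * bspline_basis(x, knots, i, k - 1)
            + right * bspline_basis(x, knots, i + 1, k - 1))

def silu(x):
    """The fixed base function added to every edge."""
    return x / (1.0 + math.exp(-x))

def edge_function(x, coeffs, knots, k, base_weight=1.0):
    """phi(x) = base_weight * silu(x) + sum_i coeffs[i] * B_i(x)."""
    spline = sum(c * bspline_basis(x, knots, i, k) for i, c in enumerate(coeffs))
    return base_weight * silu(x) + spline

G, K = 5, 3                              # grid size and spline order (illustrative)
knots = uniform_knots(-1.0, 1.0, G, K)
flat = [0.0] * (G + K)                   # one learnable height per bump
bumped = list(flat)
bumped[0] = 1.0                          # raise only the left-most bump

# Locality: x = 0.5 lies outside the left-most bump, so the curve there is unchanged.
before = edge_function(0.5, flat, knots, K)
after = edge_function(0.5, bumped, knots, K)
# Near the left edge the same change does bend the curve.
shift_left = edge_function(-0.9, bumped, knots, K) - edge_function(-0.9, flat, knots, K)
```

With grid size 5 and order 3 there are 5 + 3 = 8 bumps, so this one edge carries 8 trainable heights.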

Build a curve from bumps

Faint curves are the fixed basis bumps; the bold teal curve is their weighted sum — the learned edge function. Drag a coefficient and watch only its local region bend. This is what training a KAN edge actually adjusts.

bump 2 height: 1.0

bump 4 height: -0.5

How does a KAN store a learnable 1-D function on an edge?

As a single weight, like an MLP
As a sum of fixed local basis bumps, each with a learnable height (a B-spline)
As a lookup table of every possible input

Chapter 4: The KAN Layer (data flow)

Now assemble a layer. Say it has 3 inputs and 2 outputs. In an MLP that’s a 3×2 weight matrix — 6 numbers. In a KAN it’s 6 learnable functions — one spline on each of the 3×2 edges. Trace one input value through: it hits the edge going to output 1 and gets transformed by that edge’s curve; the same input hits the edge to output 2 and gets a different curve. Each output node then sums the transformed values arriving on its incoming edges. That sum is the output. No activation afterwards — the nonlinearity already happened, on the edges.

inputs [3]

x₁, x₂, x₃

↓ each edge applies its own spline φ_ij

edges [3×2]

6 learnable curves transform each input per destination

↓ node = sum of incoming

outputs [2]

y_j = φ_1j(x₁) + φ_2j(x₂) + φ_3j(x₃), for j = 1, 2

Stack these layers and you compose 1-D functions through summing nodes — precisely the Kolmogorov–Arnold form. Parameter count per edge is the number of spline coefficients (the grid size plus the spline order), so a KAN edge holds several numbers where an MLP edge holds one — more parameters per edge, but often far fewer edges needed for the same accuracy on structured problems.
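The 3-input, 2-output layer can be sketched as a grid of 1-D functions. Fixed curves stand in for trained splines here (an assumption for illustration), and both function names are hypothetical. The parameter count covers spline coefficients only, using the grid-size-plus-order rule from the text.

```python
import math

def kan_layer(x, phi):
    """phi[i][j] is the 1-D function on the edge from input i to output j."""
    n_out = len(phi[0])
    return [sum(phi[i][j](x[i]) for i in range(len(x))) for j in range(n_out)]

def kan_layer_param_count(n_in, n_out, grid_size, spline_order):
    """Spline coefficients only: (grid_size + spline_order) per edge."""
    return n_in * n_out * (grid_size + spline_order)

# 3 inputs, 2 outputs: 6 edges, one curve each.
phi = [
    [math.sin, math.cos],               # edges leaving x1
    [lambda v: v ** 2, lambda v: -v],   # edges leaving x2
    [math.tanh, abs],                   # edges leaving x3
]

# y[0] = sin(0.3) + (-1.0)**2 + tanh(2.0);  y[1] = cos(0.3) + 1.0 + 2.0
y = kan_layer([0.3, -1.0, 2.0], phi)

# 6 edges * (5 + 3) coefficients = 48, versus 6 weights for the MLP matrix.
n_params = kan_layer_param_count(3, 2, grid_size=5, spline_order=3)
```

The same input value `x[i]` is sent down two different edges and comes out transformed two different ways, which is the per-destination behavior traced in the diagram.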

One input flowing through KAN edges

Watch an input value travel each edge: every edge applies its own learned curve, and the destination node sums the results. Drag the input and see each edge’s output (dot on its curve) update, then the node totals.

input value: 0.30

A KAN layer with 3 inputs and 2 outputs has how many learnable functions, and what do the nodes do?

5 functions; nodes multiply
6 functions (one per edge); nodes just sum the incoming transformed values
1 function shared by all; nodes apply ReLU

Chapter 5: Grid Refinement

Here is a knob MLPs simply don’t have. Because each edge function is a spline on a grid, you can add more grid points — more bumps — to give the curve finer resolution, without changing the architecture. Train a KAN on a coarse grid until it converges, then extend the grid (interpolate the existing curve onto more control points) and keep training. Accuracy improves smoothly as resolution rises.

This gives a clean accuracy-versus-compute story you can climb deliberately: start coarse and cheap, refine where you need precision. It also mirrors classical numerical analysis — finer grids, smaller error — which is part of why KANs feel at home on scientific function-fitting.

Common misconception: “just crank the grid to the max.” Too fine a grid with too little data overfits — the spline wiggles to chase noise between data points. The grid resolution must match how much data and how much true detail you have, exactly like choosing model capacity. Refine toward the data, not past it.
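Grid extension can be sketched in a few lines if you simplify to piecewise-linear (order-1) splines, where each coefficient is just the curve’s height at its knot. That simplification is an assumption of this sketch: with higher-order B-splines the new coefficients would typically come from a least-squares fit to the old curve instead. The function names are hypothetical.

```python
def hat_spline(x, knots, coeffs):
    """Piecewise-linear spline: coeffs[i] is the curve's height at knots[i]."""
    if x <= knots[0]:
        return coeffs[0]
    if x >= knots[-1]:
        return coeffs[-1]
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            t = (x - knots[i]) / (knots[i + 1] - knots[i])
            return (1 - t) * coeffs[i] + t * coeffs[i + 1]
    return coeffs[-1]  # not reached for sorted knots; kept as a safe default

def refine(knots, coeffs, factor=2):
    """Split every interval into `factor` pieces and sample the old curve there."""
    fine_knots = []
    for a, b in zip(knots, knots[1:]):
        fine_knots += [a + (b - a) * s / factor for s in range(factor)]
    fine_knots.append(knots[-1])
    fine_coeffs = [hat_spline(t, knots, coeffs) for t in fine_knots]
    return fine_knots, fine_coeffs

coarse_knots = [0.0, 0.5, 1.0]
coarse_coeffs = [0.0, 1.0, 0.0]          # a single tent, as if already trained
fine_knots, fine_coeffs = refine(coarse_knots, coarse_coeffs)
```

Because the fine grid contains every coarse knot, the refined curve starts out identical to the old one. Training then continues with five adjustable heights instead of three.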

Refining the grid fits the target

The teal target is the function to fit; the orange spline is the edge’s best fit at the current grid resolution. Add grid points and watch the spline track the target more tightly — until, past the data’s detail, it starts chasing wiggles.

grid points: 5

Grid refinement in a KAN lets you:

add more layers automatically
increase each edge spline’s resolution for finer accuracy without changing the architecture (but too fine overfits)
remove the need for training data

Chapter 6: Reading the Network

This is the payoff. Because every edge is a visible curve, a trained KAN can be interpreted in ways an MLP cannot:

Visualize: plot each edge function and literally see what it learned — a sine here, a log there.
Prune: edges whose function is essentially flat (near zero everywhere) contribute nothing; delete them. The network shrinks to the connections that matter, revealing structure.
Symbolic regression: if a learned edge looks like a known function, snap it to that function (sine, exponential, square). Do this across the pruned network and you can read off a closed-form formula — the model becomes an equation.

For scientific discovery this is transformative: instead of a black box that predicts, you get a candidate law you can inspect, test, and publish. KANs have been used to rediscover physics relations and knot-theory formulas this way.
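One simple way to picture pruning and snapping: sample an edge’s curve, drop it if it is flat, and otherwise find the library function that explains it best. Everything specific here is an assumption of the sketch: the candidate list, the flatness tolerance, the function names, and the “learned” edge itself, which is synthetic (a scaled sine plus a small wobble) and not output from a trained KAN. The sketch fits only an output scale and offset; a fuller version might also fit a scale and shift on the input.

```python
import math

def linear_fit_r2(ys, fs):
    """Least-squares fit ys ~ a*fs + b; returns (a, b, R^2)."""
    n = len(ys)
    mean_y, mean_f = sum(ys) / n, sum(fs) / n
    var_f = sum((f - mean_f) ** 2 for f in fs)
    cov = sum((f - mean_f) * (y - mean_y) for f, y in zip(fs, ys))
    a = cov / var_f
    b = mean_y - a * mean_f
    ss_res = sum((y - (a * f + b)) ** 2 for f, y in zip(fs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

CANDIDATES = {"sin": math.sin, "exp": math.exp, "square": lambda x: x * x}

def is_prunable(ys, tol=1e-2):
    """Crude flatness test: the edge outputs (almost) zero everywhere."""
    return max(abs(y) for y in ys) < tol

def snap(xs, ys):
    """Return (name, a, b, r2) for the candidate with the highest R^2."""
    best = None
    for name, f in CANDIDATES.items():
        a, b, r2 = linear_fit_r2(ys, [f(x) for x in xs])
        if best is None or r2 > best[3]:
            best = (name, a, b, r2)
    return best

xs = [-3.0 + 0.1 * i for i in range(61)]
messy_edge = [2.0 * math.sin(x) + 0.5 + 0.02 * math.cos(7 * x) for x in xs]
dead_edge = [0.001 * math.sin(x) for x in xs]

name, a, b, r2 = snap(xs, messy_edge)    # picks "sin" with a near 2 and b near 0.5
```

After snapping, the edge would be written down as roughly 2·sin(x) + 0.5, while `dead_edge` passes `is_prunable` and would simply be deleted.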

Snap a learned edge to a known function

The orange curve is a messy learned edge. Press “snap” and it locks to the closest clean function (here, a sine) — turning a fuzzy spline into a symbolic term you can write down.

Why can a KAN be turned into a closed-form formula while an MLP usually cannot?

KANs use integers for weights
Each edge is a visible 1-D function you can prune and snap to a known symbolic function (sine, exp, …)
KANs have no nonlinearities to express

Chapter 7: Fitting a Function, Live (showcase)

Let’s fit a target function with a tiny KAN and watch the edge curves form. Pick a target, press train, and the spline coefficients adjust until the network’s output traces the target. Increase the grid for finer detail; the readout shows the fit error and the parameter count — compare it to how many an MLP would need.

Train a KAN edge to match a target

Teal = target function, orange = the KAN’s current fit. Press Train and the spline coefficients descend toward the target. Switch targets and grids; watch how few parameters a KAN needs to capture a clean function.

target: sine

grid points: 7

On clean, structured targets a handful of spline coefficients capture the function to high accuracy, and you can see the answer in the curve. That combination of parameter efficiency and legibility, on the right problems, is the whole pitch.
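A fit like this can be reproduced with plain gradient descent on the bump heights. The sketch below is not the code behind the simulation: it assumes piecewise-linear (order-1) bumps to stay short, drops the SiLU base term, and uses a learning rate and step count chosen by hand. All names are hypothetical.

```python
import math

def hat_basis(x, knots):
    """All piecewise-linear bumps at x; each peaks at its knot (uniform spacing)."""
    h = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(x - t) / h) for t in knots]

def predict(x, coeffs, knots):
    return sum(c * b for c, b in zip(coeffs, hat_basis(x, knots)))

def mse(coeffs, knots, xs, ys):
    return sum((predict(x, coeffs, knots) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(knots, xs, ys, steps=300, lr=2.0):
    """Gradient descent on mean squared error; only the bump heights move."""
    coeffs = [0.0] * len(knots)
    for _ in range(steps):
        grad = [0.0] * len(coeffs)
        for x, y in zip(xs, ys):
            basis = hat_basis(x, knots)
            err = sum(c * b for c, b in zip(coeffs, basis)) - y
            for i, b in enumerate(basis):
                grad[i] += 2.0 * err * b / len(xs)
        coeffs = [c - lr * g for c, g in zip(coeffs, grad)]
    return coeffs

xs = [-3.0 + 6.0 * i / 120 for i in range(121)]
ys = [math.sin(x) for x in xs]                      # target: sine

coarse_knots = [-3.0 + 1.0 * i for i in range(7)]   # 7 grid points
fine_knots = [-3.0 + 0.5 * i for i in range(13)]    # 13 grid points

coarse_mse = mse(train(coarse_knots, xs, ys), coarse_knots, xs, ys)
fine_mse = mse(train(fine_knots, xs, ys), fine_knots, xs, ys)
```

Seven heights already follow the sine closely, and thirteen follow it more closely still. The error shrinks as the grid grows but does not reach zero, because a spline approximates the target.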

Chapter 8: Trade-offs & Where KANs Fit

KANs are exciting but young, and honesty about their costs matters.

Slow training: evaluating splines on every edge is far more expensive than a matrix multiply, and it’s less friendly to GPUs (which love big dense matmuls). KANs train slower than MLPs of similar size.
Scaling is unproven: MLPs and transformers scale to billions of parameters with known recipes. KANs at that scale are an open question; most wins are on small-to-medium scientific tasks.
Parameter efficiency, situational: on clean, low-dimensional, structured functions KANs match MLP accuracy with fewer parameters and give you interpretability. On messy high-dimensional data (images, language) MLPs/transformers still dominate.

The right niche: reach for a KAN when the goal is understanding — symbolic regression, scientific law discovery, low-dimensional function fitting where a readable, compact model is worth more than raw throughput. Reach for an MLP/transformer when the goal is scale and raw predictive performance on high-dimensional data.

Accuracy per parameter: KAN vs. MLP (on a clean function)

For a structured target, the KAN (teal) reaches low error with few parameters; the MLP (orange) needs more to match. Drag the parameter budget. (On messy high-dim data the curves often flip — niche matters.)

parameter budget: 0.40

For which task is a KAN most clearly the right tool?

Training a billion-parameter language model
Discovering a compact symbolic formula behind low-dimensional scientific data
Real-time high-resolution image classification at scale

Chapter 9: Cheat Sheet & Connections

MLP

learnable weights on edges + fixed activation on nodes (universal approximation)

↓ the flip

KAN

learnable 1-D functions on edges + sum on nodes (Kolmogorov–Arnold theorem)

↓ how the function is stored

B-spline edges

sum of local basis bumps with learnable heights, + SiLU base; grid sets resolution

↓ refine & read

grid extension + symbolic

add grid points for accuracy; prune + snap edges to known functions → a formula

	MLP	KAN
learnable part	weights (numbers)	edge functions (splines)
nodes	fixed activation	sum
theory	universal approximation	Kolmogorov–Arnold
interpretable?	hard	yes (visualize/prune/symbolic)
scales to billions?	yes	unproven
best at	high-dim, raw scale	low-dim, structured, science

Keep exploring

→ Activation Functions — the fixed node nonlinearities KANs replace
→ Initialization — why parameter structure matters for training
→ Neural ODEs — another rethink of what a “layer” is
→ Embedding Layers — more on what networks actually store

“What I cannot create, I do not understand.” (Richard Feynman) You just rebuilt the KAN from one move: take the MLP, move the nonlinearity from the nodes to the edges, and make it learnable by replacing each weight with a curve. The reward is a network whose every connection you can read, refine, and turn into an equation.