AI Architectures

Neural ODEs

What if a network had not 50 layers, but infinitely many infinitely thin ones? A residual block is one step of solving a differential equation — take the step size to zero and depth becomes a continuous flow you hand to an ODE solver.

Prerequisites: A residual layer computes h + f(h). A derivative is a rate of change. That’s it.

Chapter 0: Infinite Depth

A ResNet works by adding a small correction at each layer: the new hidden state is the old one plus whatever the layer computes. Stack 50 of these and the data takes 50 little steps from input to output. Each step nudges the representation a bit closer to something useful.

Now ask a strange question. What if instead of 50 steps we took 500 tiny ones? Or 5,000 even tinier ones? In the limit — infinitely many, infinitely small steps — the discrete staircase of layers becomes a smooth curve. The hidden state stops jumping from layer to layer and instead flows continuously. That continuous flow is described by a differential equation, and a network defined this way is a Neural ODE (Chen et al., 2018).

The reframe: “depth” stops being a count of layers and becomes a length of time to flow. You no longer choose “50 layers”; you choose “flow from time 0 to time 1” and hand the problem to an ODE solver, which decides how many tiny steps it actually needs. Depth becomes adaptive, continuous, and — as we’ll see — almost free in memory.

From discrete layers to a continuous curve

The orange staircase is a ResNet: each step is one layer’s update. Add more layers (smaller steps) and the staircase converges to the smooth teal curve — the solution of a differential equation.

number of layers (steps): 4

In a Neural ODE, what does “depth” become?

A fixed count of weight matrices, just more of them
A continuous length of integration time the solver flows over, with the step count chosen adaptively
The number of training epochs

Chapter 1: ResNet = Euler’s Method

Here is the bridge that started it all. A residual update looks like this: next hidden state equals current state plus the layer’s function of the current state. Now write the simplest way to solve a differential equation numerically — Euler’s method: the next value equals the current value plus a small step size times the derivative.

ResNet: h_next = h + f(h)
Euler: h_next = h + Δt · f(h)

They are the same equation. A ResNet is Euler’s method with the step size baked in as 1, where f — the residual block — plays the role of the derivative of the hidden state. So a ResNet isn’t just like solving an ODE; it literally is a crude ODE solve, taking one clumsy step per layer.
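To make the limit explicit, here is the short derivation behind that claim. It writes the residual update with an explicit step size Δt (a notational assumption; a standard ResNet fixes Δt = 1) and then lets the step shrink:

```latex
h(t + \Delta t) = h(t) + \Delta t \, f(h(t))
\;\Longrightarrow\;
\frac{h(t + \Delta t) - h(t)}{\Delta t} = f(h(t))
\;\xrightarrow{\;\Delta t \to 0\;}\;
\frac{dh}{dt} = f(h(t))
```

The left-hand side is the Euler update from above; the right-hand side is the ODE a Neural ODE solves.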

Worked example by hand

Let the derivative be f(h) = 0.5·h, start at h = 1.0, and integrate from t=0 to t=1. Take it in two Euler steps (step size 0.5):

step	h	f(h)=0.5h	h + 0.5·f(h)
0→0.5	1.000	0.500	1.000 + 0.5·0.5 = 1.250
0.5→1.0	1.250	0.625	1.250 + 0.5·0.625 = 1.563

Two steps give 1.563. The true answer (this ODE has solution h(t) = e^(0.5t)) is e^0.5 ≈ 1.649. We’re close but low: the true curve bends upward, and each Euler step follows the slope from the start of the step, so big steps undershoot. Take 10 steps and you’d get 1.629; take 1000 and you’d get 1.649 to three decimals. More layers = smaller step = less error. The Neural ODE just takes this to the continuous limit and lets a smart solver pick the steps.
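The hand calculation above can be reproduced in a few lines. This is a minimal sketch in plain Python; the function names `f` and `euler` are illustrative choices, not part of any library:

```python
import math

def f(h):
    # Derivative from the worked example: dh/dt = 0.5 * h
    return 0.5 * h

def euler(f, h0, t0, t1, n_steps):
    """Integrate dh/dt = f(h) from t0 to t1 with n_steps equal Euler steps."""
    dt = (t1 - t0) / n_steps
    h = h0
    for _ in range(n_steps):
        h = h + dt * f(h)  # same form as a residual update, scaled by dt
    return h

# More steps (more "layers") bring the result closer to the exact value.
for n in (2, 10, 1000):
    print(n, round(euler(f, 1.0, 0.0, 1.0, n), 4))
print("exact", round(math.exp(0.5), 4))
```

Each loop iteration is one "layer": the only difference from a residual block is the explicit `dt` factor.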

Euler steps vs. the true curve

The teal curve is the exact solution; the orange path is Euler’s method with your chosen step size. Big steps drift off the curve (that’s a shallow ResNet’s approximation error); small steps hug it.

step size Δt: 0.50

In the ResNet–ODE correspondence, the residual block f(h) plays the role of:

the integration bounds
the derivative dh/dt, the instantaneous rate of change of the hidden state
the loss function

Chapter 2: The Vector Field

If f is the derivative of the hidden state, then f is a vector field: at every point in the state space, it tells you which direction to move and how fast. The hidden state is a particle dropped into this field, drifting along wherever the arrows point. Running the network = letting the particle flow for a fixed amount of time.

This is a genuinely different mental picture from stacked layers. The network is the field. The same field is applied continuously the whole way through (in the simplest version, f uses one shared set of weights for all time) — what changes is where the particle has drifted to. Learning means shaping the field so that particles starting at different inputs flow to the right outputs.

Concept → realization: a classifier’s job is to push points of class A one way and class B another. With a Neural ODE, you learn a flow field that untangles the classes: two interleaved spirals get smoothly straightened until a single line separates them. Trajectories that start at different points are guaranteed never to cross (solutions of a well-behaved ODE are unique), which is both a strength and a limit we’ll revisit.
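To make the particle picture concrete, here is a small sketch that drops one point into a hand-written field and lets it drift. The field below (rotate while pulling inward) is a hypothetical stand-in for a learned f, chosen because its exact flow is known; `field` and `flow` are illustrative names:

```python
import math

def field(x, y):
    """Hypothetical 'swirl' field: rotate counter-clockwise while pulling inward."""
    return (-x - y, x - y)

def flow(x, y, T, n_steps=10000):
    """Let a particle drift along the field for time T using small Euler steps."""
    dt = T / n_steps
    for _ in range(n_steps):
        vx, vy = field(x, y)
        x, y = x + dt * vx, y + dt * vy
    return x, y

x1, y1 = flow(1.0, 0.0, T=1.0)
# Closed form for this particular linear field: shrink by e^-T, rotate by T radians.
exact = (math.exp(-1.0) * math.cos(1.0), math.exp(-1.0) * math.sin(1.0))
print((x1, y1), exact)
```

"Running the network" is the call to `flow`: the field stays fixed and only the particle's position changes.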

Drop a particle into the field

Arrows show the learned velocity field f. Click anywhere to drop a particle — watch it flow along the field. That path is the hidden state moving through “depth.” Change the field to see different flows.

field type: swirl

Thinking of f as a vector field, what does “running the network” correspond to?

Randomly resampling the weights
Letting the input point flow along the field for a fixed amount of integration time
Multiplying the input by the field once

Chapter 3: The Solver Does the Layers

In a normal network, you decide the number of layers. In a Neural ODE, you hand the differential equation to a black-box ODE solver and it decides how many steps to take. Modern solvers (like Dormand–Prince) are adaptive: they estimate their own error at each step and shrink the step where the field changes fast, grow it where the field is calm — spending computation exactly where it’s needed.

The count of times the solver calls f is the number of function evaluations, or NFE — the continuous analogue of “how many layers.” You don’t set it; it emerges from the dynamics and your chosen error tolerance. Loosen the tolerance and the solver takes big, cheap, sloppy steps; tighten it and it takes many small, accurate ones.

Why this is powerful: the same trained model can be run fast and approximate (loose tolerance, few NFE) or slow and precise (tight tolerance, many NFE) at inference — without retraining. You dial accuracy against compute after the fact, something a fixed-layer network can’t do.
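The link between tolerance and NFE can be sketched with a tiny adaptive solver. This is not Dormand–Prince; it is a simpler embedded Euler/Heun pair, used here only as an assumption-laden illustration of the same accept/reject idea. The name `adaptive_solve` and the step-size constants (0.9, 0.2, 2.0) are illustrative choices:

```python
import math

def adaptive_solve(f, h0, t0, t1, tol):
    """Adaptive Heun-Euler solve of dh/dt = f(h, t). Returns (h(t1), NFE)."""
    t, h, nfe = t0, h0, 0
    dt = (t1 - t0) / 10              # initial guess for the step size
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)         # never step past the end time
        k1 = f(h, t)
        k2 = f(h + dt * k1, t + dt)
        nfe += 2                     # two calls to f per attempted step
        # Local error estimate: gap between the Euler and Heun updates.
        err = abs(dt * (k2 - k1) / 2)
        if err <= tol:               # accept the step (Heun update)
            t += dt
            h += dt * (k1 + k2) / 2
        # Grow the step when the error is small, shrink it when large.
        dt *= min(2.0, max(0.2, 0.9 * math.sqrt(tol / (err + 1e-16))))
    return h, nfe

f = lambda h, t: 0.5 * h             # same dynamics as the Chapter 1 example
for tol in (1e-2, 1e-4, 1e-6):
    h1, nfe = adaptive_solve(f, 1.0, 0.0, 1.0, tol)
    print(f"tol={tol:g}  NFE={nfe}  error={abs(h1 - math.exp(0.5)):.2e}")
```

The dynamics `f` never change across the three runs; only the tolerance does, and the NFE count follows from it.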

Tolerance sets the step count

The solver places steps (dots) along the trajectory — denser where the field bends sharply. Tighten the tolerance and watch the solver add evaluations to stay accurate; loosen it and it coasts on a few big steps.

error tolerance (loose → tight): 0.50

In a Neural ODE, the “number of function evaluations” (NFE) is:

fixed by the architecture before training
chosen adaptively by the solver from the dynamics and error tolerance; the continuous analogue of layer count
always equal to the batch size

Chapter 4: Adaptive Computation

Because the solver chooses its own steps, a Neural ODE spends more compute on harder inputs automatically, with no extra machinery. An input whose trajectory passes through a calm region of the field resolves in a few steps; one that threads a sharply curving region forces the solver to take many small steps to stay accurate. The model literally thinks longer about hard examples.

This is a property fixed-depth networks simply don’t have: a 50-layer ResNet does exactly 50 layers of work whether the input is trivial or brutal. The Neural ODE’s NFE rises and falls with difficulty, an early and elegant form of adaptive computation.
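A toy illustration of input-dependent effort, under stated assumptions: the 1D field sin(h²) is a hypothetical stand-in for a learned field (calm near h = 0, rapidly varying for larger h), and the solver is a simple embedded Euler/Heun pair rather than a production method:

```python
import math

def adaptive_nfe(f, h0, t0, t1, tol):
    """Adaptive Heun-Euler solve of dh/dt = f(h); returns the number of calls to f."""
    t, h, nfe = t0, h0, 0
    dt = (t1 - t0) / 10
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)
        k1 = f(h)
        k2 = f(h + dt * k1)
        nfe += 2
        err = abs(dt * (k2 - k1) / 2)    # Euler-vs-Heun error estimate
        if err <= tol:                   # accept the step
            t += dt
            h += dt * (k1 + k2) / 2
        dt *= min(2.0, max(0.2, 0.9 * math.sqrt(tol / (err + 1e-16))))
    return nfe

f = lambda h: math.sin(h * h)            # hypothetical field: calm near 0, busy far out

nfe_easy = adaptive_nfe(f, 0.1, 0.0, 1.0, tol=1e-4)   # starts in the calm region
nfe_hard = adaptive_nfe(f, 5.0, 0.0, 1.0, tol=1e-4)   # starts where f changes quickly
print("easy input NFE:", nfe_easy)
print("hard input NFE:", nfe_hard)
```

Both calls use the same field and the same tolerance; the difference in NFE comes only from where the trajectory travels.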

Common misconception: “more function evaluations always means a better model.” NFE measures effort, not quality. A well-behaved (non-stiff) field solves any input in few steps and is great; a badly-conditioned field can blow NFE sky-high without improving accuracy — just burning compute. Part of training a good Neural ODE is keeping the field smooth so NFE stays low.

Easy vs. hard inputs cost different effort

Two particles start in different regions. The one in the calm zone needs few steps (low NFE); the one crossing the turbulent zone forces many. Same model, different compute — automatically.

field turbulence: 0.50

Why can a Neural ODE spend more compute on a harder input automatically?

It adds more layers to the architecture mid-inference
It retrains on the hard input
The adaptive solver takes more small steps where the trajectory passes through sharply-changing field regions

Chapter 5: The Adjoint Method — gradients with O(1) memory

Training needs gradients. The naive approach: record every operation the solver performed and backpropagate through all of them. But an adaptive solver might take hundreds of steps — storing every intermediate state is as memory-hungry as a hundred-layer network. The Neural ODE’s signature trick avoids this entirely: the adjoint sensitivity method.

The idea: to get gradients, solve a second ODE backward in time, from the output back to the input. This backward ODE tracks how sensitive the loss is to the state at each moment (the “adjoint”). Crucially, you don’t need to have stored the forward trajectory: starting from the final state, you re-integrate the hidden state backward alongside the adjoint, reconstructing what you need on the fly. Memory cost becomes constant in depth: O(1), whether the solver took 10 steps or 10,000.

Why it matters: a normal deep net’s memory grows with depth, because backprop must cache every layer’s activations. The adjoint method breaks that link — you can have effectively “infinite” depth at the memory cost of a single layer. The price is extra compute (you solve an ODE backward) and some numerical care, but the memory savings are dramatic.
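Here is a scalar sketch of the adjoint idea, small enough to check against a closed form. The assumptions are loud ones: the dynamics are dh/dt = θ·h with a single parameter θ, the loss is simply L = h(T), and a fixed-step RK4 routine (`rk4`, an illustrative helper) stands in for an adaptive solver. The backward pass integrates the state h, the adjoint a = dL/dh, and the parameter gradient g together from T back to 0:

```python
import math

def rk4(deriv, y, t0, t1, steps):
    """Fixed-step RK4 on a list-valued state; also works backward (t1 < t0)."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = deriv(t, y)
        k2 = deriv(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k1)])
        k3 = deriv(t + dt / 2, [yi + dt / 2 * k for yi, k in zip(y, k2)])
        k4 = deriv(t + dt, [yi + dt * k for yi, k in zip(y, k3)])
        y = [yi + dt / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += dt
    return y

theta, h0, T = 0.5, 1.0, 1.0

# Forward pass: keep only the final state, not the trajectory.
(hT,) = rk4(lambda t, y: [theta * y[0]], [h0], 0.0, T, 100)

def augmented(t, y):
    h, a, g = y
    return [theta * h,     # dh/dt = f(h)            (re-traces the state)
            -a * theta,    # da/dt = -a * df/dh      (adjoint equation)
            -a * h]        # dg/dt = -a * df/dtheta  (accumulates dL/dtheta)

# Backward pass from T to 0. Since L = h(T), the adjoint starts at dL/dh(T) = 1.
h_back, a0, grad_theta = rk4(augmented, [hT, 1.0, 0.0], T, 0.0, 100)

print("dL/dtheta:", grad_theta, " closed form:", h0 * T * math.exp(theta * T))
print("dL/dh0:   ", a0, " closed form:", math.exp(theta * T))
```

At every point in the backward solve only the three numbers `[h, a, g]` are held in memory, regardless of how many steps are taken.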

Forward solve, then adjoint backward

Top: backprop-through-solver stores every step (memory grows with depth, orange bars pile up). Bottom: the adjoint solves backward, holding only the current state (flat teal memory). Drag the step count and watch only the orange bars grow.

solver steps (depth): 12

The adjoint method computes gradients by:

storing every forward step and backpropagating through them
ignoring the solver and using finite differences
solving a second ODE backward in time, giving O(1) memory in depth

Chapter 6: Density Flow — continuous normalizing flows

Here’s where Neural ODEs become a generative tool. Suppose you flow not a single point but a whole cloud of points — a probability distribution. As the cloud flows along the field, it stretches and compresses. If you start from a simple distribution (a Gaussian blob) and flow it through a learned field, you can morph it into a complex data distribution. This is a continuous normalizing flow (CNF).

The magic is that you can track exactly how the probability density changes as it flows — the rate of change of (log) density is governed by how much the field is locally expanding or contracting (its divergence). So you get a generative model with exact likelihoods: sample a Gaussian, flow it forward to generate data; flow data backward to score its probability. No fixed architecture of invertible layers needed — the ODE is invertible by running it in reverse.
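The density-tracking rule can be checked on a case with a known answer. In symbols the rule is d(log p)/dt = −div f along the trajectory. The sketch below assumes a 1D linear field f(z) = a·z (a hypothetical stand-in for a learned field), whose divergence is the constant a, so a standard Gaussian flows into a Gaussian with standard deviation e^(aT). The names `log_normal` and `flow_with_density` are illustrative:

```python
import math

def log_normal(z, std=1.0):
    """Log-density of a zero-mean Gaussian with the given standard deviation."""
    return -0.5 * (z / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def flow_with_density(z0, a, T, n_steps=1000):
    """Flow z along f(z) = a*z while tracking log p via d(log p)/dt = -div f."""
    z, logp = z0, log_normal(z0)        # start from a standard Gaussian sample
    dt = T / n_steps
    for _ in range(n_steps):
        z_mid = z + 0.5 * dt * a * z    # midpoint step for the position
        z = z + dt * a * z_mid
        logp = logp - dt * a            # divergence of f(z) = a*z is just a
    return z, logp

a, T, z0 = 0.7, 1.0, 0.3
z1, logp1 = flow_with_density(z0, a, T)
exact = log_normal(z1, std=math.exp(a * T))   # the flowed Gaussian has std e^(aT)
print("tracked log p:", logp1, " closed form:", exact)
```

Because a > 0 the field expands space, so the density at the moving point falls, which is what the minus sign encodes.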

The big lineage: this is the conceptual ancestor of flow matching and a cousin of diffusion models. All three move a simple distribution to a complex one along a learned, continuous path. Neural ODEs gave the math (a learned velocity field + an ODE solve); flow matching made it cheap to train. If you’ve studied diffusion, you’ve been using a descendant of this idea.

Morphing a Gaussian into data

A cloud of points starts as a simple blob (left) and flows along the field, stretching into a target shape (right). Drag the flow time to watch the distribution transform continuously — that’s a continuous normalizing flow generating data.

flow time: 0.00

A continuous normalizing flow generates data by:

sampling pixels independently from a uniform distribution
classifying noise into categories
flowing a simple distribution along a learned ODE field until it becomes the data distribution

Chapter 7: A Neural ODE Classifier, Live (showcase)

Let’s watch a Neural ODE untangle two interleaved classes. The points start mixed; the learned field flows them apart over integration time until a straight line can separate them. You control how long to flow, the field strength, and the solver tolerance — and the readout shows the function evaluations (NFE) the solver needed.

Flow-based classification

Two classes (orange, teal) start interleaved. Press Flow to integrate the field forward; watch the classes separate as integration time grows. Increase tolerance for a coarse, cheap solve; decrease it for a fine, expensive one. The readout tracks NFE and separation quality.

integration time T: 0.00

field strength: 1.5

tolerance (loose → tight): 0.50

Notice the trade-off you control at inference: a looser tolerance gives a few-step, slightly jagged flow that still separates the classes; a tight tolerance gives a silky-smooth flow at many times the NFE. The model didn’t change — only how carefully you solved it. That run-time accuracy dial is the Neural ODE’s signature gift.

Chapter 8: Trade-offs & Where They Shine

Neural ODEs are elegant but not free. Knowing their costs tells you when to reach for them.

The costs

Speed: an adaptive solver may call f hundreds of times — often slower than a fixed-depth net of comparable accuracy.
Stiffness: if training drives the field to change very fast, the equation becomes “stiff” and NFE explodes; people add regularizers to keep the field gentle.
Expressivity limit: because trajectories can’t cross, a plain Neural ODE can’t represent some mappings — the fix is to add extra dimensions (“augmented” Neural ODEs) so trajectories can route around each other.
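The expressivity limit and its fix can be seen on the smallest possible case: the map h → −h. On a line, the paths from +1 and −1 would have to pass through each other, which a flow cannot do. The sketch below assumes one extra coordinate initialised to 0 and a hand-written rotation field (a hypothetical stand-in for a learned one); `rotate_flow` is an illustrative name:

```python
import math

def rotate_flow(x, y, T, n_steps=2000):
    """RK4 flow along the rotation field f(x, y) = (-y, x) for time T."""
    f = lambda x, y: (-y, x)
    dt = T / n_steps
    for _ in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + dt / 2 * k1[0], y + dt / 2 * k1[1])
        k3 = f(x + dt / 2 * k2[0], y + dt / 2 * k2[1])
        k4 = f(x + dt * k3[0], y + dt * k3[1])
        x += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        y += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return x, y

# Augment each 1D input h with an extra coordinate 0, flow in 2D for time pi,
# then read off the first coordinate. The two paths go around each other.
for h in (1.0, -1.0):
    x, y = rotate_flow(h, 0.0, T=math.pi)
    print(h, "->", round(x, 6))
```

In the plane the two trajectories are opposite half-circles, so they swap places without ever meeting.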

Where they shine

Irregular time series: medical records, sensor logs sampled at uneven times. Because the model is continuous in time, you just integrate to whatever timestamp you have — Latent ODEs and ODE-RNNs handle missing/irregular data gracefully where discrete RNNs struggle.
Memory-bound training: the O(1)-memory adjoint enables very “deep” models on limited hardware.
Generative modeling: continuous normalizing flows, and the whole flow-matching / diffusion lineage built on continuous-time dynamics.
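The irregular-time-series point can be sketched directly: with continuous dynamics, reaching an arbitrary timestamp is just integrating a bit further. This is a minimal illustration, not a Latent ODE; the dynamics dh/dt = −h are a hypothetical stand-in for a learned function, and `rk4_step` and `states_at` are illustrative names:

```python
import math

def rk4_step(f, h, t, dt):
    """One classical RK4 step of dh/dt = f(h, t)."""
    k1 = f(h, t)
    k2 = f(h + dt / 2 * k1, t + dt / 2)
    k3 = f(h + dt / 2 * k2, t + dt / 2)
    k4 = f(h + dt * k3, t + dt)
    return h + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def states_at(f, h0, times, substeps=20):
    """Return the state at each (possibly irregular) timestamp in `times`."""
    h, t, out = h0, times[0], [h0]
    for t_next in times[1:]:
        dt = (t_next - t) / substeps     # the gap can be any length
        for i in range(substeps):
            h = rk4_step(f, h, t + i * dt, dt)
        t = t_next
        out.append(h)
    return out

f = lambda h, t: -h                      # stand-in for a learned dynamics function
times = [0.0, 0.1, 0.7, 0.75, 2.0]       # unevenly spaced observation times
for t, h in zip(times, states_at(f, 1.0, times)):
    print(f"t={t:<5} h={h:.5f}  exact={math.exp(-t):.5f}")
```

Nothing in `states_at` assumes evenly spaced samples, which is the property a discrete-step RNN lacks.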

Accuracy vs. compute, dialed at inference

For one trained model, tightening tolerance trades NFE (compute, orange) for solution accuracy (teal). A fixed-depth net is a single point; the Neural ODE is the whole curve — you choose where to sit, after training.

tolerance (loose → tight): 0.50

Why does a plain Neural ODE sometimes need to be “augmented” with extra dimensions?

To make training faster
Because ODE trajectories can’t cross, so extra dimensions let paths route around each other to represent more mappings
To add more output classes

Chapter 9: Cheat Sheet & Connections

ResNet

h ← h + f(h): one Euler step; f = derivative dh/dt

↓ step size → 0

Neural ODE

dh/dt = f(h,t;θ); forward = solve from t0 to t1 with an adaptive solver

↓ gradients

adjoint method

solve a backward ODE → O(1) memory in depth

↓ flow a distribution

continuous normalizing flow

morph Gaussian → data; exact likelihood; ancestor of flow matching

Model	Depth	Memory in depth	Adaptive compute?
ResNet	fixed layers	O(depth)	no
Neural ODE	continuous (solver picks steps)	O(1) via adjoint	yes
RNN	fixed per step, discrete time	O(time)	no
Latent ODE	continuous time	O(1) via adjoint	yes (irregular series)

Keep exploring

→ Flow Matching — the cheap way to train these continuous flows
→ Diffusion — the cousin that moves noise to data over time
→ Skip Connections — the ResNet that started the analogy
→ SSM & Mamba — another continuous-time dynamical view of sequences

“What I cannot create, I do not understand.” You just rebuilt the Neural ODE from a single observation — a residual layer is one Euler step — and followed it to its conclusion: infinite continuous depth, a learned velocity field, an adaptive solver that thinks longer about hard inputs, O(1)-memory gradients, and a generative engine that flows noise into data.