A different escape from quadratic attention — not recurrence, but a convolution as long as the whole sequence, generated on the fly and computed with the Fourier transform.
We've seen one family escape attention's quadratic cost through recurrence — RWKV, Mamba, RetNet, xLSTM all carry a fixed-size state that updates per token. Hyena (Poli et al., 2023) takes a completely different road to the same destination: convolution. Not the short convolutions of a CNN that see only nearby pixels — a convolution with a filter as long as the entire sequence, so it mixes every token with every other, just like attention.
Think about what a convolution does: it slides a filter over the input, and each output position is a weighted blend of nearby inputs, with the filter giving the weights. A short filter (say 3 wide) blends only immediate neighbors — local context, what CNNs do. But make the filter as long as the sequence, and each output position blends all the inputs, weighted by the filter. That “every output depends on every input” is exactly the global mixing attention provides. A long-enough convolution is a form of global token mixing.
So why isn't this obvious and old? Two problems blocked it, and Hyena's contribution is solving both. First, a filter as long as the sequence has as many parameters as the sequence is long — too many, and you can't fix the length in advance. Second, naively convolving with a length-n filter costs n-squared — right back to the quadratic wall. Hyena fixes the first with implicit filters (generate the long filter from a tiny network) and the second with the Fast Fourier Transform (compute the convolution in n-log-n time). Add data-controlled gating for the data-dependence attention has, and you have an attention alternative built from convolutions.
It's worth holding both roads in mind. The recurrent family processes the sequence left-to-right, carrying a state. Hyena processes the whole sequence at once with a global convolution — inherently parallel, no sequential scan. (Intriguingly, the two are deeply related: a linear recurrence is a particular long convolution, a connection we'll return to with state-space models.) For now, the key contrast: Hyena is a parallel, FFT-based convolution, not a token-by-token recurrence.
The widget contrasts the approaches. Quadratic attention mixes all pairs directly (expensive). The recurrent road carries a state along the sequence. The Hyena road applies one long convolution across the whole sequence at once. Click each to see how it mixes information and its cost — all three achieve global mixing, by very different means.
Click each approach to see how it mixes tokens and its cost. Attention: all-pairs (n²). Recurrence: a state along the sequence (n). Hyena: one long convolution, FFT-computed (n log n).
Let's make sure convolution is crystal clear, because everything builds on it. A convolution takes an input signal and a filter (also called a kernel) — a short list of weights — and slides the filter along the input. At each position, it multiplies the overlapping values and sums them, producing one output value. Slide one step, repeat. The output at each position is a weighted blend of the input values the filter currently covers.
The crucial property: a filter of length L lets each output position “see” L input positions. A length-3 filter blends a value with its two neighbors — tiny, local reach. A length-100 filter blends a value with 99 of its neighbors — far wider reach. The filter length is the receptive field. CNNs use short filters and stack many layers to slowly grow their reach; Hyena's shortcut is to use one filter that's already as long as the sequence, getting global reach in a single operation.
Input signal [2, 0, 1, 3], filter [1, 0.5] (length 2). Slide the filter and compute each output as input-times-filter, summed over the overlap (here, current value × 1 plus previous value × 0.5):
| position | computation | output |
|---|---|---|
| 0 | 2×1 (no previous) | 2.0 |
| 1 | 0×1 + 2×0.5 | 1.0 |
| 2 | 1×1 + 0×0.5 | 1.0 |
| 3 | 3×1 + 1×0.5 | 3.5 |
Each output mixes the current value with a fraction of the previous one — this short filter has a reach of 2. To make each output depend on values 50 steps back, you'd need a filter of length 51. And to make every output depend on every input (global mixing, like attention), the filter must be as long as the whole sequence. That's the long convolution — and the next chapters are about making such a giant filter affordable.
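The table above can be reproduced in a few lines. This is a minimal sketch of a causal convolution (each output sees only the current and earlier inputs); the function name `causal_conv` is ours, chosen for illustration.

```python
def causal_conv(signal, kernel):
    """Causal convolution: output[t] = sum over k of kernel[k] * signal[t - k]."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # positions before the start contribute nothing
                total += weight * signal[t - k]
        out.append(total)
    return out

result = causal_conv([2, 0, 1, 3], [1, 0.5])
print(result)  # [2.0, 1.0, 1.0, 3.5]
```

Passing a kernel as long as the signal turns the same loop into the global mixing described above, at the price of the nested loops' n-squared cost.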
The widget convolves a signal with a filter whose length you control. At length 3 the output is a lightly-smoothed version of the input (local mixing). Crank the length up and watch the output become a broad blend of the whole signal — each position now influenced by far-away values. That growing reach is the path from local CNN to global, attention-like mixing.
A signal (gray) convolved with a filter of adjustable length (teal output). Short = local smoothing; long = global blend. The filter length is the receptive field.
We want a filter as long as the sequence, for global reach. But a normal convolution filter stores one weight per filter position. So a filter long enough to span a 100,000-token sequence would need 100,000 parameters — just for one filter, in one layer. That's the first wall blocking long convolutions, and it has two distinct problems.
Storing a weight per position means the parameter count grows with the sequence length. A short CNN filter (length 3) has 3 weights — trivial. A sequence-length filter has as many weights as the sequence is long. For long-context models that's enormous — and most of those weights would be poorly trained, since each only affects one specific distance. You'd have a bloated, hard-to-train layer.
Worse, an explicit filter has a fixed length that you must choose when the model is created. Train a model with a length-2048 filter and it can never mix globally over a 4096-token sequence: the filter isn't long enough, and there are no weights for distances beyond 2,047 steps back. The model's reach is hard-coded by its parameter count. That rigidity is fatal for a general long-context model, where you want to handle any length.
Consider one layer's filter at different reaches. A length-3 filter: 3 parameters. Length-128: 128 parameters. Length-8192 (a long context): 8,192 parameters — per channel, per layer. With hundreds of channels and dozens of layers, explicit long filters would add hundreds of millions of parameters whose sole job is the convolution kernels, most barely trained. And if you later want length-16384, you're stuck — the filter can't stretch. The parameter cost and the inflexibility both come from the same root: one stored weight per position.
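To put rough numbers on that, here is a back-of-the-envelope count. The model shape (1,024 channels, 32 layers) is a hypothetical stand-in for "hundreds of channels and dozens of layers", not a figure from any particular model.

```python
def explicit_filter_params(filter_length, channels, layers):
    """One stored weight per filter position, per channel, per layer."""
    return filter_length * channels * layers

# Hypothetical model shape, for illustration only.
channels, layers = 1024, 32

for length in (3, 128, 8192):
    total = explicit_filter_params(length, channels, layers)
    print(f"length {length:>5}: {total:>12,} filter parameters")
# length     3:       98,304 filter parameters
# length   128:    4,194,304 filter parameters
# length  8192:  268,435,456 filter parameters
```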
The widget plots the parameter count of an explicit filter against its length. Drag the desired reach: the parameter count climbs linearly with it, and the “max sequence length” is pinned to whatever you chose. There's no way to get long reach cheaply, or to handle a longer sequence than you trained for. This is the wall the implicit filter knocks down.
An explicit filter stores one weight per position, so its parameter count equals its length — and that length caps the sequence it can handle. Drag the reach and watch both rise together.
Here's Hyena's first key trick, and it dissolves the long-filter problem entirely. Instead of storing the filter as a list of weights, generate it from a small network. The filter becomes a function: feed in a position (“what's the filter weight at distance 37?”) and the small network outputs the value. This is an implicit filter — the filter is represented implicitly, by a function, not explicitly, by stored numbers.
The magic of this move: the small generating network has a fixed number of parameters, regardless of how long a filter it produces. Want a length-2048 filter? Call the network for positions 0 through 2047. Want length-100,000? Call it for positions 0 through 99,999. Same network, same parameters — you just evaluate it at more positions. The filter's length is now completely decoupled from its parameter count. Both problems from the last chapter vanish: few parameters (just the small net) and any length (evaluate the net wherever you need).
The generating network takes a positional encoding of the distance (often a set of sinusoids, like a transformer's positional encoding) and passes it through a small MLP to produce the filter weight at that distance. Because it's a smooth function of position, the filter comes out smooth — nearby positions get similar weights — which is a sensible inductive bias for a long filter. Hyena also typically multiplies in an explicit decay window (an exponential that fades with distance), so the long filter naturally emphasizes nearer context while still being able to reach far. The network learns the filter's shape; the decay sets its reach.
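Here is a toy sketch of that recipe: sinusoidal position features, a tiny MLP, and an exponential decay window. Everything specific is an assumption made for illustration (the class name `ImplicitFilter`, the layer sizes, the sine activation, the fixed position scale, and the untrained random weights); it is not Hyena's actual implementation.

```python
import math
import random

class ImplicitFilter:
    """Tiny untrained MLP that maps a position to a filter weight (illustrative)."""

    def __init__(self, n_freqs=4, hidden=8, decay=0.02, seed=0):
        rng = random.Random(seed)
        self.freqs = [2.0 ** i for i in range(n_freqs)]  # sinusoid frequencies
        in_dim = 2 * n_freqs                             # one sin and one cos per frequency
        self.w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.w2 = [rng.gauss(0, 1) / hidden for _ in range(hidden)]
        self.decay = decay

    def num_params(self):
        return sum(len(row) for row in self.w1) + len(self.b1) + len(self.w2)

    def weight_at(self, t, scale=1000.0):
        x = t / scale  # fixed scale (an assumption) so position t always maps to the same weight
        feats = []
        for f in self.freqs:
            feats += [math.sin(f * x), math.cos(f * x)]
        hidden = [math.sin(sum(w * v for w, v in zip(row, feats)) + b)
                  for row, b in zip(self.w1, self.b1)]
        raw = sum(w * h for w, h in zip(self.w2, hidden))
        return raw * math.exp(-self.decay * t)  # decay window fades distant positions

    def generate(self, length):
        return [self.weight_at(t) for t in range(length)]

implicit = ImplicitFilter()
h_short = implicit.generate(2048)
h_long = implicit.generate(8192)
print(len(h_short), len(h_long), implicit.num_params())  # 2048 8192 80
```

The same 80 parameters produce both filters; asking for a longer one just means evaluating the network at more positions.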
The widget shows a small network generating a long filter, position by position. Drag the filter length: the generated filter gets longer, but the parameter count (the small net) stays fixed — watch the params readout not move while the filter stretches. Adjust the decay to see the filter's reach change. This is a long, smooth, expressive filter for almost no parameters — the whole point.
A small network produces the filter weight at each position. Drag the length: the filter stretches but the parameter count stays fixed. The decay sets how far it reaches. Length decoupled from parameters.
We've made the filter cheap (implicit) and long (any length). But there's still the second wall: convolving a length-n signal with a length-n filter, done naively, costs n-squared — for every output position, you sum over the whole filter. That's the quadratic cost we were trying to escape! The fix is one of the most beautiful results in all of computing: the Fast Fourier Transform, which turns a quadratic convolution into an n-log-n one.
Here's the deep fact, called the convolution theorem: convolution in the time domain equals pointwise multiplication in the frequency domain. In plain terms — the slow operation of sliding a filter and summing overlaps (convolution) becomes, after a Fourier transform, the fast operation of just multiplying two lists element by element. Sliding-and-summing is expensive; element-wise multiplying is cheap. The Fourier transform is the bridge between them.
The difference between n-squared and n-log-n is enormous at scale; n-log-n is almost as cheap as linear. At n = 1,000,000, counting operations with a base-2 logarithm and ignoring constant factors: n-squared is a trillion operations; n-log-n is about 20 million, roughly 50,000× cheaper. So Hyena's long convolution, computed via FFT, costs n-log-n: sub-quadratic, nearly linear. It mixes every token with every other (global, attention-like reach) without ever paying the quadratic price, because the FFT lets it skip the all-pairs summation entirely. This is the second key trick, and it's what makes the long convolution actually affordable.
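The arithmetic is easy to check. This sketch counts idealized operations only, assuming a base-2 logarithm and ignoring constant factors and FFT overhead, so treat the ratio as an order-of-magnitude guide rather than a benchmark.

```python
import math

def naive_cost(n):
    """Idealized operation count for sliding a length-n filter over n positions."""
    return n * n

def fft_cost(n):
    """Idealized operation count for an FFT-based convolution."""
    return n * math.log2(n)

n = 1_000_000
ratio = naive_cost(n) / fft_cost(n)
print(f"naive: {naive_cost(n):.1e}  fft: {fft_cost(n):.1e}  ratio: {ratio:,.0f}x")
```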
The Fourier transform decomposes a signal into a sum of waves at different frequencies. Here's the intuition for the theorem: convolving with a filter is really asking “how much does the filter respond to each frequency present in the signal?” In the frequency domain, that question becomes trivial — for each frequency, you just multiply the signal's amount of that frequency by the filter's response to it. No sliding needed; each frequency is handled independently with one multiplication. Transform to where the problem is easy (frequency), solve it cheaply (multiply), transform back. That's the FFT convolution.
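"Transform, multiply, transform back" fits in a short sketch. The recursive radix-2 FFT below is a textbook version written for readability; real code would typically call an optimized library such as `numpy.fft`. The function names are ours. One detail the prose skips: both inputs are zero-padded first, because the FFT computes a circular convolution that would otherwise wrap around.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_causal_conv(signal, kernel):
    """Causal convolution via the convolution theorem."""
    n = len(signal)
    size = 1
    while size < n + len(kernel):  # pad to a power of two large enough to avoid wrap-around
        size *= 2
    s = fft([complex(v) for v in signal] + [0j] * (size - n))
    h = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    product = [a * b for a, b in zip(s, h)]   # pointwise multiply in the frequency domain
    y = fft(product, inverse=True)            # unnormalized inverse transform
    return [v.real / size for v in y[:n]]     # normalize and keep the first n outputs

result = fft_causal_conv([2, 0, 1, 3], [1, 0.5])
print([round(v, 6) for v in result])  # matches the sliding-filter result, up to rounding
```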
The widget plots the cost of naive convolution (n-squared) against FFT convolution (n-log-n) as the sequence grows. At short lengths they're close (the FFT has overhead). Drag the length up and watch the naive curve explode while the FFT curve stays nearly flat — the same divergence that separates quadratic attention from sub-quadratic alternatives. The FFT is what keeps the long convolution on the cheap curve.
Cost vs sequence length. Naive sliding convolution is quadratic; FFT convolution is n log n. Drag the length and watch the gap widen — the FFT is what makes the long convolution affordable.
We have a long, cheap, global convolution. But there's a problem: a convolution is linear and data-independent. Its filter is fixed once trained — it applies the same mixing pattern to every input. Attention, by contrast, is data-dependent: the mixing weights depend on the actual content (the queries and keys change per input). That content-dependence is a big part of why attention is so powerful. A plain convolution can't match it. Hyena's third trick fixes this: data-controlled gating.
The idea: after (or around) the convolution, multiply the result, element by element, by another signal derived from the input itself. This is a gate — an input-dependent modulation. Because the gate is computed from the current input, the overall operation now behaves differently for different inputs, even though the convolution filter is fixed. The convolution provides the long-range mixing; the gate provides the content-dependence. Together they recover what attention has: global reach and data-adaptivity.
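A toy example makes the difference concrete. Here the gate is a scalar multiple of the input (`gate_weight` is a hypothetical stand-in for a learned projection), and the filter is fixed. The plain convolution is linear, so doubling the input doubles the output. The gated version is not: its effective mixing weights change with the content.

```python
def causal_conv(signal, kernel):
    """Causal convolution: output[t] = sum over k of kernel[k] * signal[t - k]."""
    return [sum(kernel[k] * signal[t - k] for k in range(min(t + 1, len(kernel))))
            for t in range(len(signal))]

def gated_conv(x, kernel, gate_weight=0.5):
    gate = [gate_weight * v for v in x]          # gate derived from the input itself
    mixed = causal_conv(x, kernel)               # fixed long-range mixing
    return [g * m for g, m in zip(gate, mixed)]  # element-wise, content-dependent modulation

x = [2.0, 0.0, 1.0, 3.0]
kernel = [1.0, 0.5, 0.25, 0.125]

plain = causal_conv(x, kernel)
gated = gated_conv(x, kernel)
print(plain)  # [2.0, 1.0, 1.5, 3.75]
print(gated)  # [2.0, 0.0, 0.75, 5.625]

doubled = [2 * v for v in x]
print(causal_conv(doubled, kernel))  # exactly 2x the plain output: linear
print(gated_conv(doubled, kernel))   # 4x the gated output: not linear in the input
```

Position 1 shows the suppression effect: the input there is 0, so the gate is 0 and the output is zeroed even though the convolution alone produced 1.0.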
Putting it together, the Hyena operator interleaves these two ingredients: project the input into several signals, then alternate — multiply by a gate (element-wise), convolve with a long implicit filter, multiply by another gate, convolve again, and so on. Each “long conv” mixes across positions (the global reach); each “gate” mixes data-dependently (the content adaptivity). The number of these alternating steps is the operator's order (more on that next chapter). This gate-conv-gate-conv recurrence is the Hyena layer that replaces attention.
The widget shows the same long convolution applied two ways. In plain mode, the output is the convolution — the same fixed mixing pattern regardless of input. Toggle gated and an input-derived gate modulates it: where the gate is low, the output is suppressed; where high, it passes. The gate makes the effective behavior depend on the content, which a fixed convolution alone can't do.
The convolution output (teal), and the gate (purple, derived from the input). In gated mode the output is multiplied by the gate — content-dependent modulation. Toggle to see the difference.
Now watch the whole thing run. This is the Hyena operator on a real signal: an input flows in, a small network generates a long implicit filter, the input is convolved with it (FFT-fast), and the result is gated by an input-derived signal. The three tricks — implicit filter, FFT convolution, data-controlled gating — working together, replacing attention.
Play with the controls to feel each ingredient's effect:
Top: input signal. Middle: the generated long filter. Bottom: the output (convolution, then optional gate). Adjust reach and gating; watch the n log n cost readout grow only gently as length increases.
No quiz — the operator is the test. If you can explain what each of the three controls (reach, gating, length) changes and why all three ingredients are necessary, you understand Hyena.
A single gate-then-convolve step is the basic unit. But Hyena's full operator stacks several of them in a recurrence, and the number of steps is called the operator's order. Higher order means more alternating gate-and-convolution stages, which gives the operator more expressive power — richer interactions between distant tokens. This is the “Hyena hierarchy.”
Here's the structure. The input is first projected into several signals — one set of “values” plus several “gates,” analogous to how attention projects to queries, keys, and values. Then the operator alternates: multiply by the first gate, convolve with a long filter, multiply by the next gate, convolve again, and so on, for order-many steps. Each gate-conv pair lets information interact across positions and then be re-modulated by content. Stacking them composes these interactions into something far richer than a single convolution — capturing the kind of multi-step, content-dependent dependencies that attention handles in one shot.
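The alternation can be sketched on a single scalar channel. This is a toy under loud assumptions: the projections are fixed scalar multiples rather than learned matrices, the filters are plain exponential decays standing in for implicit filters, and the convolution is the slow sliding version rather than the FFT. It follows the gate-then-convolve ordering described above and is not the reference implementation.

```python
import math

def causal_conv(signal, kernel):
    """Causal convolution: output[t] = sum over k of kernel[k] * signal[t - k]."""
    return [sum(kernel[k] * signal[t - k] for k in range(min(t + 1, len(kernel))))
            for t in range(len(signal))]

def decaying_filter(length, rate):
    """Stand-in for an implicit filter: a simple exponential decay."""
    return [math.exp(-rate * t) for t in range(length)]

def hyena_operator(x, order=2):
    n = len(x)
    z = [1.0 * v for v in x]                     # "value" projection (hypothetical weight 1.0)
    for step in range(order):
        gate_weight = 0.5 / (step + 1)           # hypothetical per-step gate projection
        gate = [gate_weight * v for v in x]
        kernel = decaying_filter(n, rate=0.3 * (step + 1))
        z = [g * v for g, v in zip(gate, z)]     # gate: content-dependent, element-wise
        z = causal_conv(z, kernel)               # long conv: mixes across positions
    return z

y = hyena_operator([2.0, 0.0, 1.0, 3.0], order=2)
print([round(v, 4) for v in y])
```

Each extra step multiplies by the input once more, so in this toy an order-N operator is a degree-(N+1) polynomial of the input. That is one way to see why higher order means richer interactions.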
Don't confuse two kinds of depth. The order is depth within one Hyena operator (how many gate-conv steps). Then, like any architecture, you stack many Hyena operators into blocks — each block being a Hyena operator (replacing attention) plus a feed-forward network, with residual connections and normalization, exactly the transformer scaffolding. So a Hyena model is: many blocks, each containing one order-N Hyena operator. Same “swap the token mixer, keep the scaffolding” pattern as every other attention alternative in this series.
The widget shows a Hyena operator's internal structure. Drag the order: at order 1 it's a single gate-convolution; raise it and watch more gate-conv stages chain together, each adding a round of content-dependent long-range mixing. Notice the projections at the start (values + gates) and the alternating pattern — the recurrence that defines the operator's expressiveness.
Input projected into values + gates, then alternating gate × long-conv steps. Drag the order to add stages. Order 2 is the common choice; higher = more expressive, more cost.
Hyena sits in the sub-quadratic family with the recurrent models, but it reaches global mixing through convolution rather than a token-by-token recurrence. The most beautiful fact in this whole area is that these two roads — convolution and recurrence — are secretly the same road. Seeing why ties the entire family together.
Here's the equivalence. Take a simple linear recurrence: each step, multiply the state by a fixed factor and add the new input (the simplest form of the retention/SSM update). Unroll it, and the output at position n is a weighted sum of the current and all past inputs, where the input k steps back is weighted by that factor to the k-th power. But a “weighted sum of past inputs by a fixed per-distance weight” is precisely a convolution with a filter whose values are those powers. So a linear recurrence and a long convolution compute the same thing: one incrementally (recurrent), one all-at-once (convolutional). The decay you met in RetNet is exactly such a filter: a long convolution kernel of the kind Hyena could generate.
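The identity is short enough to verify directly. In this minimal sketch the decay factor `gamma` and both function names are ours; the recurrence keeps a single scalar state, and the convolution uses the filter 1, gamma, gamma², and so on.

```python
def linear_recurrence(x, gamma):
    """state = gamma * state + input, emitted at every step."""
    state, out = 0.0, []
    for v in x:
        state = gamma * state + v
        out.append(state)
    return out

def power_filter_conv(x, gamma):
    """Causal convolution with the filter [1, gamma, gamma**2, ...]."""
    kernel = [gamma ** k for k in range(len(x))]
    return [sum(kernel[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

x = [2.0, 0.0, 1.0, 3.0]
rec = linear_recurrence(x, gamma=0.5)
conv = power_filter_conv(x, gamma=0.5)
print(rec)   # [2.0, 1.0, 1.5, 3.75]
print(conv)  # [2.0, 1.0, 1.5, 3.75]
```

The recurrence does constant work per token and carries one number forward; the convolution sees the whole sequence at once and can be handed to the FFT. Same outputs, different compute pattern.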
So what does Hyena specifically contribute? Its emphasis on the convolution view and implicit filters. Where SSMs parameterize the long convolution through a structured linear recurrence (specific math constraints), Hyena parameterizes it directly and freely as an implicit-filter network — more flexible, fewer structural assumptions. And Hyena's explicit gate-conv recurrence (the order) is its own expressiveness mechanism. Hyena showed you don't need the SSM's particular structure to get a good long convolution — a freely-generated implicit filter works, and is conceptually simpler.
Hyena shares the family's strengths (sub-quadratic, parallel, long-context) and its central weakness: because the filter is data-independent (the gating helps, but the core convolution applies the same filter to every input), it can struggle with the precise, content-addressed recall that attention and the data-dependent SSMs (Mamba) excel at. The field's trajectory has favored data-dependent state updates (Mamba's selectivity) over Hyena's fixed-filter-plus-gating. Even so, Hyena's convolution framing, implicit filters, and the recurrence-equals-convolution insight were deeply influential, especially in genomics and other very-long-sequence domains where its efficiency shines.
The widget shows a linear recurrence and a long convolution side by side, computing the same output. Step the recurrence and watch its unrolled weights trace out exactly the convolution filter. This is the identity that unifies the whole sub-quadratic family — convolution and recurrence are two views of one computation.
A linear recurrence (decay ×, add) unrolled gives weights gamma^distance — which IS a convolution filter. Step it and watch the recurrence's effective weights match the convolution kernel exactly.
You now understand Hyena fully: the convolution road around quadratic attention, why filter length is reach, the long-filter parameter problem, implicit filters that decouple length from parameters, the FFT that makes long convolution sub-quadratic, data-controlled gating for content-dependence, the operator and its order, and the deep recurrence-equals-convolution duality that unifies the whole family. The thread: a sequence-length convolution mixes all tokens like attention — made affordable by implicit filters, the FFT, and gating.
“To mix everything with everything, you need not compare every pair — sometimes one long, well-shaped sweep will do.”