Hyena — Attention by Convolution

Chapter 0: Another Road Around the Wall

We've seen one family escape attention's quadratic cost through recurrence — RWKV, Mamba, RetNet, xLSTM all carry a fixed-size state that updates per token. Hyena (Poli et al., 2023) takes a completely different road to the same destination: convolution. Not the short convolutions of a CNN that see only nearby pixels — a convolution with a filter as long as the entire sequence, so it mixes every token with every other, just like attention.

Why a long convolution is attention-like

Think about what a convolution does: it slides a filter over the input, and each output position is a weighted blend of nearby inputs, with the filter giving the weights. A short filter (say 3 wide) blends only immediate neighbors — local context, what CNNs do. But make the filter as long as the sequence, and each output position blends all the inputs, weighted by the filter. That “every output depends on every input” is exactly the global mixing attention provides. A long-enough convolution is a form of global token mixing.

So why isn't this obvious and old? Two problems blocked it, and Hyena's contribution is solving both. First, a filter as long as the sequence has as many parameters as the sequence is long — too many, and you can't fix the length in advance. Second, naively convolving with a length-n filter costs n-squared — right back to the quadratic wall. Hyena fixes the first with implicit filters (generate the long filter from a tiny network) and the second with the Fast Fourier Transform (compute the convolution in n-log-n time). Add data-controlled gating for the data-dependence attention has, and you have an attention alternative built from convolutions.

The one-sentence version. Hyena replaces attention with a long convolution (a filter spanning the whole sequence, so it mixes all tokens) made practical by three tricks: the filter is generated implicitly by a small network (few parameters, any length), computed via the FFT (sub-quadratic, n-log-n), and interleaved with data-controlled gating (for content-dependence). Convolution, not recurrence, as the road around the quadratic wall.

Recurrence vs. convolution

It's worth holding both roads in mind. The recurrent family processes the sequence left-to-right, carrying a state. Hyena processes the whole sequence at once with a global convolution — inherently parallel, no sequential scan. (Intriguingly, the two are deeply related: a linear recurrence is a particular long convolution, a connection we'll return to with state-space models.) For now, the key contrast: Hyena is a parallel, FFT-based convolution, not a token-by-token recurrence.

See it: two roads around the wall

The widget contrasts the approaches. Quadratic attention mixes all pairs directly (expensive). The recurrent road carries a state along the sequence. The Hyena road applies one long convolution across the whole sequence at once. Click each to see how it mixes information and its cost — all three achieve global mixing, by very different means.

Three Roads to Global Token Mixing

Click each approach to see how it mixes tokens and its cost. Attention: all-pairs (n²). Recurrence: a state along the sequence (n). Hyena: one long convolution, FFT-computed (n log n).

Common misconception. “Convolutions are for images / local patterns, not for long-range sequence modeling.” That's true of short convolutions. The whole insight of Hyena (and of state-space models) is that a long convolution — spanning the entire sequence — captures global, long-range dependencies, exactly the regime people thought only attention could handle. The filter's length is what matters; long filters reach as far as attention does.

How does Hyena achieve attention-like global token mixing?

By carrying a fixed-size recurrent state token by token
With a convolution whose filter is as long as the whole sequence, so each output blends all inputs, made practical by implicit filters, the FFT, and data-controlled gating
By using many short convolutions stacked deeply

Chapter 1: Convolution — A Filter That Slides

Let's make sure convolution is crystal clear, because everything builds on it. A convolution takes an input signal and a filter (also called a kernel) — a short list of weights — and slides the filter along the input. At each position, it multiplies the overlapping values and sums them, producing one output value. Slide one step, repeat. The output at each position is a weighted blend of the input values the filter currently covers.

The filter's length is its reach

The crucial property: a filter of length L lets each output position “see” L input positions. A length-3 filter blends a value with its two neighbors — tiny, local reach. A length-100 filter blends a value with 99 of its neighbors — far wider reach. The filter length is the receptive field. CNNs use short filters and stack many layers to slowly grow their reach; Hyena's shortcut is to use one filter that's already as long as the sequence, getting global reach in a single operation.

The filter's shape decides what it does. A filter that's all positive and equal is a blur (averaging). A filter like [-1, 1] is an edge detector (difference of neighbors). A long filter that decays with distance can emphasize recent context while still reaching far. In Hyena, the model learns what the long filter's shape should be, effectively learning a custom, global “attention pattern” baked into a convolution kernel. The filter is where the model's token-mixing strategy lives.

Worked example: a tiny convolution

Input signal [2, 0, 1, 3], filter [1, 0.5] (length 2). Slide the filter and compute each output as input-times-filter, summed over the overlap (here, current value × 1 plus previous value × 0.5):

position	computation	output
0	2×1 (no previous)	2.0
1	0×1 + 2×0.5	1.0
2	1×1 + 0×0.5	1.0
3	3×1 + 1×0.5	3.5

Each output mixes the current value with a fraction of the previous one — this short filter has a reach of 2. To make each output depend on values 50 steps back, you'd need a filter of length 51. And to make every output depend on every input (global mixing, like attention), the filter must be as long as the whole sequence. That's the long convolution — and the next chapters are about making such a giant filter affordable.
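The table above can be reproduced in a few lines of Python. This is a minimal sketch rather than anything from Hyena itself: `causal_conv` is a hypothetical helper name, and it assumes the causal convention of the worked example (each output sees the current value and earlier ones only).

```python
def causal_conv(signal, kernel):
    """Causal convolution: out[t] = sum over k of kernel[k] * signal[t - k]."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # positions before the start contribute nothing
                total += weight * signal[t - k]
        out.append(total)
    return out


signal = [2, 0, 1, 3]
kernel = [1, 0.5]  # kernel[0] weights the current value, kernel[1] the previous one
outputs = causal_conv(signal, kernel)
print(outputs)  # [2.0, 1.0, 1.0, 3.5]
```

Note the nested loop: each output costs one multiplication per filter entry, which is exactly why a sequence-length filter computed this way becomes quadratic.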

See it: filter length = reach

The widget convolves a signal with a filter whose length you control. At length 3 the output is a lightly-smoothed version of the input (local mixing). Crank the length up and watch the output become a broad blend of the whole signal — each position now influenced by far-away values. That growing reach is the path from local CNN to global, attention-like mixing.

Filter Length Controls Reach

A signal (gray) convolved with a filter of adjustable length (teal output). Short = local smoothing; long = global blend. The filter length is the receptive field.

Common misconception. “A bigger filter just blurs more.” Only if the filter is a uniform average. A long filter with a learned, structured shape can do far more than blur — it can pick out specific long-range patterns, emphasize particular distances, or implement complex mixing. Hyena's long filters aren't blurs; they're learned, expressive global mixing patterns. The length gives reach; the shape gives selectivity.
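To make "length gives reach, shape gives selectivity" concrete, here is a small sketch comparing two filters of the same length. Both filters are hand-built for illustration (nothing here is learned), and the lag of 50 is an arbitrary choice: one filter averages the last 51 values, the other puts all its weight on the value exactly 50 steps back.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)

lag = 50
uniform = np.ones(lag + 1) / (lag + 1)  # a blur: equal weight on the last 51 values
delayed = np.zeros(lag + 1)
delayed[lag] = 1.0                      # all weight on the value exactly 50 steps back

# np.convolve returns the full convolution; its first len(x) outputs are the causal part.
blurred = np.convolve(x, uniform)[: len(x)]
echoed = np.convolve(x, delayed)[: len(x)]

print(np.allclose(echoed[lag:], x[:-lag]))  # True: a pure 50-step delay, no blurring
```

Same length, same reach, completely different behavior: the uniform filter smooths the signal, while the delayed filter copies one specific distant value untouched.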

What determines how far back a convolution's output can “see”?

The number of channels
The filter's length: a length-L filter blends L input positions, so a sequence-length filter blends every input (global reach)
The learning rate

Chapter 2: The Long-Filter Problem

We want a filter as long as the sequence, for global reach. But a normal convolution filter stores one weight per filter position. So a filter long enough to span a 100,000-token sequence would need 100,000 parameters — just for one filter, in one layer. That's the first wall blocking long convolutions, and it has two distinct problems.

Problem one: too many parameters

Storing a weight per position means the parameter count grows with the sequence length. A short CNN filter (length 3) has 3 weights — trivial. A sequence-length filter has as many weights as the sequence is long. For long-context models that's enormous — and most of those weights would be poorly trained, since each only affects one specific distance. You'd have a bloated, hard-to-train layer.

Problem two: the length is fixed

Worse, an explicit filter has a fixed length: you must decide it at model-creation time. Train a model with a length-2048 filter and it can never give global reach on a 4096-token sequence: the filter literally isn't long enough, and there are no weights for the extra positions. The model's reach is hard-coded by its parameter count. That rigidity is fatal for a general long-context model, where you want to handle any length.

The core tension. We want a filter that is (a) very long (global reach), (b) cheap in parameters, and (c) able to handle any sequence length. An explicit per-position filter gives us (a) but fails (b) and (c) — its parameters and its length are welded together. The breakthrough is to decouple the filter's length from its parameter count: describe the filter not as a list of weights, but as a function that can produce a weight for any position. That's the implicit filter, next chapter.

Worked example: the parameter blowup

Consider one layer's filter at different reaches. A length-3 filter: 3 parameters. Length-128: 128 parameters. Length-8192 (a long context): 8,192 parameters — per channel, per layer. With hundreds of channels and dozens of layers, explicit long filters would add hundreds of millions of parameters whose sole job is the convolution kernels, most barely trained. And if you later want length-16384, you're stuck — the filter can't stretch. The parameter cost and the inflexibility both come from the same root: one stored weight per position.
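The arithmetic behind that estimate is easy to check. The channel and layer counts below are assumptions picked to stand in for "hundreds of channels and dozens of layers"; the text does not give exact figures.

```python
channels = 512  # assumed model width
layers = 24     # assumed model depth


def explicit_filter_params(length, channels, layers):
    """One stored weight per position, per channel, per layer."""
    return length * channels * layers


for length in (3, 128, 8192):
    print(length, explicit_filter_params(length, channels, layers))
# 3 36864
# 128 1572864
# 8192 100663296
```

Under these assumptions the length-8192 filters alone come to roughly 100 million parameters, and doubling the reach doubles that figure.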

See it: parameters grow with reach

The widget plots the parameter count of an explicit filter against its length. Drag the desired reach: the parameter count climbs linearly with it, and the “max sequence length” is pinned to whatever you chose. There's no way to get long reach cheaply, or to handle a longer sequence than you trained for. This is the wall the implicit filter knocks down.

Explicit Filter: Parameters Welded to Length

An explicit filter stores one weight per position, so its parameter count equals its length — and that length caps the sequence it can handle. Drag the reach and watch both rise together.

Common misconception. “Just use a giant filter and let training sort out the weights.” Even setting aside the parameter cost, you'd still be locked to that fixed length forever, and most of those millions of independent weights would get almost no gradient signal (each affects exactly one distance). The problem isn't only quantity of parameters — it's that tying parameters one-to-one with positions is the wrong representation for a long, smooth filter. There's a better way to describe one.

Why can't you simply use an explicit (per-position) filter that's as long as the sequence?

Convolutions can't be longer than 3
It needs one parameter per position (so params grow with length), and its length is fixed at creation, so you can't handle longer sequences than you trained for
Long filters always blur the signal

Chapter 3: Implicit Filters — Generate, Don’t Store

Here's Hyena's first key trick, and it dissolves the long-filter problem entirely. Instead of storing the filter as a list of weights, generate it from a small network. The filter becomes a function: feed in a position (“what's the filter weight at distance 37?”) and the small network outputs the value. This is an implicit filter — the filter is represented implicitly, by a function, not explicitly, by stored numbers.

Decoupling length from parameters

The magic of this move: the small generating network has a fixed number of parameters, regardless of how long a filter it produces. Want a length-2048 filter? Call the network for positions 0 through 2047. Want length-100,000? Call it for positions 0 through 99,999. Same network, same parameters — you just evaluate it at more positions. The filter's length is now completely decoupled from its parameter count. Both problems from the last chapter vanish: few parameters (just the small net) and any length (evaluate the net wherever you need).

This is an implicit neural representation. The same idea powers things like NeRF and SIREN: rather than storing a signal as a grid of values, you train a small network that maps coordinates to values, and you can sample it at any resolution. Hyena's filter generator is exactly this — a network that maps a position (a coordinate along the filter) to a weight. It can produce an arbitrarily long, smooth filter from a handful of parameters, because the network encodes the filter's shape as a function, not as a lookup table. Shape is cheap to store; a lookup table is not.

How the filter gets its shape

The generating network takes a positional encoding of the distance (often a set of sinusoids, like a transformer's positional encoding) and passes it through a small MLP to produce the filter weight at that distance. Because it's a smooth function of position, the filter comes out smooth — nearby positions get similar weights — which is a sensible inductive bias for a long filter. Hyena also typically multiplies in an explicit decay window (an exponential that fades with distance), so the long filter naturally emphasizes nearer context while still being able to reach far. The network learns the filter's shape; the decay sets its reach.
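A rough sketch of such a generator, under stated assumptions: the weights are random (untrained), and the sizes, frequencies, and names (`implicit_filter`, `N_FREQS`, `HIDDEN`) are hypothetical choices for illustration, not values from the paper. What it does follow from the description above is the pipeline: sinusoidal features of the position, a small MLP, then an exponential decay window.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FREQS, HIDDEN = 8, 16                      # hypothetical sizes
FREQS = 2.0 ** -np.arange(N_FREQS)           # fixed sinusoid frequencies
W1 = rng.normal(size=(2 * N_FREQS, HIDDEN)) / np.sqrt(2 * N_FREQS)
B1 = np.zeros(HIDDEN)
W2 = rng.normal(size=HIDDEN) / np.sqrt(HIDDEN)
N_PARAMS = W1.size + B1.size + W2.size       # does not depend on filter length


def implicit_filter(length, decay=0.97):
    """Evaluate the filter at positions 0..length-1."""
    t = np.arange(length)[:, None]                           # one row per position
    feats = np.concatenate([np.sin(t * FREQS), np.cos(t * FREQS)], axis=1)
    hidden = np.sin(feats @ W1 + B1)                         # small MLP, sine activation
    weights = hidden @ W2                                    # one weight per position
    return weights * decay ** np.arange(length)              # exponential decay window


h_short = implicit_filter(2048)
h_long = implicit_filter(100_000)
print(N_PARAMS, h_short.shape, h_long.shape)  # 288 (2048,) (100000,)
```

The longer filter is the same function evaluated further out: its first 2048 entries agree with the short one, and the parameter count never changes.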

See it: a generated long filter

The widget shows a small network generating a long filter, position by position. Drag the filter length: the generated filter gets longer, but the parameter count (the small net) stays fixed — watch the params readout not move while the filter stretches. Adjust the decay to see the filter's reach change. This is a long, smooth, expressive filter for almost no parameters — the whole point.

An Implicit Filter: Generated, Not Stored

A small network produces the filter weight at each position. Drag the length: the filter stretches but the parameter count stays fixed. The decay sets how far it reaches. Length decoupled from parameters.

Common misconception. “If the filter is generated by a small network, it must be less expressive than a stored one.” In practice it's more useful: the generating network encodes a smooth, structured filter that can be evaluated at any length, where a stored filter is a pile of independent, noisily-trained weights. And you can run the trained model on sequences longer than any you trained on, just by sampling the filter network further out (how well quality holds up at unseen lengths still has to be checked empirically). Generating the filter is a feature, not a compromise: it buys both parameter efficiency and length flexibility.

How does an implicit filter solve both the parameter and the fixed-length problems at once?

It uses a shorter filter
The filter is generated by a small fixed-size network that maps position → weight, so the parameter count is constant regardless of filter length, and you can sample any length you need
It stores the filter in compressed form

Chapter 4: The FFT — Convolution Without the Square

We've made the filter cheap (implicit) and long (any length). But there's still the second wall: convolving a length-n signal with a length-n filter, done naively, costs n-squared — for every output position, you sum over the whole filter. That's the quadratic cost we were trying to escape! The fix is one of the most beautiful results in all of computing: the Fast Fourier Transform, which turns a quadratic convolution into an n-log-n one.

The convolution theorem

Here's the deep fact, called the convolution theorem: convolution in the time domain equals pointwise multiplication in the frequency domain. In plain terms — the slow operation of sliding a filter and summing overlaps (convolution) becomes, after a Fourier transform, the fast operation of just multiplying two lists element by element. Sliding-and-summing is expensive; element-wise multiplying is cheap. The Fourier transform is the bridge between them.

The recipe: transform, multiply, transform back. To convolve signal and filter cheaply: (1) Fourier-transform both into the frequency domain; the FFT does each transform in n-log-n time. (2) Multiply them pointwise, which is just n multiplications. (3) Inverse-transform the result back to the time domain with another n-log-n FFT. Total cost: n-log-n, dominated by the three transforms (two forward, one inverse). The expensive sliding-window convolution never happens; you replaced it with a cheap pointwise multiply sandwiched between fast transforms.
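The three steps map directly onto NumPy's FFT routines. This is a minimal sketch, assuming a causal convolution and real-valued signals; the function names are made up for illustration, and the zero-padding to twice the length is one common way to keep the result from wrapping around.

```python
import numpy as np


def naive_causal_conv(x, h):
    """Reference implementation: quadratic sliding sum."""
    n = len(x)
    y = np.zeros(n)
    for t in range(n):
        for k in range(min(t + 1, len(h))):
            y[t] += h[k] * x[t - k]
    return y


def fft_causal_conv(x, h):
    """Same result via transform, multiply, transform back."""
    n = len(x)
    size = 2 * n                      # zero-pad so the circular product cannot wrap
    X = np.fft.rfft(x, n=size)        # (1) forward transforms
    H = np.fft.rfft(h, n=size)
    y = np.fft.irfft(X * H, n=size)   # (2) pointwise multiply, (3) inverse transform
    return y[:n]                      # keep the first n (causal) outputs


rng = np.random.default_rng(0)
x = rng.normal(size=256)
h = rng.normal(size=256)
print(np.allclose(naive_causal_conv(x, h), fft_causal_conv(x, h)))  # True
```

The two functions agree to floating-point precision; only the cost differs.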

Why n-log-n is the whole game

The difference between n-squared and n-log-n is enormous at scale; n-log-n is almost as cheap as linear. At n = 1,000,000: n-squared is a trillion operations; n-log-n (with a base-2 logarithm) is about 20 million, roughly 50,000× cheaper. So Hyena's long convolution, computed via FFT, costs n-log-n: sub-quadratic, nearly linear. It mixes every token with every other (global, attention-like reach) without ever paying the quadratic price, because the FFT lets it skip the all-pairs summation entirely. This is the second key trick, and it's what makes the long convolution actually affordable.
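Those figures can be checked directly. The sketch assumes a base-2 logarithm and ignores constant factors, so it compares growth rates rather than real running times.

```python
import math

n = 1_000_000
quadratic = n ** 2               # naive convolution: one sum over the filter per output
n_log_n = n * math.log2(n)       # FFT convolution, up to a constant factor
ratio = quadratic / n_log_n

print(f"n^2       = {quadratic:.3e}")
print(f"n log2 n  = {n_log_n:.3e}")
print(f"ratio     = {ratio:,.0f}")
```

The ratio comes out just above 50,000, matching the rough figure in the text.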

Worked intuition: why frequency-domain multiply works

The Fourier transform decomposes a signal into a sum of waves at different frequencies. Here's the intuition for the theorem: convolving with a filter is really asking “how much does the filter respond to each frequency present in the signal?” In the frequency domain, that question becomes trivial — for each frequency, you just multiply the signal's amount of that frequency by the filter's response to it. No sliding needed; each frequency is handled independently with one multiplication. Transform to where the problem is easy (frequency), solve it cheaply (multiply), transform back. That's the FFT convolution.

See it: n-log-n vs n-squared

The widget plots the cost of naive convolution (n-squared) against FFT convolution (n-log-n) as the sequence grows. At short lengths they're close (the FFT has overhead). Drag the length up and watch the naive curve explode while the FFT curve stays nearly flat — the same divergence that separates quadratic attention from sub-quadratic alternatives. The FFT is what keeps the long convolution on the cheap curve.

Naive Convolution (n²) vs FFT Convolution (n log n)

Cost vs sequence length. Naive sliding convolution is quadratic; FFT convolution is n log n. Drag the length and watch the gap widen — the FFT is what makes the long convolution affordable.

Common misconception. “FFT convolution is an approximation.” It's exact: the convolution theorem is a precise mathematical identity, so FFT convolution computes the same result as the naive sliding convolution (up to floating-point rounding), just far faster. (The only subtlety is zero-padding so that the FFT's circular convolution does not wrap around the ends of the finite sequence, a standard detail.) You're not trading accuracy for speed; you're computing the same thing in a smarter domain. Centuries-old math (Fourier) quietly powering a 2023 architecture.
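The padding detail is easy to see on the tiny example from Chapter 1 (input [2, 0, 1, 3], filter [1, 0.5]). The sketch below is illustrative only: without padding, the FFT product is a circular convolution, so the last input wraps around and leaks into the first output.

```python
import numpy as np

x = np.array([2.0, 0.0, 1.0, 3.0])
h = np.array([1.0, 0.5, 0.0, 0.0])  # the filter [1, 0.5], zero-extended to the signal length

# No padding: circular convolution. Output 0 picks up 0.5 * x[3] from the wrap-around.
circular = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h), n=len(x))

# Zero-padded to twice the length: the wrap-around lands in the padding and is discarded.
size = 2 * len(x)
linear = np.fft.irfft(np.fft.rfft(x, n=size) * np.fft.rfft(h, n=size), n=size)[: len(x)]

print(np.round(circular, 6))  # [3.5 1.  1.  3.5]
print(np.round(linear, 6))    # [2.  1.  1.  3.5]
```

Only the padded version matches the table in Chapter 1. For a causal sequence model the unpadded version would be worse than inaccurate: it lets the end of the sequence influence the beginning.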

How does the FFT make a long convolution sub-quadratic?

It skips most of the signal
By the convolution theorem: transform signal and filter to the frequency domain (FFT, n log n), multiply them pointwise (cheap), and transform back, replacing the n² sliding sum with n log n total
By using a shorter filter in the frequency domain

Chapter 5: Data-Controlled Gating — The Missing Ingredient

We have a long, cheap, global convolution. But there's a problem: a convolution is linear and data-independent. Its filter is fixed once trained — it applies the same mixing pattern to every input. Attention, by contrast, is data-dependent: the mixing weights depend on the actual content (the queries and keys change per input). That content-dependence is a big part of why attention is so powerful. A plain convolution can't match it. Hyena's third trick fixes this: data-controlled gating.

Gating injects data-dependence

The idea: after (or around) the convolution, multiply the result, element by element, by another signal derived from the input itself. This is a gate — an input-dependent modulation. Because the gate is computed from the current input, the overall operation now behaves differently for different inputs, even though the convolution filter is fixed. The convolution provides the long-range mixing; the gate provides the content-dependence. Together they recover what attention has: global reach and data-adaptivity.

Multiplication is where the data-dependence lives. Note it's an element-wise multiply, not an add. Multiplying the convolution output by an input-derived gate means the input controls how much of each mixed feature passes through — a value can be amplified, suppressed, or zeroed depending on content. This multiplicative interaction between two input-derived signals is what gives Hyena (and gated architectures generally) expressiveness comparable to attention's query-key-value interaction. Attention multiplies queries by keys; Hyena multiplies a convolved signal by a gate. Different mechanics, same multiplicative-interaction principle.
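One way to see what the gate adds is to test linearity. In this sketch the filter is a fixed exponential decay standing in for a learned one, and the gate is just a scaled copy of the input, a deliberately simple stand-in for a learned projection (`w_gate` is a hypothetical parameter). A plain convolution treats a sum of inputs as the sum of their outputs; the gated version does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
h = 0.9 ** np.arange(n)  # fixed long filter (stand-in for a learned one)
w_gate = 1.5             # hypothetical scalar projection producing the gate


def long_conv(x):
    """Causal convolution with the fixed filter: data-independent mixing."""
    return np.convolve(x, h)[: len(x)]


def gated_conv(x):
    """Element-wise multiply of the convolution output by an input-derived gate."""
    gate = w_gate * x
    return gate * long_conv(x)


a, b = rng.normal(size=n), rng.normal(size=n)
plain_is_linear = np.allclose(long_conv(a + b), long_conv(a) + long_conv(b))
gated_is_linear = np.allclose(gated_conv(a + b), gated_conv(a) + gated_conv(b))
print(plain_is_linear, gated_is_linear)  # True False
```

The filter `h` is identical in both functions; the difference in behavior comes entirely from the multiplication by a signal computed from the input.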

The Hyena operator

Putting it together, the Hyena operator interleaves these two ingredients. Project the input into several signals, then alternate: multiply by a gate (element-wise), convolve with a long implicit filter, multiply by another gate, convolve again, and so on. Each “long conv” mixes across positions (the global reach); each “gate” mixes data-dependently (the content adaptivity). The number of these alternating steps is the operator's order (more on that in Chapter 7). This gate-conv-gate-conv recurrence is the Hyena layer that replaces attention.

See it: convolution alone vs. gated

The widget shows the same long convolution applied two ways. In plain mode, the output is the convolution — the same fixed mixing pattern regardless of input. Toggle gated and an input-derived gate modulates it: where the gate is low, the output is suppressed; where high, it passes. The gate makes the effective behavior depend on the content, which a fixed convolution alone can't do.

Plain Convolution vs. Gated

The convolution output (teal), and the gate (purple, derived from the input). In gated mode the output is multiplied by the gate — content-dependent modulation. Toggle to see the difference.

Common misconception. “A convolution can already learn data-dependence during training.” A trained convolution learns a fixed filter that's applied identically to every input — it adapts during training, but not at inference to the specific content in front of it. Attention's power is that it computes different mixing weights for each input at inference. Gating is how Hyena gets that inference-time, content-specific behavior: the gate is recomputed from each input. Without gating, a long convolution is just a (very long) fixed filter.

Why does Hyena need data-controlled gating in addition to the long convolution?

To make the convolution shorter
A convolution applies a fixed, data-independent filter; multiplying by an input-derived gate makes the operation content-dependent at inference, recovering the data-adaptivity that makes attention powerful
To reduce the parameter count

Chapter 6: The Hyena Operator — End to End

Now watch the whole thing run. This is the Hyena operator on a real signal: an input flows in, a small network generates a long implicit filter, the input is convolved with it (FFT-fast), and the result is gated by an input-derived signal. The three tricks — implicit filter, FFT convolution, data-controlled gating — working together, replacing attention.

Play with the controls to feel each ingredient's effect:

Filter decay — how far the long convolution reaches. Short reach = local mixing; long reach = global, attention-like mixing.
Gating — toggle the data-dependent modulation that makes the operation content-adaptive.
Sequence length — watch the cost stay on the n-log-n curve no matter how long.

The Hyena Operator: Generate → Convolve → Gate

Top: input signal. Middle: the generated long filter. Bottom: the output (convolution, then optional gate). Adjust reach and gating; watch the cost readout stay on the n log n curve as length grows.

What to take away. Each piece earns its place. The implicit filter gives a long, cheap, flexible kernel. The FFT computes the long convolution in n-log-n. The gate restores the data-dependence a fixed filter lacks. Remove any one and it falls apart: no implicit filter = too many parameters; no FFT = back to quadratic; no gate = a fixed, content-blind mixer. Together they form a sub-quadratic, parallel, content-adaptive global mixer — an attention replacement built from convolutions and the Fourier transform.

Common misconception. “Hyena is just a fancy CNN.” A CNN uses short, stored, data-independent filters and builds reach by stacking many layers. Hyena uses a single sequence-length, generated, FFT-computed filter with data-controlled gating — global reach in one operation, content-adaptive, sub-quadratic. The convolution machinery is shared with CNNs, but the long-implicit-gated combination is a different beast aimed squarely at replacing attention.

No quiz — the operator is the test. If you can explain what each of the three controls (reach, gating, length) changes and why all three ingredients are necessary, you understand Hyena.

Chapter 7: The Hyena Hierarchy — Order and Depth

A single gate-then-convolve step is the basic unit. But Hyena's full operator stacks several of them in a recurrence, and the number of steps is called the operator's order. Higher order means more alternating gate-and-convolution stages, which gives the operator more expressive power — richer interactions between distant tokens. This is the “Hyena hierarchy.”

How the recurrence builds expressiveness

Here's the structure. The input is first projected into several signals — one set of “values” plus several “gates,” analogous to how attention projects to queries, keys, and values. Then the operator alternates: multiply by the first gate, convolve with a long filter, multiply by the next gate, convolve again, and so on, for order-many steps. Each gate-conv pair lets information interact across positions and then be re-modulated by content. Stacking them composes these interactions into something far richer than a single convolution — capturing the kind of multi-step, content-dependent dependencies that attention handles in one shot.
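The alternation described above can be written as a short loop. This is a toy, single-channel sketch under loud assumptions: the projections are plain scalars (`proj_weights`, hypothetical) rather than learned linear layers, the filters are random decaying vectors standing in for implicit filters, and the gate-then-convolve ordering simply follows the description in this chapter.

```python
import numpy as np


def fft_causal_conv(x, h):
    """Causal convolution via zero-padded FFT."""
    n = len(x)
    size = 2 * n
    return np.fft.irfft(np.fft.rfft(x, n=size) * np.fft.rfft(h, n=size), n=size)[:n]


def hyena_operator(u, proj_weights, filters):
    """Toy operator of order len(filters).

    u: input of shape (n,)
    proj_weights: order + 1 scalars standing in for learned projections
    filters: one length-n filter per step (stand-ins for implicit filters)
    """
    value, *gates = [w * u for w in proj_weights]  # "values" plus one gate per step
    z = value
    for gate, h in zip(gates, filters):
        z = gate * z               # element-wise gate: content-dependent
        z = fft_causal_conv(z, h)  # long convolution: mixes across positions
    return z


rng = np.random.default_rng(0)
n, order = 32, 2
u = rng.normal(size=n)
proj_weights = rng.normal(size=order + 1)
filters = [rng.normal(size=n) * 0.9 ** np.arange(n) for _ in range(order)]
y = hyena_operator(u, proj_weights, filters)
print(y.shape)  # (32,)
```

With these scalar projections an order-2 operator is cubic in its input (doubling `u` multiplies the output by 8), which is a simple way to see that each extra gate adds one more multiplicative interaction.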

Order is Hyena's analogue of multiplicative interaction depth. A standard attention layer does one round of query-key-value multiplication. Hyena's order-N operator does N rounds of gate-convolution, each a multiplicative-then-mixing step. Order 2 is the common, efficient choice (it is already competitive with attention on many tasks); higher orders add expressiveness at more cost. The order is a knob trading compute for the richness of the token interactions. It is not to be confused with the number of layers: you still stack many Hyena operators into a deep network, just as you stack transformer blocks.

One operator, then stack the blocks

Don't confuse two kinds of depth. The order is depth within one Hyena operator (how many gate-conv steps). Then, like any architecture, you stack many Hyena operators into blocks — each block being a Hyena operator (replacing attention) plus a feed-forward network, with residual connections and normalization, exactly the transformer scaffolding. So a Hyena model is: many blocks, each containing one order-N Hyena operator. Same “swap the token mixer, keep the scaffolding” pattern as every other attention alternative in this series.

See it: building the operator by order

The widget shows a Hyena operator's internal structure. Drag the order: at order 1 it's a single gate-convolution; raise it and watch more gate-conv stages chain together, each adding a round of content-dependent long-range mixing. Notice the projections at the start (values + gates) and the alternating pattern — the recurrence that defines the operator's expressiveness.

The Hyena Operator by Order

Input projected into values + gates, then alternating gate × long-conv steps. Drag the order to add stages. Order 2 is the common choice; higher = more expressive, more cost.

Common misconception. “Higher order is always better.” Each extra order adds another long convolution (another FFT) and gate, so cost grows with order. In practice order 2 already matches attention on many tasks, and going higher gives diminishing returns for the added compute. As with most of these designs, the win is finding the smallest order that captures the dependencies your task needs — not maxing it out. Order is a dial, not a “bigger is better” lever.

What does the “order” of a Hyena operator control, and how does it differ from the number of layers?

The filter length; it's the same as the layer count
Order = how many alternating gate-convolution steps within one operator (its expressiveness); layers = how many such operators you stack into a deep network. These are two separate kinds of depth
Order is the batch size

Chapter 8: Hyena, SSMs, and the Deep Connection

Hyena sits in the sub-quadratic family with the recurrent models, but it reaches global mixing through convolution rather than a token-by-token recurrence. The most beautiful fact in this whole area is that these two roads — convolution and recurrence — are secretly the same road. Seeing why ties the entire family together.

A linear recurrence is a long convolution

Here's the equivalence. Take a simple linear recurrence: each step, multiply the state by a factor and add the new input (exactly the retention/SSM update). Unroll it, and the output at position n is a weighted sum of the current and all past inputs, where the input k steps back is weighted by that factor to the k-th power. But a “weighted sum of past inputs by a fixed per-distance weight” is precisely a convolution with a filter whose values are those powers. So a linear recurrence and a long convolution compute the same thing: one incrementally (recurrent), one all-at-once (convolutional). The decay you met in RetNet is literally a convolution filter, one that Hyena's implicit filter could learn to produce.
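The equivalence can be verified numerically. A minimal sketch, assuming a scalar state and an arbitrary decay factor of 0.88: the loop computes the recurrence one token at a time, and the convolution uses a filter whose entry at lag k is the decay raised to the k-th power.

```python
import numpy as np

gamma = 0.88  # decay factor (arbitrary choice for the demo)
rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Recurrent form: state_t = gamma * state_{t-1} + x_t
state = 0.0
recurrent = []
for x_t in x:
    state = gamma * state + x_t
    recurrent.append(state)
recurrent = np.array(recurrent)

# Convolutional form: the filter value at lag k is gamma ** k
kernel = gamma ** np.arange(len(x))
convolutional = np.convolve(x, kernel)[: len(x)]

print(np.allclose(recurrent, convolutional))  # True
```

The loop needs only one number of memory per step, which suits token-by-token generation; the convolution handles the whole sequence at once, which suits parallel training.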

This unifies the whole family. State-space models (Mamba's ancestors, like S4) are explicitly built on this duality: they define a linear recurrence and compute it as a long convolution via FFT for training, then as a recurrence for inference — the same dual forms you saw in RetNet, now understood as “recurrence = convolution.” Hyena comes at it from the convolution side (long implicit filters + FFT + gating); SSMs come at it from the recurrence side (a structured linear system). They meet in the middle. RWKV, RetNet, xLSTM, Mamba, Hyena — all are, at heart, long convolutions / linear recurrences with data-dependent gating. The grand convergence again, now seen through the convolution lens.

What's distinctive about Hyena

So what does Hyena specifically contribute? Its emphasis on the convolution view and implicit filters. Where SSMs parameterize the long convolution through a structured linear recurrence (specific math constraints), Hyena parameterizes it directly and freely as an implicit-filter network — more flexible, fewer structural assumptions. And Hyena's explicit gate-conv recurrence (the order) is its own expressiveness mechanism. Hyena showed you don't need the SSM's particular structure to get a good long convolution — a freely-generated implicit filter works, and is conceptually simpler.

The honest tradeoffs

Hyena shares the family's strengths (sub-quadratic, parallel, long-context) and its central weakness: because the filter is data-independent (the gating helps, but the core convolution is the same for every input), it can struggle with the precise, content-addressed recall that attention and the data-dependent SSMs (Mamba) excel at. The field's trajectory has favored data-dependent state updates (Mamba's selectivity) over Hyena's fixed-filter-plus-gating. Even so, Hyena's convolution framing, implicit filters, and the recurrence-equals-convolution insight were deeply influential, especially in genomics and other very-long-sequence domains where its efficiency shines.

See it: the two roads meet

The widget shows a linear recurrence and a long convolution side by side, computing the same output. Step the recurrence and watch its unrolled weights trace out exactly the convolution filter. This is the identity that unifies the whole sub-quadratic family — convolution and recurrence are two views of one computation.

Recurrence = Long Convolution

A linear recurrence (multiply the state by the decay, then add the input), unrolled, gives weights gamma^distance, which IS a convolution filter. Step it and watch the recurrence's effective weights match the convolution kernel exactly. (Widget default: recurrence decay 0.88.)
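If you can't run the widget, the same check fits in a few lines of NumPy. This is a minimal sketch, assuming a scalar state, a zero initial state, and the decay value 0.88; the function names are made up for the example.

```python
import numpy as np

def run_recurrence(x, gamma):
    """Step h_t = gamma * h_{t-1} + x_t from h = 0, one token at a time."""
    h = 0.0
    out = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = gamma * h + x_t
        out[t] = h
    return out

def run_convolution(x, gamma):
    """Causal convolution of x with the kernel gamma^0, gamma^1, gamma^2, ..."""
    n = len(x)
    kernel = gamma ** np.arange(n)
    return np.convolve(x, kernel)[:n]

gamma = 0.88
x = np.random.default_rng(0).standard_normal(32)
y_rec = run_recurrence(x, gamma)
y_conv = run_convolution(x, gamma)

# Feeding a single spike through the recurrence reads off its effective weights.
impulse = np.zeros(8)
impulse[0] = 1.0
effective_weights = run_recurrence(impulse, gamma)

print("recurrence weights:", np.round(effective_weights, 4))
print("kernel gamma^k:    ", np.round(gamma ** np.arange(8), 4))
print("outputs agree:", np.allclose(y_rec, y_conv))
```

The impulse trick is the code version of stepping the widget: the response to a single spike is the filter.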

Common misconception. “Hyena and Mamba are competing, unrelated ideas.” They're two parameterizations of the same underlying object — a long convolution / linear recurrence. Mamba adds input-dependent (selective) parameters; Hyena uses a freely-generated implicit filter with gating. Understanding the recurrence-equals-convolution duality lets you see all the sub-quadratic models as points in one design space, rather than a confusing zoo of rival architectures. The duality is the unifying lens.

What is the deep connection between Hyena's long convolution and the recurrent models (SSMs, RetNet)?

There is none; they're unrelated
A linear recurrence, unrolled, weights past inputs by a fixed per-distance factor, which IS a convolution filter. So recurrence and long convolution compute the same thing; Hyena and SSMs are two views of it
Convolutions are always quadratic like attention

Chapter 9: Connections & Cheat Sheet

You now understand Hyena fully: the convolution road around quadratic attention, why filter length is reach, the long-filter parameter problem, implicit filters that decouple length from parameters, the FFT that makes long convolution sub-quadratic, data-controlled gating for content-dependence, the operator and its order, and the deep recurrence-equals-convolution duality that unifies the whole family. The thread: a sequence-length convolution mixes all tokens like attention — made affordable by implicit filters, the FFT, and gating.

The cheat sheet

The idea: replace attention with a convolution as long as the sequence (global mixing)

Filter length = reach: a sequence-length filter blends every token (attention-like)

Long-filter problem: explicit filter has 1 param/position; length welded to params

Implicit filter: a small net maps position → weight; fixed params, any length

FFT convolution: convolution = pointwise multiply in frequency domain → O(n log n)

Data-controlled gating: element-wise multiply by an input-derived gate → content-dependence

The operator: interleave (gate, long conv) for “order” steps; stack into blocks

Deep duality: a linear recurrence = a long convolution (gamma^distance filter) — unifies the family
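The middle rows of the cheat sheet (implicit filter, FFT convolution, gating, the operator) can be strung together in one small sketch. Treat it as a toy under stated assumptions, not the reference implementation: it uses a single channel, random untrained weights, hypothetical scalar projections in place of learned linear ones, and made-up frequencies and decay rate for the filter network.

```python
import numpy as np

def make_filter_params(rng, n_feats=8, hidden=16):
    """Fixed-size parameters of a tiny filter network (untrained, random)."""
    return (rng.standard_normal((n_feats, hidden)) / np.sqrt(n_feats),
            rng.standard_normal(hidden),
            rng.standard_normal((hidden, 1)) / np.sqrt(hidden))

def implicit_filter(n, params, decay=0.02):
    """Map positions 0..n-1 to n filter taps: sinusoids -> small MLP -> decay window."""
    W1, b1, W2 = params
    t = np.arange(n, dtype=np.float64)
    freqs = 2 * np.pi * np.array([1.0, 2.0, 4.0, 8.0]) / 64.0  # assumed frequencies
    feats = np.concatenate([np.sin(t[:, None] * freqs),
                            np.cos(t[:, None] * freqs)], axis=1)  # (n, 8)
    hidden = np.sin(feats @ W1 + b1)                              # (n, hidden)
    return (hidden @ W2)[:, 0] * np.exp(-decay * t)               # (n,)

def fft_causal_conv(x, h):
    """Causal convolution in O(n log n): pad, multiply spectra, invert, crop."""
    n = len(x)
    size = 2 * n  # zero-padding prevents circular wrap-around
    y = np.fft.irfft(np.fft.rfft(x, size) * np.fft.rfft(h, size), size)
    return y[:n]

def hyena_like_operator(u, filters, proj):
    """Interleave long convolution and gating, once per filter.

    proj holds one hypothetical scalar for the value stream plus one per
    gate; a real layer would use learned linear projections instead.
    """
    z = proj[0] * u
    for scale, h in zip(proj[1:], filters):
        gate = scale * u                   # gate derived from the input
        z = gate * fft_causal_conv(z, h)   # long conv, then element-wise gate
    return z

rng = np.random.default_rng(0)
order = 2
params = [make_filter_params(rng) for _ in range(order)]
proj = rng.standard_normal(order + 1)

for n in (64, 1024):  # same parameters, two sequence lengths
    u = rng.standard_normal(n)
    filters = [implicit_filter(n, p) for p in params]
    y = hyena_like_operator(u, filters, proj)
    n_params = sum(w.size for w in params[0])
    print("length", n, "output shape", y.shape, "filter params", n_params)
```

Note that the filter parameter count does not change between the two lengths: only the number of positions fed to the small network does. Because this sketch feeds absolute positions, the first 64 taps of the length-1024 filter coincide with the length-64 filter; that is a property of this toy, not a claim about how Hyena normalizes positions.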

A decision guide

Need a parallel, sub-quadratic attention alternative?

Hyena (or an SSM) — long convolution, n log n.

↓

Very long sequences (genomics, audio)?

Hyena shines — implicit filters generalize across lengths.

↓

Need strong content-addressed recall?

Consider data-dependent SSMs (Mamba) or hybrids with attention.

↓

Want to understand the whole family?

Learn the recurrence = convolution duality — it ties them all together.

Where this connects

SSM / Mamba — the recurrence-side view of the same long convolution; the duality made explicit.
Linear Attention & RWKV & RetNet — recurrent cousins; RetNet's decay is literally the convolution filter Hyena would use.
Attention Variants — the quadratic mechanism Hyena replaces; FlashAttention attacks memory, Hyena attacks the all-pairs compute.
Vision / CNNs — Hyena generalizes the convolution from short local filters to long global ones.
Transformer — Hyena swaps attention for the long-conv operator, keeping the block scaffolding.
Positional Encoding — the implicit filter is generated from positional encodings (sinusoids → MLP).

The one thing to remember. Hyena's lesson is that you don't need attention — or even recurrence — to mix tokens globally. A single convolution with a filter as long as the sequence does it, and three tricks make it practical: generate the filter implicitly (cheap, any length), compute it with the FFT (sub-quadratic), and gate it by the input (content-adaptive). And the deepest takeaway is the duality — a long convolution and a linear recurrence are the same computation, which is why the entire sub-quadratic family, from Hyena to Mamba to RetNet, is really one idea wearing different clothes.

You need a parallel, sub-quadratic model for million-base-pair DNA sequences (very long, length varies a lot). Why is Hyena a strong fit?

Because attention is cheap at that length
Because it uses short local filters like a CNN
Its FFT long-convolution is sub-quadratic (n log n) for very long sequences, and its implicit filter handles any length without adding parameters, so it generalizes across the varying, very long inputs cheaply

“To mix everything with everything, you need not compare every pair — sometimes one long, well-shaped sweep will do.”

Hyena & Long Convolutions