A reader's guide to every major attention mechanism in modern deep learning — from the original Transformer to Flash Attention 4 — explained as a book.
Seven Families · Fifty Mechanisms · One Decade
Preface
How to read this book
This is not a survey paper and it is not a tutorial. It is closer to a field guide — the kind of book a naturalist takes into the forest. Each entry is a single attention mechanism. Each entry has the same four parts so that you can compare them honestly, without having to learn a new vocabulary every time.
The parts are: In Brief, where the mechanism is described in two or three plain sentences, the way you might describe it to a colleague over coffee. The Mechanism, where the actual math or algorithm appears, kept as simple as honesty allows. Why It Matters, which places the work in context — what problem it solved, what door it opened. And Trade-offs, because every mechanism is a deal with the devil somewhere, and the question is only which devil.
The mechanisms are grouped into seven families. The families are not watertight — Mamba is a state-space model that turns out to be a linear attention; MLA is a softmax attention with a compression trick; some hybrids belong in two places at once. The taxonomy is a reading aid, not a truth. Treat it that way.
The book is meant to be read in order, but every entry stands on its own. Skip what you know. Linger on what you don't.
The mechanisms in this family compute exactly the same thing as the original Transformer. They are not approximations. They are software engineering — clever uses of memory hierarchy, tiling, and parallelism — that made the impossible merely expensive.
01
Vanilla Attention
Vaswani, Shazeer, Parmar et al. · "Attention is All You Need" · 2017
In Brief
The original. Every token looks at every other token, computes how much it cares (a scalar), and uses those weights to take a weighted average of value vectors. It is brutally simple, brutally expressive, and brutally expensive. Everything in this book is a reaction to it.
The Mechanism
You take three matrices derived from the input: queries Q, keys K, and values V. Each row of Q is a token asking a question. Each row of K is a token offering an answer. The dot product between them, scaled, softmaxed, and used to weight V, gives the output.
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V
The matrix inside the softmax is N × N — every token by every other token. This is where all the trouble starts. For a million tokens, that is a trillion entries.
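The equation is short enough to write out directly. What follows is a minimal NumPy sketch for a single head with no batch dimension and no masking; the function names and toy shapes are illustrative choices, not anything taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, materialising the full N x N matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (N, N): every token against every token
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, weights = vanilla_attention(Q, K, V)
print(out.shape, weights.shape)
```

The `scores` array is the N × N object that the rest of the book tries to avoid, shrink, or approximate.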
Why It Matters
Before 2017, sequence modeling meant recurrence: you read the sequence one token at a time and hoped the hidden state remembered enough. Attention threw that out. Every position could see every other position in a single layer, in parallel, on a GPU. It changed not just NLP but vision, audio, biology, and code. Most of modern AI is downstream of this one equation.
Trade-offs
Quadratic. Memory and compute scale as N². At 1,000 tokens this is fine. At 100,000 it is painful. At 1,000,000 it is impossible without help. The rest of this book is, in one way or another, that help.
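To put numbers on those three sequence lengths, here is the back-of-envelope arithmetic for one score matrix. It assumes 2 bytes per entry (fp16 or bf16), one head, and one sequence; those are assumptions made for illustration, and real models multiply the figure by heads and batch size.

```python
def attention_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Bytes needed to store one N x N score matrix (one head, one sequence)."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (1_000, 100_000, 1_000_000):
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"N = {n:>9,}: {gigabytes:,.3f} GB")
```

Under these assumptions the matrix takes 2 MB at 1,000 tokens, 20 GB at 100,000, and 2 TB at 1,000,000.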
· · ·
02
Scaled Dot-Product Attention (SDPA)
PyTorch core · introduced PyTorch 2.0 · 2023
In Brief
Not an algorithm — a doorway. torch.nn.functional.scaled_dot_product_attention is PyTorch's unified call that hides which kernel actually runs underneath. You write one line; PyTorch picks Flash, memory-efficient, or naive depending on your inputs and hardware.
The Mechanism
SDPA is a dispatcher. Given Q, K, V, an optional mask, and flags for causal or dropout, it inspects your tensors and picks the fastest available backend: Flash Attention if the shapes and dtype support it, the memory-efficient backend if not, and finally a plain math implementation as fallback. The math it computes is identical to vanilla attention.
Why It Matters
SDPA matters because it is the layer between researchers and the kernel zoo below. Before it, using Flash Attention meant installing a separate library, matching CUDA versions, and rewriting your model. After it, you write one PyTorch call and inherit whatever the framework currently knows is fastest. It is the API that made Flash Attention universal without anyone noticing.
Trade-offs
You give up explicit control. The dispatcher's heuristics are good but not infallible — sometimes you would have preferred a different backend. For most users this is invisible. For kernel engineers it is the source of late-night debugging.
· · ·
03
Memory-Efficient Attention
Rabe and Staats · "Self-attention Does Not Need O(n²) Memory" · 2021
In Brief
The first proof that you do not need to store the attention matrix to compute attention. You can stream through it. The paper is short, almost playful, and it set up everything that Flash Attention later industrialised.
The Mechanism
The trick is to compute the softmax incrementally. As you scan through the keys, you keep a running maximum (for numerical stability) and a running normaliser. You never materialise the full N × N matrix; you only ever hold a strip of it at a time. The backward pass recomputes what it needs rather than caching it.
memory: O(log N) in theory, O(√N) in the paper's practical implementation · compute: O(N²), but streamable
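The running maximum and running normaliser are easier to trust once you see them agree with the ordinary softmax. The sketch below handles a single query and scans the keys in strips; the chunk size and function names are illustrative assumptions, and it is not the paper's JAX code.

```python
import numpy as np

def streaming_attention(q, K, V, chunk=4):
    """Attention output for one query, scanning keys and values strip by strip."""
    d = q.shape[0]
    m = -np.inf                 # running maximum of the scores seen so far
    l = 0.0                     # running normaliser: sum of exp(score - m)
    acc = np.zeros(V.shape[1])  # running weighted sum of value rows
    for start in range(0, K.shape[0], chunk):
        s = K[start:start + chunk] @ q / np.sqrt(d)  # scores for this strip only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # re-express old statistics under the new maximum
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + chunk]
        m = m_new
    return acc / l

def reference_attention(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
K, V = rng.standard_normal((10, 4)), rng.standard_normal((10, 3))
q = rng.standard_normal(4)
print(np.allclose(streaming_attention(q, K, V), reference_attention(q, K, V)))
```

Only one strip of scores exists at any moment, yet the result matches the version that holds them all.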
Why It Matters
It demolished a piece of folk wisdom — that attention's quadratic memory was inherent. It was not. It was a choice of how to organise the computation. The paper is the conceptual ancestor of every IO-aware kernel that followed.
Trade-offs
The original reference implementation was written in JAX rather than as a hand-tuned GPU kernel, and it was generally no faster than standard attention despite using less memory. The compute was still O(N²), and recomputing in the backward pass instead of caching added more. Flash Attention would later fix the speed problem too.
· · ·
04
Flash Attention
Tri Dao et al. · "Fast and Memory-Efficient Exact Attention with IO-Awareness" · 2022
In Brief
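Take Rabe and Staats's idea and engineer it for the GPU memory hierarchy. Block the computation so that data stays in fast on-chip SRAM as long as possible, and write back to slow HBM only when forced to. Result: same output, memory linear in sequence length, and speedups the paper reports as up to roughly 7× on the attention computation itself (end-to-end training gains are smaller).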
The Mechanism
The GPU has two kinds of memory: SRAM (tiny, near the cores, fast) and HBM (large, far, slow). Standard attention writes and reads the full N × N matrix to HBM. Flash Attention tiles the computation into blocks small enough to fit in SRAM. For each query block and key block, it loads, computes a partial softmax inside SRAM, updates running statistics, and only writes the final output to HBM. The matrix is never assembled.
Mathematically identical to vanilla. Numerically nearly so — only floating-point reordering differs.
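The tiling can be imitated in NumPy, with small array slices standing in for the blocks that would sit in SRAM. This is a sketch of the idea only: block sizes are arbitrary, there is no real memory hierarchy, and the loop order (query blocks outside, key blocks inside) is chosen for readability rather than to mirror any particular kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=4, block_k=4):
    """Exact attention computed tile by tile; no N x N array is ever allocated."""
    N, d = Q.shape
    O = np.zeros((N, V.shape[1]))
    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q]
        m = np.full(q.shape[0], -np.inf)   # per-row running maximum
        l = np.zeros(q.shape[0])           # per-row running normaliser
        acc = np.zeros((q.shape[0], V.shape[1]))
        for ks in range(0, K.shape[0], block_k):
            s = q @ K[ks:ks + block_k].T / np.sqrt(d)  # one small tile of scores
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]  # only the final output is written back
    return O

def dense_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), dense_attention(Q, K, V)))
```

The largest intermediate is a `block_q` × `block_k` tile, which is the whole point.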
Why It Matters
This was the moment long-context transformers became practical. Sequences of 4K, then 16K, then 64K became routine. The paper also reframed the field: attention's bottleneck was not FLOPs, it was memory bandwidth. Everyone started thinking in those terms.
Trade-offs
Compute is still O(N²); only memory dropped to O(N). So the FLOP cost of long sequences remains. Implementation is CUDA-specific and tied to particular GPU generations.
· · ·
05
Flash Attention 2
Tri Dao · 2023
In Brief
Flash Attention again, with the parallelism rewritten. Same math, roughly twice as fast. The headline number is that it reaches up to around 70% of theoretical peak FLOPs on an A100 for attention, close to what cuBLAS does for dense matrix multiplication.
The Mechanism
The original Flash Attention parallelised across batch and heads but processed each sequence somewhat serially. Flash Attention 2 also parallelises across the sequence dimension itself. It swaps the loop order so that the outer loop runs over query blocks, which are independent and can be handed to separate thread blocks, while the inner loop sweeps the key and value blocks. It also reduces the number of non-matmul instructions (which run on different units than matmul on modern GPUs) and splits queries rather than keys and values across warps, so warps no longer need to exchange partial results.
Why It Matters
Most production systems today — vLLM, SGLang, HuggingFace Transformers — call into Flash Attention 2 under the hood. It is the default. The 2× speedup compounded with the existing memory savings is what made 128K-context models feasible to serve at scale.
Trade-offs
Hopper (H100) and later get diminishing returns because they have features (TMA, warp specialisation, FP8) that Flash 2 does not use. That gap is what Flash 3 was built to close.
· · ·
06
Flash Attention 3
Shah, Dao, Bikshandi et al. · 2024
In Brief
Flash Attention rebuilt for the Hopper architecture (H100). Asynchronous, warp-specialised, FP8-capable. The paper reports roughly 1.5–2× over Flash Attention 2 on H100 in FP16, with FP8 raising throughput further, to close to 1.2 PFLOPs/s.
The Mechanism
Three changes do the work. First, the Tensor Memory Accelerator (TMA) handles data movement asynchronously, so warps doing matmul do not stall waiting for memory. Second, warps are split into producer and consumer roles — some load data, others compute — overlapping work that used to be serial. Third, FP8 is supported with a clever rescaling scheme that keeps accuracy acceptable for attention even at the lower precision.
Why It Matters
Frontier labs run on H100s; Flash 3 is what those clusters actually use. When Anthropic or OpenAI talks about long-context inference cost, the relevant kernel underneath is some descendant of Flash 3. It is the closest thing to a free lunch in modern deep learning hardware.
Trade-offs
Hopper-specific. On older GPUs (A100, A6000) it offers little or nothing — those users stay on Flash 2. FP8 attention requires care: not every workload tolerates the precision loss without quality regression.
· · ·
07
Flash Attention 4
Announced 2025 · Blackwell (B200/B300) optimised
In Brief
The next iteration, rebuilt for Blackwell GPUs. Public details are still partial as of early 2026, but the direction is clear: use Blackwell's fifth-generation Tensor Cores, larger SRAM, and improved FP4 / FP6 support to push attention throughput further. Another generational step rather than a paradigm shift.
The Mechanism
Same recipe — IO-aware tiling, async data movement, warp specialisation — re-tuned for new hardware. Blackwell exposes more SRAM per SM, faster inter-SM communication, and native support for very low-precision formats. Flash 4 leans on all of these. It also targets training, where reduced precision is harder to make work cleanly.
Why It Matters
Each Flash generation has roughly doubled effective attention throughput. That compounding is what makes longer contexts cheaper every year despite quadratic compute. Without Flash, a million-token context window would still be a research curiosity.
Trade-offs
Hardware-bound — runs only on Blackwell. The vast majority of GPUs in the world are not Blackwell. For most practitioners this is something to look forward to, not something to use today.
· · ·
08
xFormers Memory-Efficient Attention
Meta AI · open-sourced 2021, ongoing
In Brief
Meta's library of attention kernels. Predates Flash in some ways, ships several backends, and remains the production choice for many diffusion model pipelines. Less famous than Flash, equally pragmatic.
The Mechanism
xFormers is less a single algorithm and more a toolbox. It includes a memory-efficient attention based on the Rabe-Staats principle (with CUDA optimisations), block-sparse attention kernels, several positional bias schemes, and utilities for attention with arbitrary masks. The library auto-selects implementations based on tensor shapes and hardware.
Why It Matters
Stable Diffusion ran on xFormers for a long time. Many image and video generation pipelines still do. In the LLM world, Flash dominates; in the diffusion world, xFormers had earlier mind-share and kept it. The two libraries now overlap significantly.
Trade-offs
Less aggressive on the latest hardware than Flash 3 and 4. The codebase is broader, which means more features but slower release cadence on the cutting edge.
· · ·
09
FlexAttention
PyTorch 2.5+ · Horace He et al. · 2024
In Brief
You describe the attention pattern you want — causal, sliding window, ALiBi, document masking, anything — as a small Python function. PyTorch compiles it down to a Flash-style kernel. Custom attention without writing CUDA.
The Mechanism
FlexAttention exposes two user-defined functions: a score_mod that transforms the attention score for any pair of positions, and a mask_mod that decides whether a pair is attended to at all. PyTorch's compiler (TorchInductor) lowers these into a fused kernel that behaves like Flash Attention but with your custom logic baked in. Sparse masks become genuinely sparse kernels — they skip masked-out blocks entirely.
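What the two functions mean is easiest to see in a dense reference version. The sketch below is a NumPy emulation of the semantics, not the PyTorch API: the real callbacks also receive batch and head indices, which are dropped here, and `reference_flex_attention` is a hypothetical name. It builds the full score matrix, which is exactly what the compiled kernel avoids.

```python
import numpy as np

def sliding_window_mask(window):
    # mask_mod: True where the (query, key) pair may be attended to.
    def mask_mod(q_idx, kv_idx):
        return (q_idx >= kv_idx) & (q_idx - kv_idx < window)
    return mask_mod

def alibi_bias(slope):
    # score_mod: transform the raw score using the two positions.
    def score_mod(score, q_idx, kv_idx):
        return score - slope * (q_idx - kv_idx)
    return score_mod

def reference_flex_attention(Q, K, V, score_mod=None, mask_mod=None):
    """Dense emulation: apply score_mod, then mask_mod, then softmax."""
    N, d = Q.shape
    q_idx = np.arange(N)[:, None]
    kv_idx = np.arange(K.shape[0])[None, :]
    scores = Q @ K.T / np.sqrt(d)
    if score_mod is not None:
        scores = score_mod(scores, q_idx, kv_idx)
    if mask_mod is not None:
        scores = np.where(mask_mod(q_idx, kv_idx), scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out, w = reference_flex_attention(
    Q, K, V, score_mod=alibi_bias(0.5), mask_mod=sliding_window_mask(3)
)
print(np.round(w[5], 3))
```

Row 5 of `w` has non-zero weight only on positions 3, 4, and 5; a sparse kernel would never touch the other blocks at all.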
Why It Matters
Before FlexAttention, every new attention variant required someone to write CUDA. Custom positional schemes, document-level causal masks, sliding windows with different sizes per layer — all needed bespoke kernels. FlexAttention collapses that work into a few lines of Python. It is, in effect, a compiler for attention.
Trade-offs
Compilation is not always perfect; exotic patterns sometimes fall back to slower paths. Performance on the simplest cases is close to but slightly behind hand-tuned Flash Attention. The trade is generality for a small efficiency gap, which is usually worth it.
II
Family Two · Nine Mechanisms
Sparse and Pattern-Based
If full attention is too expensive, the obvious move is to attend less. The mechanisms in this family pick which token-pairs to compute and ignore the rest — sometimes with fixed patterns, sometimes with learned routing, always with the same gamble: that the discarded pairs did not matter.
10
Sparse Transformer
Child, Gray, Radford, Sutskever · OpenAI · 2019
In Brief
The first serious attempt to attack quadratic attention. Instead of every token attending to every other token, you allow only structured subsets — strided patterns, fixed local windows. Used to train models on 16,384-token sequences in 2019, which felt impossible at the time.
The Mechanism
Two attention heads with different sparsity patterns. One head attends locally: token i attends to a fixed window of preceding tokens. Another head attends with a stride: token i attends to every k-th earlier token across the whole sequence. With both the window and the stride k set near √N, each token makes about 2√N connections, so together the heads cover the sequence with O(N√N) connections instead of O(N²). The patterns are fixed; the model does not learn them.
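The two patterns can be built as boolean masks and counted. The sketch assumes a causal setting with the window length equal to the stride, both set to √N; the function name and sizes are illustrative. It also checks a useful property: any earlier position is reachable in two hops, one local and one strided.

```python
import numpy as np

def sparse_transformer_masks(N, stride):
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)           # the previous `stride` positions
    strided = causal & ((i - j) % stride == 0)  # every stride-th earlier position
    return local, strided, causal

N = 256
stride = int(np.sqrt(N))  # 16
local, strided, causal = sparse_transformer_masks(N, stride)

connections = int((local | strided).sum())
dense_connections = int(causal.sum())
# Two-hop reachability: a local step followed by a strided step.
two_hop = (local.astype(float) @ strided.astype(float)) > 0
print(connections, dense_connections, bool(np.array_equal(two_hop, causal)))
```

Far fewer pairs are computed, yet stacking the two heads still lets information from any earlier token arrive.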
Why It Matters
It established the template that every later sparse method copied: combine local with some-kind-of-global, then show you can approximate full attention. The paper also trained directly on raw pixels of CIFAR-10 and downsampled ImageNet, foreshadowing Vision Transformers.
Trade-offs
The patterns are heuristic. Some tasks need attention between specific distant pairs that the fixed stride happens to miss. And implementing fast sparse kernels on GPUs was, in 2019, very hard — much of the speedup theory did not translate to wall-clock improvements.
· · ·
11
Longformer
Beltagy, Peters, Cohan · Allen AI · 2020
In Brief
Sliding window attention plus a small set of "global" tokens that everyone attends to. Linear in sequence length for the local part. The global tokens act as information hubs, like the bulletin board in an office where anyone can post and anyone can read.
The Mechanism
Most tokens attend only to a fixed window of their neighbours, say 512 tokens on each side. A few designated tokens — often the [CLS] token, or question tokens in a QA setup — attend to everything and are attended to by everything. The cost is O(N · w) for the local part plus O(N · g) for the g global tokens, which together remain linear in N.
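A small mask makes the linear scaling visible. This sketch uses a symmetric window of 16 on each side and a single global token at position 0; both are arbitrary illustrative choices, and `longformer_mask` is a hypothetical helper rather than the library's implementation.

```python
import numpy as np

def longformer_mask(N, window, global_idx):
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    mask = np.abs(i - j) <= window  # local band around the diagonal
    mask[global_idx, :] = True      # global tokens attend to everything
    mask[:, global_idx] = True      # and everything attends to them
    return mask

counts = {}
for N in (512, 1024, 2048):
    counts[N] = int(longformer_mask(N, window=16, global_idx=[0]).sum())
    print(N, counts[N], N * N)
```

Doubling N roughly doubles the number of attended pairs, while the dense count quadruples.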
Why It Matters
Longformer was among the first practical long-document models. Researchers used it to process entire academic papers, long contracts, multi-page chats. The global token trick proved durable: many later sparse methods reinvent it under different names.
Trade-offs
You must decide ahead of time which tokens are global. Get it wrong and the model cannot route the information it needs. The local window is a hyperparameter that does not adapt to content.
· · ·
12
BigBird
Zaheer et al. · Google · 2020
In Brief
Longformer plus randomness. Each token attends to its local window, to a few global tokens, and to a few random tokens. The authors prove that this combination is theoretically as expressive as full attention — it can approximate any sequence-to-sequence function.
The Mechanism
Three kinds of connections. Local: a sliding window. Global: a small set of always-on tokens. Random: for each token, sample a few other tokens uniformly at random. The random connections give the sparse graph small-world properties — short paths between any pair of nodes — which is what powers the theoretical universality result.
Why It Matters
BigBird made sparse attention respectable to theorists. The randomness was an interesting move: a way to recover the expressivity of full attention without paying for it. In practice the random part is often less impactful than the global and local parts, but the framing influenced everything that came after.
Trade-offs
The proofs are asymptotic. In practice, BigBird does well on long documents but is not noticeably better than well-tuned Longformer. And random patterns are hostile to GPU memory access; getting them fast required custom kernels.
· · ·
13
Reformer
Kitaev, Kaiser, Levskaya · Google · 2020
In Brief
Instead of attending to every key, attend only to keys that are likely to give a high score. "Likely" is determined by locality-sensitive hashing — keys that hash to the same bucket are probably similar, so they are probably the ones you would attend to anyway.
The Mechanism
Hash each query and key with a random projection-based LSH scheme. Group together those that hash to the same bucket. Compute attention only within buckets. Repeat with several different hash functions and combine. The complexity drops to O(N log N). The chunking by hash is the heart of it; the rest is engineering.
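One hash round can be sketched in a few lines. The version below uses a random-projection angular hash and shared query/key vectors, and attends only inside each bucket. It leaves out the sorting and chunking, the multiple hash rounds, and causal masking, so treat it as an illustration of the bucketing idea under those simplifications.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions, take argmax over [xR, -xR]."""
    R = rng.standard_normal((x.shape[1], n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)

def bucketed_attention(X, V, n_buckets, rng):
    """One hash round with shared queries and keys: attend only within a bucket."""
    d = X.shape[1]
    buckets = lsh_buckets(X, n_buckets, rng)
    out = np.zeros_like(V)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        s = X[idx] @ X[idx].T / np.sqrt(d)
        s = s - s.max(axis=1, keepdims=True)
        w = np.exp(s)
        w = w / w.sum(axis=1, keepdims=True)
        out[idx] = w @ V[idx]
    return out, buckets

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))
V = rng.standard_normal((32, 8))
out, buckets = bucketed_attention(X, V, n_buckets=8, rng=rng)
print(np.bincount(buckets, minlength=8))
```

The hash depends only on direction, so vectors pointing the same way always share a bucket, whatever their length.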
Why It Matters
Reformer was beautiful because it tied attention to a classical algorithm (LSH) with decades of theory. It also introduced reversible residual layers, which let the model train on much longer sequences by trading memory for recomputation. Both ideas filtered into later work.
Trade-offs
The hashing is not free. Reformer often ended up similar in wall-clock time to vanilla attention at moderate lengths because the hashing and bucketing overhead ate the asymptotic gains. It remains a beautiful idea that other methods quietly outran.
· · ·
14
Routing Transformer
Roy, Saffar, Vaswani, Grangier · 2021
In Brief
Like Reformer but with k-means instead of LSH. Cluster the queries and keys with online k-means; only attend within clusters. The clustering is learned during training, so the routing adapts to the data.
The Mechanism
Maintain a small set of centroids. At each layer, assign every query and every key to its nearest centroid. Attention is computed only between queries and keys assigned to the same centroid. The centroids are updated by exponential moving average during training, drifting towards the centres of the natural clusters in the data.
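The three moving parts (assign, mask, update) fit in a short sketch. It uses plain squared Euclidean distance and a deliberately large update step so the effect is visible; the distance measure, the decay value, and the helper names are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def assign_to_centroids(x, centroids):
    """Index of the nearest centroid for each row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def routing_mask(q_assign, k_assign):
    """True where a query and a key landed in the same cluster."""
    return q_assign[:, None] == k_assign[None, :]

def ema_update(centroids, x, assignment, decay):
    """Move each centroid towards the mean of the vectors assigned to it."""
    new = centroids.copy()
    for c in range(len(centroids)):
        members = x[assignment == c]
        if len(members) > 0:
            new[c] = decay * centroids[c] + (1.0 - decay) * members.mean(axis=0)
    return new

x = np.array([[5.0, 5.0], [5.0, 6.0], [-5.0, -5.0], [-6.0, -5.0]])
centroids = np.array([[1.0, 1.0], [-1.0, -1.0]])
assignment = assign_to_centroids(x, centroids)
mask = routing_mask(assignment, assignment)
centroids_next = ema_update(centroids, x, assignment, decay=0.5)
print(assignment, centroids_next)
```

In training the decay would sit much closer to 1, so the centroids drift slowly instead of jumping.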
Why It Matters
Routing Transformer formalised an intuition: similar queries want similar keys, so route them together. The same idea reappears in mixture-of-experts, in retrieval-augmented attention, in modern sparse methods like MoBA. It was an early instance of a now-pervasive pattern.
Trade-offs
Clustering is unstable during early training; sometimes centroids collapse onto one another. The method needs careful regularisation. And like Reformer, the wall-clock advantage was often smaller than the asymptotic story suggested.
· · ·
15
Sliding Window Attention
Popularised by Mistral 7B · 2023
In Brief
The simplest possible sparse pattern: each token attends only to the previous w tokens, nothing more. Mistral 7B used w = 4096 and showed that, with deep stacking, information still propagates through the whole context — just indirectly.
The Mechanism
At layer 1, the token at position i attends to positions i - w through i. At layer 2, those tokens have themselves already attended to a window before them. So by layer L, information from position i - L · w can reach position i. The receptive field grows linearly with depth, the same way it does in a stack of ordinary (undilated) convolutions. With Mistral 7B's 32 layers and w = 4096, that is a theoretical span of about 131K tokens.
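The layer-by-layer growth can be checked by composing the mask with itself. The sketch uses a toy sequence of 64 tokens and w = 4; the helper names are illustrative. It measures how far back the last token can see after one, two, and three layers.

```python
import numpy as np

def sliding_window_mask(N, w):
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    return (j <= i) & (i - j <= w)  # positions i - w through i

def reach_after(mask, n_layers):
    """Which positions can influence each token after n_layers of attention."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(n_layers):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    return reach

N, w = 64, 4
mask = sliding_window_mask(N, w)
spans = {}
for L in (1, 2, 3):
    earliest = int(np.argmax(reach_after(mask, L)[N - 1]))  # first reachable index
    spans[L] = (N - 1) - earliest
    print(L, spans[L])

theoretical_span = 4096 * 32  # window times depth, using Mistral 7B's 32 layers
```

The span is exactly L · w at each depth, and the same product for w = 4096 and 32 layers gives 131,072.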
Why It Matters
Mistral 7B's strong results showed that you do not need every layer to be full-attention. Stacking is a kind of long-range communication. The KV cache also shrinks dramatically because you only keep the last w tokens — a significant deployment win.
Trade-offs
Information has to travel layer by layer. A long-range dependency needs a deep network to be representable. Tasks that require attending precisely to a token far away in a shallow layer (some retrieval problems) suffer.
· · ·
16
StreamingLLM
Xiao, Tian, Chen, Han, Lewis · MIT · 2024
In Brief
Sliding window attention, but with a twist: keep the first few tokens around forever. Those "attention sinks" stabilise the softmax and let the model stream indefinitely without quality collapse. A small fix to a real practical problem.
The Mechanism
Without sinks, pure sliding window models degrade catastrophically when the context grows past their training length. The authors identify the cause: softmax must put its probability mass somewhere, and when there is no semantically important token to attend to, it dumps mass on whatever is left — which used to be the early tokens. Drop those, and the softmax has nowhere to dump, and the values blow up. StreamingLLM keeps the first 4 tokens permanently in the KV cache plus a sliding window of recent tokens. Softmax is happy. The model streams.
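The cache policy itself is tiny. Here is a sketch of which token positions survive in the KV cache after t tokens have been seen, using 4 sink tokens and an illustrative window of 8; the function name is hypothetical, and the paper's handling of positional encodings inside the cache is left out.

```python
def streaming_cache_positions(t, n_sink=4, window=8):
    """Token positions kept in the KV cache after t tokens have been seen."""
    sinks = list(range(min(n_sink, t)))  # the first few tokens, never evicted
    recent_start = max(n_sink, t - window)
    recent = list(range(recent_start, t))  # the sliding window of recent tokens
    return sinks + recent

for t in (3, 10, 100):
    print(t, streaming_cache_positions(t))
```

However long the stream runs, the cache never holds more than `n_sink + window` entries, and positions 0 to 3 are always among them.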
Why It Matters
It explained a previously mysterious failure mode (long-stream collapse) and offered a one-line fix. Production chat systems that need to handle very long conversations adopted the idea quickly. The "attention sink" terminology entered the field's vocabulary.
Trade-offs
You cannot use information from the middle of the stream beyond what is currently in the window. The model has effective short-term memory only. For tasks needing genuine recall over long history this is the wrong tool.
· · ·
17
Native Sparse Attention (NSA)
DeepSeek · 2025
In Brief
Sparse attention that the model learns and that is genuinely fast on hardware. Three parallel branches — coarse summary, selective fine-grained, sliding window — each contributing to a unified output. Trained from scratch, not retrofitted.
The Mechanism
For each query, three things happen in parallel. First, a compressed branch: chunks of keys and values are pooled into block representations, and the query attends to those summaries. Second, a selective branch: based on the compressed scores, the top-k blocks are chosen for full fine-grained attention. Third, a sliding window for recent tokens. The three outputs are combined with learned gates. Crucially, the kernel is co-designed with the algorithm: blocks are sized to match GPU tile shapes, so the sparsity is real on hardware.
Why It Matters
NSA is the first sparse attention that is both genuinely trainable end-to-end and genuinely fast at training time. Many earlier sparse methods either were not trainable from scratch or had implementations that did not actually run faster. NSA reports both quality matching full attention and substantial wall-clock speedups, which is unusual.
Trade-offs
The architecture has several hyperparameters (block sizes, top-k value, gating). The top-k selection introduces some non-differentiability that requires care during training. And the empirical claim that it matches dense attention is true on certain benchmarks; whether it holds for tasks requiring genuinely dense attention is an open question, and the answer is likely no.
· · ·
18
MoBA — Mixture of Block Attention
Moonshot AI · 2025
In Brief
Treat blocks of keys as experts. For each query, route to a small set of blocks and attend within them. Mixture-of-experts logic applied to attention rather than feedforward layers. Powers Kimi's long-context system.
The Mechanism
Partition the key sequence into fixed-size blocks. For each query, compute a routing score per block (using a cheap summary of the block, such as the mean of its keys). Pick the top-k blocks and perform standard softmax attention against them. The unchosen blocks contribute nothing. The top-k choice itself is a hard selection; gradients flow through the attention over the chosen blocks, not through the discrete choice.
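For a single query the routing looks like this. The sketch assumes mean-pooled keys as the block summary, ignores causal masking, and requires the sequence length to be a multiple of the block size; `moba_single_query` is an illustrative name, not Moonshot's code.

```python
import numpy as np

def moba_single_query(q, K, V, block_size, top_k):
    """Route one query to its top_k key blocks, then attend within them."""
    d = q.shape[0]
    n_blocks = K.shape[0] // block_size
    Kb = K.reshape(n_blocks, block_size, d)
    Vb = V.reshape(n_blocks, block_size, V.shape[1])
    block_scores = Kb.mean(axis=1) @ q  # one cheap score per block
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    k_sel = Kb[chosen].reshape(-1, d)
    v_sel = Vb[chosen].reshape(-1, V.shape[1])
    s = k_sel @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w @ v_sel, chosen

rng = np.random.default_rng(0)
K, V = rng.standard_normal((16, 4)), rng.standard_normal((16, 3))
q = rng.standard_normal(4)

sparse_out, chosen = moba_single_query(q, K, V, block_size=4, top_k=2)
full_out, _ = moba_single_query(q, K, V, block_size=4, top_k=4)
print(chosen)
```

With `top_k` equal to the number of blocks, nothing is dropped and the result is ordinary full attention.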
Why It Matters
MoBA is the visible engine behind Kimi-K1.5 and related Moonshot systems. It scales to very long contexts (millions of tokens) with workable cost, and the routing is content-aware in a way that fixed patterns are not. The framing — attention as a sparse mixture — is a clean abstraction that other groups are now adopting.
Trade-offs
The router is a small extra cost, and block-level granularity means you may pull in tokens you did not want or miss tokens you did. Tuning block size and top-k matters more than for dense methods. Like NSA, it shines on retrieval-style tasks and is less clearly better on dense reasoning.
III
Family Three · Twelve Mechanisms
Linear Attention
A different strategy: not sparsifying, but rewriting. Replace softmax with a kernel that lets the matrix products be regrouped, using associativity. The cost drops from N² to N. The price is a representational ceiling that researchers have been chipping at for five years.
19
Linear Transformer
Katharopoulos, Vyas, Pappas, Fleuret · 2020
In Brief
The original linear attention. Replace the softmax in attention with a non-negative feature map applied to Q and K separately. Then exploit the associativity of matrix multiplication to compute K times V first, before involving Q. The complexity drops to O(N).
The Mechanism
Vanilla attention computes softmax(QKᵀ) V. The softmax is what couples queries to keys non-linearly and is what forces you to materialise the N × N matrix. Replace it with a kernel φ(Q) φ(K)ᵀ where φ is, say, elu(x) + 1 applied element-wise. Now the expression becomes φ(Q) (φ(K)ᵀ V), divided row-wise by the normaliser φ(Q) (φ(K)ᵀ 1), and the parenthesised product is a d × d matrix whose size is independent of N.
softmax(QKᵀ) V → φ(Q) (φ(K)ᵀ V)
cost: O(N² · d) → O(N · d²)
The right-hand side, for causal attention, also admits an exact recurrent form: maintain a running state S that accumulates φ(K_i) ⊗ V_i. Each new token updates S in constant time. The Transformer becomes an RNN.
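The claim that the two forms agree is easy to test. The sketch below computes causal linear attention twice: once the quadratic way, with an explicit masked N × N matrix, and once as a recurrence over a running state S and a running normaliser z. The function names are illustrative, and z is the accumulated sum of φ(K_i) used for the row normalisation.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: equals x + 1 for positive x and exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_quadratic(Q, K, V):
    A = np.tril(phi(Q) @ phi(K).T)  # explicit N x N matrix, causal
    return (A @ V) / A.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(Q, K, V):
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))  # running sum of outer products phi(K_i) (x) V_i
    z = np.zeros(d)        # running sum of phi(K_i), for the normaliser
    out = np.zeros((Q.shape[0], dv))
    for i in range(Q.shape[0]):
        k, q = phi(K[i]), phi(Q[i])
        S += np.outer(k, V[i])
        z += k
        out[i] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 4)) for _ in range(3))
quadratic = causal_linear_attention_quadratic(Q, K, V)
recurrent = causal_linear_attention_recurrent(Q, K, V)
print(np.allclose(quadratic, recurrent))
```

The recurrent version stores a d × d state and a length-d vector, no matter how long the sequence grows.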
Why It Matters
This paper rewrote what attention could be. It revealed that the softmax was not essential machinery — it was a particular choice. Drop it, and attention is a kernel method, with all the theory of kernel methods available. Every linear attention paper since 2020 is, in some sense, a refinement of this one equation.
Trade-offs
The original elu+1 feature map is a weak approximation of softmax. Softmax is sharply peaked — it can attend almost exclusively to a single token. elu+1 cannot. So linear attention with this kernel struggles on tasks that require precise retrieval. The gap between linear and softmax attention is what motivated every later refinement in this family.
· · ·
20
Performer / FAVOR+
Choromanski et al. · Google · 2020
In Brief
If linear attention's weakness is a bad feature map, build a better one. Performer uses random features to approximate the softmax kernel itself, with provable guarantees. Unlike classical random Fourier features, which use sines and cosines, its features are strictly positive. The approximation is unbiased: in expectation, it recovers the softmax kernel.
The Mechanism
The softmax can be written as a kernel exp(qᵀk). Performer constructs a feature map φ using random projections drawn from a specific distribution, such that E[φ(q)ᵀφ(k)] = exp(qᵀk). With enough random features, the approximation is tight. The method is called FAVOR+ — Fast Attention Via positive Orthogonal Random features.
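The unbiasedness can be checked numerically for one pair of vectors. The sketch uses the positive feature map φ(x) = exp(ωᵀx - ‖x‖²/2) / √m with ω drawn from a standard Gaussian. It draws the ω independently rather than making them orthogonal, so it illustrates the positive-feature idea and not the full FAVOR+ construction; the vectors and the feature count are arbitrary.

```python
import numpy as np

def positive_random_features(x, omega):
    """phi(x) = exp(omega x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    m = omega.shape[0]
    return np.exp(omega @ x - 0.5 * (x @ x)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 20_000
omega = rng.standard_normal((m, d))  # i.i.d. Gaussian rows, not orthogonalised

q = np.array([0.3, -0.2, 0.5, 0.1])
k = np.array([0.4, 0.1, -0.3, 0.2])

estimate = positive_random_features(q, omega) @ positive_random_features(k, omega)
exact = np.exp(q @ k)
print(estimate, exact)
```

With small-norm vectors the estimate lands close to exp(qᵀk). Larger norms, which correspond to sharper attention, make the variance grow quickly.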
Why It Matters
Performer was the first linear attention with a real theoretical argument for closing the gap to softmax. The randomness gives unbiasedness, the orthogonality controls variance, the positivity ensures stability. The paper showed strong results on Long Range Arena and on protein sequence modelling.
Trade-offs
The approximation has variance, and for sharp attention distributions (the ones that matter most for retrieval), the variance is high. In practice, Performer underperforms softmax on language modelling at scale. The theoretical elegance did not quite translate.
· · ·
21
Linformer
Wang, Li, Khabsa, Fang, Ma · Meta · 2020
In Brief
Different approach: project the length-N key and value sequences down to a fixed length k, then do attention against the projection. The N × N matrix becomes N × k, which is linear in N if k is treated as a constant.
The Mechanism
Multiply K and V on the left by learned projection matrices E and F of shape k × N. The resulting EK and FV have shape k × d. Compute softmax(Q (EK)ᵀ) (FV) as usual. Cost is O(N · k).
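The shapes tell most of the story. In this sketch E and F are random matrices standing in for the learned projections, and the sizes are arbitrary; only the shape bookkeeping is the point.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    d = Q.shape[1]
    K_proj = E @ K  # (k, d): the key sequence compressed along its length
    V_proj = F @ V  # (k, d): likewise for values
    scores = Q @ K_proj.T / np.sqrt(d)  # (N, k) instead of (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V_proj, w

rng = np.random.default_rng(0)
N, d, k = 64, 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
E = rng.standard_normal((k, N)) / np.sqrt(N)  # stand-in for a learned projection
F = rng.standard_normal((k, N)) / np.sqrt(N)
out, w = linformer_attention(Q, K, V, E, F)
print(w.shape, out.shape)
```

Note that E and F have N as one of their dimensions, so a different sequence length needs different matrices.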
Why It Matters
Linformer was a simple, working alternative to full attention that was easy to implement and ran fast. The conceptual argument was based on the observation that attention matrices in trained Transformers tend to be low-rank, so projecting to a low-rank subspace loses little.
Trade-offs
The projection matrices E and F have shape k × N, so the sequence length N is fixed at training time and the architecture commits to it. Different sequence lengths need different projections, which makes Linformer awkward to use at flexible context sizes. It also struggles in autoregressive (causal) settings, because the projection mixes information across all positions, future ones included; the standard formulation is most natural for encoders.
· · ·
22
cosFormer
Qin, Sun, Deng et al. · 2022
In Brief
A linear attention that bakes in locality. The feature map is ReLU(q) · cos and ReLU(k) · cos with phases that depend on token position. The cosine reweighting gives nearby tokens a stronger affinity than distant ones, which mirrors what softmax usually does in practice.
The Mechanism
Define φ(x, i) = ReLU(x) · [cos(πi/(2N)), sin(πi/(2N))]. Because cos(a - b) = cos a · cos b + sin a · sin b, the score between a query at position i and a key at position j picks up a factor of cos(π(i - j)/(2N)), which peaks when i = j and decays smoothly with distance. The position-dependent part factorises into a query term and a key term, so the output can still be computed via the right-product trick without forming QKᵀ.
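The factorisation can be verified directly. The sketch computes the unnormalised output two ways: with an explicit N × N matrix of cosine-reweighted scores, and with the cos and sin parts pushed through the right-product trick separately. Positions are indexed from 0 and the normaliser is omitted to keep the comparison simple; both are simplifications made here.

```python
import numpy as np

def cosformer_quadratic(Q, K, V):
    N = Q.shape[0]
    pos = np.arange(N)
    reweight = np.cos(np.pi * (pos[:, None] - pos[None, :]) / (2 * N))
    scores = (np.maximum(Q, 0) @ np.maximum(K, 0).T) * reweight  # explicit N x N
    return scores @ V

def cosformer_linear(Q, K, V):
    N = Q.shape[0]
    angle = (np.pi * np.arange(N) / (2 * N))[:, None]
    Qr, Kr = np.maximum(Q, 0), np.maximum(K, 0)
    cos_part = (Qr * np.cos(angle)) @ ((Kr * np.cos(angle)).T @ V)  # d x d inside
    sin_part = (Qr * np.sin(angle)) @ ((Kr * np.sin(angle)).T @ V)
    return cos_part + sin_part

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 4)) for _ in range(3))
print(np.allclose(cosformer_quadratic(Q, K, V), cosformer_linear(Q, K, V)))
```

The price of the positional prior is two right-products instead of one.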
Why It Matters
cosFormer narrowed the gap to softmax meaningfully on language tasks by adding a structural prior that matches how language actually works (nearby words matter more). It was an influence on later refinements that combined linear attention with positional biases.
Trade-offs
The locality bias is good for language but hurts on tasks where genuinely distant tokens carry the signal. And like all early linear methods, it still trailed softmax on the hardest retrieval tasks.
· · ·
23
Hedgehog
Zhang, Bhatia, Kumbong, Ré · Stanford · 2024
In Brief
The problem with linear attention is that the kernel is too smooth — softmax is spiky, and smooth kernels cannot mimic spikes. Hedgehog learns the feature map directly, trained to match the spikiness of softmax. A small MLP, doing what fixed feature maps could not.
The Mechanism
Instead of using elu+1 or random features, Hedgehog uses a small trainable MLP as the feature map, applied to Q and K (the paper's main variant is a single linear layer with an exponential activation). The MLP is trained, sometimes via distillation from a softmax teacher, to make φ(q)ᵀφ(k) closely approximate exp(qᵀk) in the regime that actually occurs in trained models. The right-product trick still applies; linearity in N is preserved.
Why It Matters
Hedgehog explicitly diagnosed the spikiness problem and proposed a learned solution. It substantially closed the quality gap to softmax on language modelling while remaining linear. The technique also enables linearising a pre-trained softmax model post hoc, by fitting Hedgehog feature maps to its attention patterns.
Trade-offs
An extra MLP per attention head adds parameters and a small per-token cost. The distillation step adds training complexity. And while the gap narrows, it does not always vanish — softmax remains stronger on the hardest tasks.
· · ·
24
RetNet — Retentive Network
Sun, Dong, Huang et al. · Microsoft · 2023
In Brief
A linear attention with a built-in exponential decay over position. The same model can be unfolded three ways — parallel for training, recurrent for inference, chunkwise for long context — and they are all mathematically equivalent. The "best of both worlds" pitch from RNNs and Transformers.
The Mechanism
Each head has a fixed decay rate γ. The retention operation is essentially linear attention where the key-value contribution from position j to position i is weighted by γ^(i-j). This decay makes the recurrent form numerically stable and gives a natural locality bias.
parallel: O = (Q · Kᵀ ⊙ D) · V, where D_ij = γ^(i-j) if i ≥ j, else 0 [training]
recurrent: sₙ = γ · sₙ₋₁ + kₙᵀ vₙ, oₙ = qₙ · sₙ [inference]
chunkwise: blocks of both [long sequences]
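The equivalence of the parallel and recurrent forms can be checked numerically. The following is a minimal sketch in plain Python, assuming a single head with row-vector queries and keys and a causal decay mask D_ij = γ^(i-j) for i ≥ j. It leaves out the normalisation and position rotation that a full RetNet layer applies, and the function names are illustrative only.

```python
def retention_parallel(Q, K, V, gamma):
    """O = (Q K^T * D) V with D[i][j] = gamma**(i - j) for i >= j, else 0."""
    n, d_v = len(Q), len(V[0])
    out = []
    for i in range(n):
        row = [0.0] * d_v
        for j in range(i + 1):  # causal: only positions j <= i contribute
            score = sum(q * k for q, k in zip(Q[i], K[j])) * gamma ** (i - j)
            for c in range(d_v):
                row[c] += score * V[j][c]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """s_n = gamma * s_{n-1} + k_n^T v_n, then o_n = q_n s_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of n
    out = []
    for q, k, v in zip(Q, K, V):
        for a in range(d_k):
            for c in range(d_v):
                S[a][c] = gamma * S[a][c] + k[a] * v[c]
        out.append([sum(q[a] * S[a][c] for a in range(d_k)) for c in range(d_v)])
    return out


Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]

par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
print(max_diff)  # the two forms agree up to floating-point rounding
```

The parallel form touches every (i, j) pair, which suits training on a GPU; the recurrent form carries only the d_k × d_v state, which suits token-by-token decoding.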
Why It Matters
RetNet was the first linear-attention architecture marketed as a serious successor to the Transformer. The triple-form equivalence is genuinely elegant: train it like a Transformer, deploy it like an RNN, scale it like a hybrid. It influenced the framing of many later models including Mamba-2 and GLA.
Trade-offs
The decay rates are fixed per head, not data-dependent. This is a strong inductive bias — it works well for language, where the locality assumption is mostly correct, and worse for tasks where the relevant token might be arbitrarily far. Later models (GLA, Gated DeltaNet) replaced the fixed decay with learned, data-dependent gating.
· · ·
25
RWKV
Peng et al. · open community · versions 4 through 7, 2022–2025
In Brief
An RNN that thinks it is a Transformer, or a Transformer that thinks it is an RNN. RWKV reformulates attention as a recurrent operation with channel-mixing and time-mixing layers. It has been trained at scale by an open community and is among the most widely deployed non-Transformer language models.
The Mechanism
Two operations per block. Time-mixing replaces attention: each channel maintains a running state that decays, with new contributions weighted by a learnable function. Channel-mixing replaces the feedforward block. The decay terms vary across channels and across versions of RWKV; later versions (v6, v7) add data-dependent decays and richer state updates, bringing it closer to general linear attention.
Why It Matters
RWKV trains in parallel like a Transformer and runs in constant memory per step like an RNN. The community around it has produced openly licensed models at meaningful scales and demonstrated that linear-recurrent architectures can hold their own. v7 in particular incorporated ideas from Mamba and DeltaNet and showed strong results on standard benchmarks.
Trade-offs
RWKV has accumulated complexity across its versions; understanding the current one requires reading several papers. Earlier versions were weaker than softmax on retrieval-heavy tasks. As the architecture has evolved, the gap has shrunk, but it remains a moving target.
· · ·
26
Mamba and Mamba-2
Gu, Dao · 2023, 2024
In Brief
Selective state space models. Mamba was originally framed as an alternative to attention; Mamba-2 then made the equivalence explicit: state-space duality means Mamba is, formally, a particular kind of linear attention. The model is fast, scales well, and matches softmax Transformers on many language benchmarks.
The Mechanism
A state-space model maintains a hidden state that evolves linearly with each input token: h_t = A · h_(t-1) + B · x_t, with output y_t = C · h_t. Mamba's contribution was to make these parameters input-dependent (B, C, and the step size that discretises A), so the model can choose what to remember based on what it sees. Mamba-2 then showed that, with the right structure on these matrices (specifically scalar-times-identity for A), the SSM is mathematically equivalent to a form of linear attention with a particular gating scheme. The team called this state-space duality (SSD).
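The duality can be illustrated on a toy example. The sketch below assumes one input channel, a per-step scalar decay a_t standing in for the scalar-times-identity A, and small vectors B_t and C_t. It compares the recurrence with an attention-style sum in which C plays the role of the query, B the key, x the value, and the product of decays the mask. All names are illustrative and this is not the Mamba-2 kernel.

```python
def ssm_recurrent(a, B, C, x):
    """h_t = a_t * h_{t-1} + B_t * x_t ; y_t = C_t . h_t (scalar input channel)."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return ys


def ssm_attention_form(a, B, C, x):
    """y_t = sum_{s<=t} (C_t . B_s) * (a_{s+1} * ... * a_t) * x_s."""
    ys = []
    for t in range(len(x)):
        y = 0.0
        for s in range(t + 1):
            decay = 1.0
            for r in range(s + 1, t + 1):  # cumulative decay between s and t
                decay *= a[r]
            score = sum(c * b for c, b in zip(C[t], B[s]))  # "query . key"
            y += score * decay * x[s]
        ys.append(y)
    return ys


a = [0.9, 0.5, 0.8]                        # input-dependent decays (made up)
B = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
C = [[1.0, 1.0], [0.0, 1.0], [2.0, -1.0]]
x = [1.0, 2.0, 3.0]

y_rec = ssm_recurrent(a, B, C, x)
y_att = ssm_attention_form(a, B, C, x)
ssd_gap = max(abs(p - q) for p, q in zip(y_rec, y_att))
print(ssd_gap)
```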
Why It Matters
Mamba reignited the entire field of linear-recurrent models. It scaled. It worked. It made non-Transformer architectures credible at frontier scales for the first time since LSTMs. Mamba-2's framing — SSD as a unifying lens — clarified that what looked like an architectural revolution was, in fact, an excellent point in the linear-attention design space.
Trade-offs
Pure Mamba models still trail softmax on certain copying and retrieval tasks where exact recall matters. This is why many production systems use hybrid Mamba-attention stacks. The hardware-aware kernel is tied to specific GPU generations and requires careful implementation.
· · ·
27
Gated Linear Attention (GLA)
Yang, Wang, Shen, Panda, Kim · 2024
In Brief
Linear attention plus a data-dependent forget gate. The model decides, per token and per channel, how much of the past state to retain. Combined with a hardware-efficient chunkwise algorithm, GLA was the strongest linear attention at language modelling for much of 2024.
The Mechanism
The recurrent state update in linear attention becomes S_t = G_t ⊙ S_(t-1) + k_t · v_tᵀ, where G_t is a learned, input-dependent gate applied element-wise. The gate is constrained to be in (0, 1) via a sigmoid. The gating provides the selectivity that earlier linear methods lacked, and the chunkwise computation algorithm preserves linear cost while running fast on GPUs.
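A minimal sketch of the gated update, assuming one gate value per key channel that is broadcast across the value dimension (one common way to realise an element-wise G_t). The gate logits are passed in directly here; in a real model they would come from a learned projection of the input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gla_step(S, k, v, gate_logits):
    """S_t = G_t * S_{t-1} + k_t v_t^T, with one gate in (0, 1) per key channel."""
    alpha = [sigmoid(z) for z in gate_logits]
    return [[alpha[a] * S[a][c] + k[a] * v[c] for c in range(len(v))]
            for a in range(len(k))]


def read(S, q):
    """o_t = q_t S_t."""
    return [sum(q[a] * S[a][c] for a in range(len(q))) for c in range(len(S[0]))]


S = [[0.0, 0.0], [0.0, 0.0]]
# Token 1 writes value [5, 0] under key channel 0; gates stay open.
S = gla_step(S, k=[1.0, 0.0], v=[5.0, 0.0], gate_logits=[20.0, 20.0])
# Token 2 closes the gate on channel 0 (forgetting token 1) and writes [0, 7].
S = gla_step(S, k=[0.0, 1.0], v=[0.0, 7.0], gate_logits=[-20.0, 20.0])

gla_out = read(S, q=[1.0, 1.0])
print(gla_out)  # first component is almost zero: token 1 has been forgotten
```

With every gate fixed at 1 this reduces to plain linear attention, and with a constant gate it reduces to a RetNet-style fixed decay.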
Why It Matters
GLA was the centrepiece of the Flash Linear Attention (FLA) library, which became the reference codebase for the linear attention community. It clarified what every modern linear attention needs: data-dependent gating, chunkwise execution, hardware-friendly memory access patterns. Most later models (Gated DeltaNet, Kimi Linear) are evolutions of this template.
Trade-offs
The gate adds parameters and small overhead. The model still has a representational limit — fundamentally, it must summarise the past into a fixed-size state, so very long-range exact retrieval remains hard. Beats earlier linear methods clearly; beats softmax less clearly.
· · ·
28
DeltaNet and Gated DeltaNet
Yang, Kautz, Hatamizadeh et al. · 2024
In Brief
Linear attention reinterpreted through the delta rule from classical online learning. Each new token doesn't just add to the state — it updates the state to better predict the value, the way a single gradient step would. The gated version adds GLA-style forget gates on top.
The Mechanism
Standard linear attention adds the outer product of k_t and v_t to a running matrix S. DeltaNet treats S as a one-layer linear network mapping keys to values (so that S · k is a predicted value), and applies the delta (Widrow-Hoff) rule: S_t = S_(t-1) + β_t · (v_t - S_(t-1) · k_t) · k_tᵀ. The correction term writes only the difference between the new value and what the state already predicts for that key, so a repeated key overwrites its old value instead of piling on top of it. Gated DeltaNet then adds an input-dependent decay to S before the update.
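The difference between additive accumulation and the delta rule shows up as soon as the same key is written twice. This sketch assumes S is stored as a d_v × d_k matrix so that S · k returns a value, uses a unit-norm key, and sets β = 1; the helper names are illustrative.

```python
def matvec(S, k):
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def additive_step(S, k, v):
    """Plain linear attention: S <- S + v k^T."""
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))] for i in range(len(v))]


def delta_step(S, k, v, beta):
    """Delta rule: S <- S + beta * (v - S k) k^T."""
    pred = matvec(S, k)                       # what the state currently recalls
    err = [vi - pi for vi, pi in zip(v, pred)]
    return [[S[i][j] + beta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


key = [1.0, 0.0]
zero = [[0.0, 0.0], [0.0, 0.0]]

# Write (key -> [1, 2]) and then (key -> [3, 4]) with both rules.
S_add = additive_step(additive_step(zero, key, [1.0, 2.0]), key, [3.0, 4.0])
S_delta = delta_step(delta_step(zero, key, [1.0, 2.0], 1.0), key, [3.0, 4.0], 1.0)

recall_add = matvec(S_add, key)      # [4.0, 6.0]: the two values are summed
recall_delta = matvec(S_delta, key)  # [3.0, 4.0]: the newer value replaced the old
print(recall_add, recall_delta)
```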
Why It Matters
The delta rule gives the recurrent state a clearer interpretation: it is an online estimator that is being corrected with each observation. Empirically this helped with associative recall tasks where pure additive accumulation fails. Gated DeltaNet has been a strong contender on recent linear-attention benchmarks.
Trade-offs
The delta update involves a matrix-vector product before the rank-one update, which is slightly more expensive than naive linear attention. Implementations need care to keep the per-step cost low. As with all linear methods, exact long-range recall remains harder than softmax.
· · ·
29
Lightning Attention
MiniMax · 2023 (Lightning-1), 2024 (Lightning-2)
In Brief
A linear attention with carefully engineered IO-aware kernels — Flash Attention's tricks transplanted to the linear setting. Lightning Attention-2 is the attention block inside MiniMax-Text-01, the first large-scale model to use linear attention as its primary mechanism at frontier scale.
The Mechanism
Lightning Attention is built on the standard linear attention recurrence but written as a chunkwise algorithm with intra-chunk and inter-chunk passes. The intra-chunk pass computes conventional masked attention products, (Q · Kᵀ ⊙ M) · V, inside a small block, with no softmax; the inter-chunk pass uses the linear-attention running state to bridge blocks. Splitting the work this way avoids the slow sequential cumulative sum that causal linear attention otherwise needs. The kernel is heavily optimised for tensor cores and tiled to match GPU memory hierarchy.
Why It Matters
Lightning Attention proved that linear attention could run, at scale, faster than Flash Attention in wall-clock terms, not just asymptotically. MiniMax-Text-01 was the first widely available model with a linear-dominated architecture (a small fraction of softmax layers remain in the stack), demonstrating that linear attention can power production systems at frontier scale.
Trade-offs
The chunkwise split (quadratic-style products within a chunk, recurrent state between chunks) is more complex to implement than a plain recurrence. MiniMax-Text-01 still includes occasional softmax layers, so pure linear is not yet a complete replacement. Implementation is heavily hardware-specific.
· · ·
30
Kimi Linear
Moonshot AI · 2025
In Brief
A linear attention variant from Moonshot, the team behind the Kimi assistant. Combines lessons from GLA, Gated DeltaNet, and Lightning Attention. As of late 2025, it has been the strongest publicly disclosed linear attention on long-context benchmarks.
The Mechanism
Kimi Linear uses a gated, delta-style update with a carefully chosen feature map and chunkwise hardware kernel. The technical contributions are largely in the engineering — kernel fusion, KV-cache layout, integration with the broader Kimi serving stack — rather than a brand-new mathematical formulation. The architectural surface looks like a refined GLA/DeltaNet variant.
Why It Matters
Kimi Linear is the current public benchmark for linear attention quality at long context. New linear-attention papers compare against it. It also feeds into Moonshot's product stack, where it handles part of the long-context workload alongside MoBA's sparse attention.
Trade-offs
The model is closed enough that the exact details require some inference from public material. As with all linear methods, performance on dense-reasoning tasks at very long context is the open question. Moonshot's deployment uses Kimi Linear alongside other mechanisms, suggesting it is not yet a single-architecture answer.
IV
Family Four · Six Mechanisms
Hierarchical and Compressed
These methods do not change the attention algorithm so much as shrink what it operates on. Compress the KV cache. Pool the sequence into multi-scale summaries. Decide, at inference time, which tokens to keep around. The math is still softmax — there is just less of it.
31
Lighthouse Attention
2024 — hierarchical pyramid pooling with top-K
In Brief
Build a pyramid. At level zero, you have the raw token sequence. At level one, you average pairs of tokens together. At level two, pairs of pairs. And so on. Then, for each query, look at the top few cells at each level. Coarse where you can, fine where you must.
The Mechanism
Given a sequence of N keys and values, choose a window size w. Level 0 is the original sequence. Level l is the sequence pooled into chunks of size w^l, with each chunk represented by the mean (or another pooling) of its constituent vectors. For each query, score the cells at every level by their feature-norm or by a learned criterion, and pick the top-k cells across all levels. Attend to those.
Why It Matters
Lighthouse formalises the intuition that not all parts of a long context need the same resolution. Most of a million-token document is dilution — a coarse summary suffices. A few regions are dense with signal and need precise attention. The pyramid is a clean way to express that.
Trade-offs
It is a form of compression, and compression is lossy. Tasks that need fine-grained interactions across the entire sequence (true dense attention) lose information at higher pyramid levels; if the signal is genuinely dense, the pyramid is the wrong shape. There are also several interacting hyperparameters to tune (window size, top-k, number of pyramid levels), and the tuning burden grows with model scale.
· · ·
32
H2O — Heavy Hitter Oracle
Zhang et al. · 2023
In Brief
An empirical observation: in any given attention layer, only a small fraction of tokens consistently get high attention scores. Call them the "heavy hitters." Keep those tokens in the KV cache and evict the rest. The model continues to perform well at a fraction of the memory.
The Mechanism
During generation, track the cumulative attention each cached token has received. When the KV cache is full, evict the tokens with the lowest accumulated attention. The retained tokens are the "heavy hitters" plus the most recent tokens (which the model is still going to need). The eviction policy is greedy and runs continuously, with no extra training.
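The retention rule can be sketched as a selection over accumulated scores. This is a simplified one-shot version, assuming the per-token cumulative attention has already been tracked and that the cache budget is at least as large as the protected recent window. The real method applies the decision greedily at every step; `h2o_keep` and its arguments are hypothetical names.

```python
def h2o_keep(acc_scores, budget, recent):
    """Return the sorted cache positions to keep.

    acc_scores[i] is the cumulative attention token i has received so far.
    The last `recent` tokens are always kept; the rest of the budget goes to
    the older tokens with the highest accumulated attention (heavy hitters).
    """
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    n_heavy = budget - recent  # assumes budget >= recent
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return sorted(heavy + recent_idx)


acc = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]   # made-up accumulated attention
kept = h2o_keep(acc, budget=4, recent=2)
print(kept)  # [0, 3, 4, 5]: two heavy hitters plus the two newest tokens
```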
Why It Matters
H2O is a deployment-time win that requires no retraining and no architecture change. For long-context inference where memory is the bottleneck, it shrinks the KV cache dramatically without serious quality loss. It also catalysed a wave of follow-up work on smarter eviction policies.
Trade-offs
"Heavy hitters" are determined by past attention — there is no guarantee they will remain useful in the future. Adversarial or long-tail queries can ask about an evicted token, and the model has no recourse. The method is also generation-only; it does not help prefill.
· · ·
33
TOVA
Oren, Hassid, Adi, Schwartz · 2024
In Brief
A simpler heavy-hitter variant. At each generation step, look only at the current query's attention scores to decide which token to evict. Whatever the current step found least interesting goes. No accumulated history, no oracle — just the immediate vote.
The Mechanism
At step t, the new query q_t computes attention scores against all keys in the cache. The token with the lowest score from q_t is evicted. The cache size stays fixed at some target value k. Implementation is essentially free — the attention scores are computed anyway.
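As a rough sketch of that rule, assuming the attention weights of the current query have already been reduced to one number per cached token (for example by averaging over heads), a single eviction step might look like this. The function name and arguments are illustrative.

```python
def tova_step(cache_ids, attn_weights, limit):
    """Drop the cached token that the current query attends to least.

    cache_ids[i] is the original position of the i-th cached token and
    attn_weights[i] is the current query's attention weight on it.
    """
    if len(cache_ids) <= limit:
        return list(cache_ids)
    drop = min(range(len(cache_ids)), key=lambda i: attn_weights[i])
    return cache_ids[:drop] + cache_ids[drop + 1:]


cache = [0, 1, 2, 3]
weights = [0.40, 0.05, 0.25, 0.30]   # made-up weights from the newest query
cache_after = tova_step(cache, weights, limit=3)
print(cache_after)  # [0, 2, 3]: position 1 received the lowest weight
```

Because the weights come from the attention computation that runs anyway, the only extra work is the argmin.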
Why It Matters
TOVA showed that the simplest possible eviction policy — vote by the current query — is competitive with more elaborate schemes. The lesson is that recent context dominates: what the model is doing right now is the best signal for what it will need next.
Trade-offs
Aggressive eviction degrades retrieval over long contexts. The method is best when generation is locally coherent — instruction following, casual chat — and worst when the model needs to recall a distant detail. The "current vote" heuristic is myopic.
· · ·
34
SnapKV
Li, Liu, Sheng et al. · 2024
In Brief
An observation-based approach. Look at the last few prompt tokens — they encode the user's question. See which tokens in the long context they attend to. Keep those, drop the rest. The "snap" is that you do this once at the end of prefill.
The Mechanism
During prefill, compute the full attention pattern for the last few tokens of the prompt — the "observation window." Aggregate their attention over all earlier tokens, smoothed by a small max-pooling filter to capture contiguous regions of interest. Keep the top-k tokens; discard everything else from the KV cache. Generation then proceeds with the shrunken cache.
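The selection step can be sketched in a few lines. This version assumes the attention weights of the observation-window queries over the earlier context are given as a list of rows, sums them into one vote per context position, applies a 1-D max-pool, and keeps the k highest positions. The function name and the pooling width are illustrative choices, and a real implementation does this per head and also keeps the window itself.

```python
def snapkv_select(obs_attn, k, pool=3):
    """Pick k context positions to keep from observation-window attention.

    obs_attn[r][j] is the attention weight that observation query r puts on
    context position j.
    """
    n_ctx = len(obs_attn[0])
    votes = [sum(row[j] for row in obs_attn) for j in range(n_ctx)]
    half = pool // 2
    # Max-pooling spreads a peak to its neighbours so contiguous spans survive.
    pooled = [max(votes[max(0, j - half): j + half + 1]) for j in range(n_ctx)]
    top = sorted(range(n_ctx), key=lambda j: pooled[j], reverse=True)[:k]
    return sorted(top)


obs_attn = [
    [0.0, 0.0, 0.1, 0.6, 0.1, 0.0, 0.1, 0.1],
    [0.0, 0.1, 0.0, 0.7, 0.0, 0.0, 0.1, 0.1],
]
kept_positions = snapkv_select(obs_attn, k=3)
print(kept_positions)  # [2, 3, 4]: the peak at position 3 and its neighbours
```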
Why It Matters
SnapKV exploits the structure of typical LLM use: a long context followed by a focused query. The query's attention is a near-perfect signal of what matters in the context. By collapsing the KV cache to the relevant subset at the moment of inflection, memory drops by 80–95% with minimal quality loss.
Trade-offs
The method requires a clear "question" at the end of the prompt; conversational or multi-turn use is harder to handle. If a later turn needs information that was dropped after the first snap, it is gone. SnapKV is a one-shot decision, not an ongoing one.
· · ·
35
Quest
Tang, Zhao, Tian et al. · 2024
In Brief
Query-aware KV cache selection at generation time, with bounds. For each query, estimate which cached blocks are most likely to receive high attention, and load only those into SRAM. The unloaded blocks contribute zero. Memory stays low; quality stays high because the selection is dynamic.
The Mechanism
Partition the KV cache into blocks. For each block, maintain a compact summary, typically the per-channel min and max of its keys. For an incoming query, compute an upper bound on the attention score for each block using only the summaries. Keep the top-k blocks by bound, load their full KV into SRAM, and run attention against them. The remaining blocks have lower upper bounds, so they are unlikely to hold the highest-scoring keys, and they are skipped for this step.
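A minimal sketch of the bound, assuming each block summary stores the per-channel minimum and maximum of its keys. For every channel the larger of q_c · min_c and q_c · max_c is taken, and the sum over channels can never be smaller than the true dot product with any key in the block. Function names and the toy data are illustrative.

```python
def block_summary(keys):
    """Per-channel (min, max) over the keys of one block."""
    d = len(keys[0])
    lo = [min(k[c] for k in keys) for c in range(d)]
    hi = [max(k[c] for k in keys) for c in range(d)]
    return lo, hi


def upper_bound(q, lo, hi):
    """Upper bound on q . k for every key k inside the block."""
    return sum(max(qc * l, qc * h) for qc, l, h in zip(q, lo, hi))


def select_blocks(q, blocks, top_k):
    bounds = [upper_bound(q, *block_summary(b)) for b in blocks]
    order = sorted(range(len(blocks)), key=lambda i: bounds[i], reverse=True)
    return sorted(order[:top_k]), bounds


blocks = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 2.0], [0.0, 1.0]],
    [[0.1, -0.3], [0.2, -0.1]],
]
q = [1.0, -1.0]
chosen, bounds = select_blocks(q, blocks, top_k=2)
print(chosen, bounds)  # blocks 0 and 2 are loaded; block 1 is skipped
```

In practice the summaries are computed once when a block fills up, so the per-query cost is one cheap pass over the summaries instead of over all keys.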
Why It Matters
Quest provides a principled, runtime-adaptive way to reduce attention cost without permanent eviction. Unlike H2O or SnapKV, it does not throw anything away; it just decides what to look at this step. Different queries can attend to different parts of the same cache.
Trade-offs
The bounds are loose — sometimes the block with the highest bound is not the one with the highest true score. Quest accepts a small approximation error in exchange for the speedup. Implementation requires maintaining block summaries, which costs a small constant memory per block.
· · ·
36
InfLLM
Xiao, Han, Yang et al. · 2024
In Brief
Treat the long context as an external memory. Keep a small working window in fast memory, with the rest sitting on the side as retrievable blocks. For each generation step, pull in the most relevant external blocks. The model behaves as if it has full context while really only attending to a slice.
The Mechanism
Split the context into blocks. The recent tokens stay in a sliding-window "working memory." Older blocks are stored with representative key vectors. For each generation step, retrieve the top-k external blocks (by similarity between query and block representatives), and attend to those plus the working window. The model is unchanged — InfLLM is purely an inference-time augmentation.
Why It Matters
InfLLM lets a standard short-context model handle indefinite input lengths without retraining. It is closer to retrieval-augmented generation than to traditional attention, but it operates at the token level inside the model rather than at the document level outside it. The boundary between attention and retrieval starts to blur here.
Trade-offs
Retrieval quality bounds the model's quality. The representative vectors are coarse; sometimes the right block is missed. And because the underlying model was not trained at the new effective length, there are edge cases where position handling breaks.
V
Family Five · Three Mechanisms
Latent and KV Compression
A short family. These mechanisms preserve full softmax attention but compress the keys and values themselves — either by sharing them across attention heads, or by projecting them into a smaller latent space. Memory bandwidth, not compute, is the bottleneck they attack.
37
Multi-Query Attention (MQA)
Shazeer · "Fast Transformer Decoding" · 2019
In Brief
Standard multi-head attention gives every head its own K and V. MQA gives every head its own Q but makes them all share a single K and V. The math is unchanged; the KV cache becomes h times smaller, where h is the number of heads.
The Mechanism
In standard multi-head attention with h heads, you maintain h separate sets of K and V matrices. In MQA, there is a single K and a single V, used by all h queries. Each head computes softmax(Q_i · Kᵀ) V with Q_i head-specific but K and V shared.
Why It Matters
For inference, the KV cache dominates memory bandwidth. Shrinking it h-fold makes generation dramatically faster on memory-bound hardware. PaLM and Falcon both used MQA. It is one of the highest-leverage simple changes in the history of attention.
Trade-offs
Some quality loss. With fewer distinct K, V projections, the model has less expressivity per attention block. For most tasks the loss is small; for some it is noticeable. The trade is good but not free, which is why GQA was eventually preferred.
· · ·
38
Grouped-Query Attention (GQA)
Ainslie, Lee-Thorp, de Jong et al. · Google · 2023
In Brief
The compromise between standard multi-head attention and MQA. Group the heads into g groups; within each group, share K and V. With g = 1 you have MQA. With g = h you have standard attention. The sweet spot is somewhere in between.
The Mechanism
If you have h heads and choose g groups, you maintain g sets of K and V matrices. Each head's Q is mapped to one of the g groups. Heads in the same group share keys and values. The KV cache shrinks by h / g; with h = 32 and g = 8, that's a 4× reduction with much less quality loss than MQA.
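The arithmetic is easy to check. The sketch below assumes consecutive heads are assigned to the same group and uses a hypothetical model shape (32 layers, head dimension 128, 16-bit values, 4096 cached tokens) purely for illustration.

```python
def kv_group_of(head, n_heads, n_groups):
    """Index of the shared K/V group that a query head reads from."""
    return head // (n_heads // n_groups)


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the K and V caches together (the leading 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


h, g = 32, 8
groups = [kv_group_of(i, h, g) for i in range(h)]  # heads 0-3 -> group 0, ...

shape = dict(n_layers=32, head_dim=128, seq_len=4096)  # hypothetical model
mha = kv_cache_bytes(n_kv_heads=h, **shape)   # standard multi-head attention
gqa = kv_cache_bytes(n_kv_heads=g, **shape)   # grouped-query, g = 8
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # multi-query, g = 1

print(mha // 2**20, gqa // 2**20, mqa // 2**20)  # cache sizes in MiB
```

For this shape the three caches come to 2048, 512 and 64 MiB per sequence, which is the h / g scaling described above.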
Why It Matters
GQA is now the default. LLaMA 2 70B, LLaMA 3, Mistral, Qwen, Gemma — almost every recent open-weight model uses GQA. It is the single best knob for trading a small amount of quality for substantially faster generation. The fact that it is now invisible in design discussions is a sign of how completely it won.
Trade-offs
Choosing g is a hyperparameter tuning problem; too low and quality drops, too high and you lose the memory benefit. Most models settle on g around 4 or 8. The architecture is otherwise identical to standard attention, so the trade is genuinely small.
· · ·
39
Multi-head Latent Attention (MLA)
DeepSeek-V2, DeepSeek-V3 · 2024
In Brief
Instead of storing the full keys and values, store a compressed latent vector per token and reconstruct keys and values on the fly. The KV cache becomes dramatically smaller — typically more than ten times smaller than GQA. The model retains full multi-head expressivity at compute time.
The Mechanism
For each token, project the input down to a small latent vector c of dimension much smaller than the full K, V dimensions. Store only c. When attention is computed, project c back up to the per-head K and V using learned matrices. By careful algebra, the up-projection can be absorbed into the query and output projections, so the actual stored quantity is genuinely just c.
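The absorption step rests on a simple identity: q · (W c) equals (Wᵀ q) · c, so the key up-projection W can be folded into the query side and the cache only needs c. The toy check below uses made-up numbers and a hypothetical up-projection named `W_uk`; it ignores the decoupled RoPE path and the analogous absorption of the value up-projection into the output projection.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


W_uk = [[1.0, 0.0],      # up-projection: latent (dim 2) -> key (dim 3)
        [0.5, -1.0],
        [2.0, 1.0]]
c = [0.3, -0.7]          # cached latent for one past token
q = [1.0, 2.0, -1.0]     # query for one head

# Naive path: rebuild the full key, then take the dot product.
score_naive = dot(q, matvec(W_uk, c))

# Absorbed path: project the query down once, then score against c directly.
q_absorbed = matvec(transpose(W_uk), q)
score_absorbed = dot(q_absorbed, c)

print(score_naive, score_absorbed)  # the same attention logit either way
```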
Why It Matters
MLA is the architecture choice that makes DeepSeek-V3's efficiency possible. The KV cache compression is so aggressive that serving cost per token drops by an order of magnitude relative to comparable GQA models. And unlike many compression schemes, MLA is trained end-to-end: the model learns to make the latent representation sufficient.
Trade-offs
The architecture is more complex; the algebraic absorption trick takes a moment to internalise. The latent dimension is a hyperparameter — too small and quality suffers, too large and you lose the memory advantage. RoPE positional encoding interacts awkwardly with MLA and required a special "decoupled" RoPE scheme in the DeepSeek implementation.
VI
Family Six · Five Mechanisms
Hybrid Architectures
The pragmatist's family. If softmax is too expensive at long context and linear methods are imperfect, alternate them. Most layers do something cheap; a few do softmax where it matters. The result is often better than either pure approach.
40
Jamba
AI21 Labs · 2024
In Brief
Alternating layers of Mamba and softmax attention, with a mixture-of-experts feedforward on top. The first major hybrid Mamba-Transformer at production scale. Released open-weight and competitive on long-context benchmarks.
The Mechanism
The block pattern is one attention layer for every seven Mamba layers, repeated. The attention and Mamba layers are separate, self-contained blocks. Mixture-of-experts replaces the MLP in every other layer to add capacity without proportional compute. Every token still passes through every layer; sparsity comes only from the router, which activates a few experts per token in each MoE layer.
Why It Matters
Jamba showed that mixing attention types is not just academically tidy — it is the practical move. The few softmax layers anchor exact retrieval and copying; the many Mamba layers handle the rest cheaply. The 1:7 ratio was a finding, not a guess — the team explored several and settled there. Many later hybrids reused similar ratios.
Trade-offs
Two architectures means two sets of optimisations and two sets of failure modes. Implementation, kernel work, and debugging all duplicate. The mixture-of-experts adds another dimension of complexity around routing balance.
· · ·
41
Zamba and Zamba2
Zyphra · 2024
In Brief
Another Mamba-attention hybrid, this one with a shared global attention module reused across the network. The shared block compresses parameter count while keeping the cross-layer flexibility of full attention.
The Mechanism
The model is mostly Mamba layers. Periodically, a shared softmax attention block is interleaved — but it is a single set of parameters, used many times across the network rather than duplicated. The Mamba layers compose with this shared module like residual connections to a global memory.
Why It Matters
Zamba2 is one of the more parameter-efficient open-weight models at its scale — its long-context behaviour is good relative to its size because the Mamba layers carry most of the work cheaply. The parameter-sharing idea is unusual and worth knowing.
Trade-offs
Shared blocks can become bottlenecks during training — gradients to the shared module are noisier. The architecture is harder to scale up than a uniformly-stacked design, since the placement of the shared blocks is itself a hyperparameter.
· · ·
42
Samba
Microsoft · 2024
In Brief
A hybrid of Mamba and sliding-window attention. No full-sequence attention anywhere: both components are linear in sequence length. The result is a model that scales to long sequences without the per-step cost of any quadratic operation.
The Mechanism
Alternate Mamba blocks (for global, summarised context) with sliding-window attention blocks (for precise local interaction). The window is small — typically 2K. Mamba handles long-range information transfer cheaply through its recurrence; the windowed attention provides exact local resolution that Mamba alone struggles with.
Why It Matters
Samba is a clean demonstration that the long-context game can be won without any full-attention layer in the stack (the windowed layers still use softmax, but only over a fixed-size window). Each component is cheap; the combination covers each component's weakness. It is also a useful design point in the bestiary: linear-time cost is possible, if you accept the local-window restriction.
Trade-offs
The windowed attention is limited; truly long-range exact retrieval (find a needle 500K tokens back, exactly) is hard. For tasks that need that, Samba is not the answer. For ordinary long-context language modelling it is more than enough.
· · ·
43
Hymba
NVIDIA · 2024
In Brief
Most hybrids alternate layers — Mamba, then attention, then Mamba. Hymba runs them in parallel within the same layer. Half the heads do attention, half do Mamba, the outputs are fused. A different way to combine the two.
The Mechanism
Each block has parallel branches: attention heads (using sliding window) and Mamba (SSM) heads. Both branches process the input independently and their outputs are concatenated and projected. The attention provides precise recall; the SSM provides long-range summarisation; the fusion lets each contribute where it is strongest.
Why It Matters
The parallel framing is conceptually cleaner — each layer has both kinds of computation, rather than relying on the stack to mix them. Hymba reports strong results at small scales, particularly the 1B-1.5B range, where attention-only models often struggle relative to their parameter count.
Trade-offs
Parallel branches double the kernel work per layer — though each branch is doing less. The fusion is a learned projection, which adds parameters. The design is more complex to implement and debug than serial alternation.
· · ·
44
MiniMax-Text-01
MiniMax · 2025
In Brief
A 456B-parameter mixture-of-experts model with mostly Lightning Attention layers and occasional softmax attention. The first frontier-scale model where linear attention is the dominant mechanism. Supports 4-million-token context.
The Mechanism
The stack is mostly Lightning Attention (linear) blocks with softmax attention inserted at a 1-in-8 ratio. Each layer uses mixture-of-experts feedforward to scale capacity. The linear attention does the long-range work cheaply; the periodic softmax layers handle exact retrieval and copying. Together the architecture supports million-token inference at a fraction of what a pure softmax model would cost.
Why It Matters
MiniMax-Text-01 is a proof point. Frontier capability with mostly linear attention had not been demonstrated openly until this model. It validates the hybrid template at scale and gives the rest of the field a reference architecture to point at when arguing for non-softmax stacks.
Trade-offs
The model is huge and complex. Reproducing its quality outside of MiniMax requires both the architecture and the training recipe. As a public artefact it is more existence proof than off-the-shelf option — but the architectural lessons travel.
VII
Family Seven · Six Mechanisms · a companion
Positional Encodings
A companion family. These are not attention mechanisms in themselves; they decide how the model knows where a token sits in the sequence. They matter here because long-context behaviour depends on positional encoding at least as much as on the attention mechanism — sometimes more. A great attention with broken positions is useless.
45
RoPE — Rotary Position Embedding
Su, Lu, Pan et al. · 2021
In Brief
The dominant positional encoding in modern LLMs. Apply a position-dependent rotation in 2D subspaces of the query and key vectors. The dot product between query and key then depends on position only through the offset between the two tokens, not on their absolute positions. Elegant, parameter-free, hardware-friendly.
The Mechanism
Split each query and key vector into 2D pairs. For position m, rotate each pair by an angle m · θ_i, where θ_i is a fixed per-pair frequency. When two rotated vectors are dotted, the result depends on the difference of positions, not the positions themselves. So the attention score between position m and position n sees only m - n.
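The relative-position property is easy to verify numerically. This sketch pairs adjacent dimensions (2p, 2p+1) and uses frequencies θ_p = base^(-2p/d) with base 10000, following the original formulation; many codebases pair dimension p with p + d/2 instead, which changes the layout but not the property.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of x by the angle pos * theta_p."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # i = 2p, so this is base^(-2p/d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (3) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 103), rope(k, 100))
print(score_near, score_far)  # equal up to rounding: only m - n matters
```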
Why It Matters
RoPE is everywhere. LLaMA, Mistral, Qwen, GPT-NeoX, Falcon, DeepSeek — all use RoPE in some form. It composes well with Flash Attention, requires no extra parameters, and gives clean relative-position semantics. Most long-context extensions (PI, NTK, YaRN) are RoPE modifications.
Trade-offs
RoPE does not extrapolate gracefully past the training context length — the rotation frequencies were calibrated for a particular range, and going far beyond it makes the model behave erratically. This is why context extension techniques exist as their own subfield.
· · ·
46
ALiBi — Attention with Linear Biases
Press, Smith, Lewis · 2022
In Brief
No positional embedding in the input. Instead, add a linear bias to the attention scores: the score between positions i and j gets a penalty proportional to |i - j|. Closer tokens get a higher score by default; far tokens get a built-in handicap that the data has to overcome.
The Mechanism
For each attention head, fix a slope m_h. The score between query at position i and key at position j is the usual dot product minus m_h · |i - j|. Different heads get different slopes; together they cover a range of locality preferences from strong (high slope) to weak (low slope).
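A small sketch of the bias, assuming a causal model and the geometric slope schedule commonly used when the number of heads is a power of two (for 8 heads: 1/2, 1/4, ..., 1/256). The bias is simply added to the q · k scores before the softmax.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(n, slope):
    """bias[i][j] = -slope * (i - j) for j <= i; future positions are masked."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])   # the steepest head, slope 0.5

print(slopes[0], slopes[-1])      # 0.5 and 0.00390625
print(bias[3])                    # [-1.5, -1.0, -0.5, -0.0]
```

The head with the largest slope behaves almost like a short local window, while the head with the smallest slope sees distant tokens with very little penalty.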
Why It Matters
ALiBi's surprise was that models trained with it extrapolate to longer sequences than they were trained on, much more gracefully than RoPE without modification. The linear decay is smooth and never becomes nonsensical at any distance. Several long-context models adopted it specifically for that property.
Trade-offs
ALiBi forces a locality prior on every layer. For tasks where the important token can be anywhere, this is a handicap. RoPE-based models with proper extension have largely surpassed ALiBi on long-context benchmarks, so its moment as the long-context choice has passed.
· · ·
47
PI — Position Interpolation
Chen, Wong, Chen, Tian · Meta · 2023
In Brief
The first principled context extension for RoPE. Take a model trained at length L that you want to use at length L'. Rescale the RoPE position indices by L / L', so the new long context still fits within the rotation range the model saw. A few hundred steps of fine-tuning recover quality.
The Mechanism
RoPE rotates pair i by angle m · θ_i at position m. Position Interpolation replaces this with m · θ_i · (L / L'), compressing the entire new range into the angular range the model was trained on. The model still receives angles it has seen during training, just spaced more finely.
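In code the change is a single multiplication. The sketch below assumes the usual RoPE frequencies θ_p = base^(-2p/d) and illustrative lengths L = 2048 and L' = 8192; `rope_angles` is a hypothetical helper.

```python
def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale = L / L' gives Position Interpolation."""
    return [pos * scale * base ** (-2 * p / dim) for p in range(dim // 2)]


L_train, L_new = 2048, 8192
scale = L_train / L_new                      # 0.25

# The last position of the long context maps back inside the trained range.
last_scaled_position = (L_new - 1) * scale   # 2047.75

# Position 4 in the extended model gets exactly the angles that position 1
# had during training: same range, finer spacing.
same_angles = rope_angles(4, 8, scale=scale) == rope_angles(1, 8)
print(last_scaled_position, same_angles)
```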
Why It Matters
PI is the simplest and best-understood context extension. It works. The fine-tuning required is modest (a few hundred million tokens at the longer length). Many open-source long-context releases used PI before more sophisticated alternatives existed.
Trade-offs
Quality degrades when the extension factor is large. Stretching a 4K model to 128K is fine; stretching to 1M loses fidelity. Later methods (NTK, YaRN) address this with frequency-aware scaling.
· · ·
48
NTK-aware Scaling
community-discovered, formalised by bloc97 and others · 2023
In Brief
An empirical refinement of Position Interpolation. Instead of rescaling all RoPE frequencies uniformly, rescale them in a frequency-dependent way: high-frequency components (fast-rotating pairs, which encode short-range info) are barely touched; low-frequency components (slow-rotating, long-range) are stretched more aggressively. Better extrapolation, less fine-tuning required.
The Mechanism
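The name refers loosely to neural tangent kernel reasoning. Rather than rescaling positions, the method enlarges the RoPE base b to b · α^(D / (D - 2)), where D is the head dimension and α is the scale factor (nominally the extension factor L' / L). Since θ_i = b^(-2i / D), each frequency θ_i becomes θ_i · α^(-2i / (D - 2)). The fastest pair (i = 0) is unchanged and still cycles correctly, while the slowest pair (i = D/2 - 1) is slowed by the full factor α, exactly as under Position Interpolation; the pairs in between fall smoothly between the two. The slowest pairs carry the long-range information that the original training did not specify well.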
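A sketch of the base-change form in which NTK-aware scaling is usually implemented: the RoPE base is multiplied by α^(D / (D - 2)), which leaves the fastest pair untouched and slows the slowest pair by exactly α. The head dimension 64, α = 4, and the helper names are assumptions for illustration.

```python
def rope_freqs(head_dim, base=10000.0):
    # theta_i = base^(-2i / head_dim) for pair i = 0 .. head_dim/2 - 1.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_freqs(head_dim, alpha, base=10000.0):
    # Enlarge the base so that the slowest pair slows down by a factor of alpha
    # while the fastest pair (i = 0) keeps its frequency.
    new_base = base * alpha ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)


D, alpha = 64, 4.0
orig_freqs = rope_freqs(D)
scaled_freqs = ntk_freqs(D, alpha)

# Per-pair slowdown: 1.0 for the fastest pair, falling to 1 / alpha for the slowest.
ratios = [s / o for s, o in zip(scaled_freqs, orig_freqs)]
```

Compare this with Position Interpolation, where every entry of ratios would equal 1 / α.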
Why It Matters
NTK-aware was the first context extension that worked with no fine-tuning at all — you could take a 4K LLaMA, apply NTK rescaling, and run it at 8K or 16K immediately with usable quality. It was the open-source community's first major contribution to long-context, and it set the stage for YaRN.
Trade-offs
The exact rescaling formula is not derived from first principles; it was empirically tuned. Variations (dynamic NTK, NTK-by-parts) proliferated. At very large extension factors quality still degrades. The lack of theoretical grounding made it feel like a hack: effective, but a hack.
· · ·
49
YaRN — Yet another RoPE extensioN
Peng, Quesnelle, Fan, Shippole · 2023
In Brief
The synthesis. Combines NTK-aware frequency-dependent scaling with PI-style interpolation and an attention temperature adjustment that compensates for the longer sequences. With modest fine-tuning, extends models cleanly to many times their original context length. The standard recipe for context extension in late 2023 and 2024.
The Mechanism
YaRN partitions the RoPE pairs into three regimes by wavelength relative to the training context. Pairs whose wavelength is much shorter than the context (high frequency) are left as-is. Pairs whose wavelength exceeds the context (low frequency) are fully interpolated, PI-style. Pairs in between receive a linear blend of the two, following the NTK-by-parts scheme. A small temperature factor on the attention logits compensates for the slightly different distribution that the rescaled positions induce. A few hundred million tokens of fine-tuning at the new length completes the job.
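The sketch below follows the ramp formulation of the YaRN paper, in which each pair is classified by how many full rotations it completes within the training context and the middle band is a linear blend of the two extremes. The thresholds 1 and 32 are the values suggested there for LLaMA and should be treated as assumptions for other models; the function and parameter names are illustrative.

```python
import math


def yarn_freqs(head_dim, train_len, scale, base=10000.0, low=1.0, high=32.0):
    freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2.0 * i / head_dim)
        # Number of full rotations this pair completes over the training context.
        rotations = train_len * theta / (2.0 * math.pi)
        # keep = 1: leave the frequency untouched (short wavelength).
        # keep = 0: interpolate PI-style by dividing by scale (long wavelength).
        # In between: linear ramp from one to the other.
        keep = min(1.0, max(0.0, (rotations - low) / (high - low)))
        freqs.append(keep * theta + (1.0 - keep) * theta / scale)
    return freqs


def yarn_logit_scale(scale):
    # Factor applied to the attention logits (the inverse temperature 1 / t),
    # using the paper's fit sqrt(1 / t) = 0.1 * ln(scale) + 1.
    return (0.1 * math.log(scale) + 1.0) ** 2


freqs = yarn_freqs(head_dim=128, train_len=4096, scale=8.0)
logit_scale = yarn_logit_scale(8.0)
```

In practice the logit factor is often folded into the rotary embedding by scaling queries and keys each by its square root, so the attention kernel itself needs no change.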
Why It Matters
YaRN is the closest thing to a standard for context extension. Qwen-1M, several Mistral variants, and many community fine-tunes use YaRN. It is the method most likely to be applied to any pre-trained RoPE model that wants to handle long contexts.
Trade-offs
Like all extension methods, YaRN's quality at extreme extension factors (32× or more) is unreliable. The fine-tuning still costs compute. Some tasks degrade visibly past 4–8× extension, even with YaRN — a fact worth knowing when evaluating "1M-context" models that were extended rather than natively trained.
· · ·
50
NoPE — No Position Encoding
Kazemnejad et al. · 2023
In Brief
The provocation: do not use a positional encoding at all. The argument is that causal attention plus a deep stack of layers is enough on its own to give the model positional information implicitly. In some settings, the resulting model extrapolates to longer sequences better than RoPE or ALiBi.
The Mechanism
Remove all positional embeddings: no RoPE, no ALiBi, no learned positions, nothing. The only thing the model has to distinguish positions is the causal mask, under which the representation at position t depends on the tokens up to t and none after. The hypothesis is that this asymmetry, propagated through layers, is enough to encode position in the residual stream.
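A toy illustration of how the causal mask alone can leak position. It is not the paper's experimental setup: it assumes a single head with content-free (all-zero) scores and a marker value carried only by the first token. Because position t averages over t + 1 visible tokens, the output is 1 / (t + 1), from which later layers could recover t.

```python
import math


def causal_softmax_row(scores, t):
    # Softmax over positions 0..t only; later positions are masked out.
    exps = [math.exp(s) for s in scores[: t + 1]]
    total = sum(exps)
    return [e / total for e in exps]


seq_len = 6
values = [1.0] + [0.0] * (seq_len - 1)  # only the first token carries a marker
scores = [0.0] * seq_len                # identical scores: no positional signal

outputs = []
for t in range(seq_len):
    weights = causal_softmax_row(scores, t)  # uniform over the t + 1 visible tokens
    outputs.append(sum(w * v for w, v in zip(weights, values)))

# outputs[t] is 1 / (t + 1), a different value at every position.
```

Without the mask every row would average over all six tokens and every output would be identical, which is why the argument applies to causal models specifically.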
Why It Matters
NoPE's empirical results were surprising. On certain reasoning benchmarks, NoPE models extrapolated to lengths much longer than they were trained on, while RoPE models broke. The paper became a piece of evidence that explicit positional encoding might be more crutch than necessity in causal models — though the picture is complex and task-dependent.
Trade-offs
NoPE works best in narrow settings — small models, particular kinds of synthetic tasks. At scale, the most successful models still use RoPE. NoPE is more of a useful provocation than a production option, but it teaches something important about what attention plus causality alone can do.
⁂
Fin
Fifty mechanisms, ten years of work, and a great deal more left out than included. Linear attention's gating evolves quarterly; sparse methods invent new routing tricks faster than this page can be updated; some hybrid model published next week will look obvious in retrospect.
The taxonomy in this book is a snapshot. Treat it as such. What persists is the underlying question that every entry tries to answer: how much of the past should the present be allowed to see, and at what cost?