The Attention Atlas 2017 — 2026
A Field Guide · Vol. I

The Attention Atlas

A reader's guide to every major attention mechanism in modern deep learning — from the original Transformer to Flash Attention 4 — explained as a book.

Seven Families · Fifty Mechanisms · One Decade
Preface

How to read this book

This is not a survey paper and it is not a tutorial. It is closer to a field guide — the kind of book a naturalist takes into the forest. Each entry is a single attention mechanism. Each entry has the same four parts so that you can compare them honestly, without having to learn a new vocabulary every time.

The parts are: In Brief, where the mechanism is described in two or three plain sentences, the way you might describe it to a colleague over coffee. The Mechanism, where the actual math or algorithm appears, kept as simple as honesty allows. Why It Matters, which places the work in context — what problem it solved, what door it opened. And Trade-offs, because every mechanism is a deal with the devil somewhere, and the question is only which devil.

The mechanisms are grouped into seven families. The families are not watertight — Mamba is a state-space model that turns out to be a linear attention; MLA is a softmax attention with a compression trick; some hybrids belong in two places at once. The taxonomy is a reading aid, not a truth. Treat it that way.

The book is meant to be read in order, but every entry stands on its own. Skip what you know. Linger on what you don't.

Contents

The Seven Families

  1. Exact Softmax, Engineered Better · 9 entries
  2. Sparse and Pattern-Based · 9 entries
  3. Linear Attention · 12 entries
  4. Hierarchical and Compressed · 6 entries
  5. Latent and KV Compression · 3 entries
  6. Hybrid Architectures · 5 entries
  7. Positional Encodings, a companion · 6 entries
I
Family One · Nine Mechanisms

Exact Softmax, Engineered Better

The mechanisms in this family compute exactly the same thing as the original Transformer. They are not approximations. They are software engineering — clever uses of memory hierarchy, tiling, and parallelism — that made the impossible merely expensive.

01

Vanilla Attention

The original. Every token looks at every other token, computes how much it cares (a scalar), and uses those weights to take a weighted average of value vectors. It is brutally simple, brutally expressive, and brutally expensive. Everything in this book is a reaction to it.

You take three matrices derived from the input: queries Q, keys K, and values V. Each row of Q is a token asking a question. Each row of K is a token offering an answer. The dot product between them, scaled, softmaxed, and used to weight V, gives the output.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V

The matrix inside the softmax is N × N — every token by every other token. This is where all the trouble starts. For a million tokens, that is a trillion entries.
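The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with no batching; the function names and the `causal` flag are choices made for this sketch, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # shape (N, N): the quadratic part
    if causal:
        # Token i may only look at positions j <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, weights = attention(Q, K, V, causal=True)
```

With the causal mask on, the first token can only see itself, so its output row is exactly its own value vector.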

[Figure: Full causal attention, O(N²). Axes: Q against K.]

Before 2017, sequence modeling meant recurrence: you read the sequence one token at a time and hoped the hidden state remembered enough. Attention threw that out. Every position could see every other position in a single layer, in parallel, on a GPU. It changed not just NLP but vision, audio, biology, and code. Most of modern AI is downstream of this one equation.

Quadratic. Memory and compute scale as N². At 1,000 tokens this is fine. At 100,000 it is painful. At 1,000,000 it is impossible without help. The rest of this book is, in one way or another, that help.
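To make the quadratic growth concrete, here is a small back-of-the-envelope calculation. It assumes 2 bytes per score (half precision) and counts a single score matrix for one head in one layer; both are assumptions of this sketch, not figures from the text.

```python
def score_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Bytes needed to store one N x N score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (1_000, 100_000, 1_000_000):
    gb = score_matrix_bytes(n) / 1e9
    print(f"N = {n:>9,}: {gb:,.3f} GB")
```

Under those assumptions the three sequence lengths need about 2 MB, 20 GB, and 2 TB respectively, before multiplying by heads and layers.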

· · ·
02

Scaled Dot-Product Attention (SDPA)

Not an algorithm — a doorway. torch.nn.functional.scaled_dot_product_attention is PyTorch's unified call that hides which kernel actually runs underneath. You write one line; PyTorch picks Flash, memory-efficient, or naive depending on your inputs and hardware.

SDPA is a dispatcher. Given Q, K, V, an optional mask, and flags for causal or dropout, it inspects your tensors and picks the fastest available backend: Flash Attention if the shapes and dtype support it, the memory-efficient backend if not, and finally a plain math implementation as fallback. The math it computes is identical to vanilla attention.

[Figure: SDPA, one call, many backends. scaled_dot_product dispatches to Flash if available, Mem-Eff as fallback, Math as last resort. PyTorch picks; you don't know which. Same output, different kernel.]

SDPA matters because it is the layer between researchers and the kernel zoo below. Before it, using Flash Attention meant installing a separate library, matching CUDA versions, and rewriting your model. After it, you write one PyTorch call and inherit whatever the framework currently knows is fastest. It is the API that made Flash Attention universal without anyone noticing.

You give up explicit control. The dispatcher's heuristics are good but not infallible — sometimes you would have preferred a different backend. For most users this is invisible. For kernel engineers it is the source of late-night debugging.

· · ·
03

Memory-Efficient Attention

The first proof that you do not need to store the attention matrix to compute attention. You can stream through it. The paper is short, almost playful, and it set up everything that Flash Attention later industrialised.

The trick is to compute the softmax incrementally. As you scan through the keys, you keep a running maximum (for numerical stability) and a running normaliser. You never materialise the full N × N matrix; you only ever hold a strip of it at a time. The backward pass recomputes what it needs rather than caching it.
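The running-maximum and running-normaliser bookkeeping can be sketched for a single query as follows. This is an illustrative NumPy version, not the paper's implementation; the chunk size and function names are arbitrary choices for the example.

```python
import numpy as np

def streaming_attention(q, K, V, chunk=4):
    """Attention output for one query, scanning the keys in chunks."""
    d = q.shape[-1]
    running_max = -np.inf                 # largest score seen so far
    running_sum = 0.0                     # normaliser, relative to running_max
    acc = np.zeros(V.shape[-1])           # unnormalised weighted sum of values
    for start in range(0, K.shape[0], chunk):
        k_blk, v_blk = K[start:start + chunk], V[start:start + chunk]
        s = k_blk @ q / np.sqrt(d)        # scores for this strip only
        new_max = max(running_max, s.max())
        # Rescale everything accumulated under the old maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum()
        acc = acc * correction + p @ v_blk
        running_max = new_max
    return acc / running_sum

def reference_attention(q, K, V):
    """The same quantity computed with all scores held at once."""
    s = K @ q / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((37, 8))
V = rng.standard_normal((37, 5))
```

Only one strip of scores exists at any moment, yet the result matches the reference up to floating-point rounding, whatever the chunk size.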

memory: O(log N)   →   compute: O(N²) but streamable
[Figure: Streaming through attention with O(log N) memory: running max, running sum, one strip at a time.]

It demolished a piece of folk wisdom — that attention's quadratic memory was inherent. It was not. It was a choice of how to organise the computation. The paper is the conceptual ancestor of every IO-aware kernel that followed.

The original reference implementation, written in JAX rather than as a hand-tuned GPU kernel, was slower than the standard one despite using less memory. The compute was still O(N²); you just paid for it without the cache. Flash Attention would later fix the speed problem too.

· · ·
04

Flash Attention

Take Rabe and Staats's idea and engineer it for the GPU memory hierarchy. Block the computation so that data stays in fast on-chip SRAM as long as possible, and only writes back to slow HBM when forced to. Result: same output, linear memory, and up to about 7× faster on the attention computation itself.

The GPU has two kinds of memory: SRAM (tiny, near the cores, fast) and HBM (large, far, slow). Standard attention writes and reads the full N × N matrix to HBM. Flash Attention tiles the computation into blocks small enough to fit in SRAM. For each query block and key block, it loads, computes a partial softmax inside SRAM, updates running statistics, and only writes the final output to HBM. The matrix is never assembled.
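The tiling logic can be imitated in NumPy, with array slices standing in for SRAM tiles. This is a sketch of the bookkeeping only: it says nothing about real GPU memory traffic, and the loop order used here (query tiles on the outside) is chosen for readability rather than taken from the original kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, q_block=4, k_block=4):
    """Exact attention computed tile by tile; the N x N matrix is never formed."""
    N, d = Q.shape
    out = np.zeros((N, V.shape[1]))
    for qs in range(0, N, q_block):                    # one resident query tile
        q_tile = Q[qs:qs + q_block]
        m = np.full(q_tile.shape[0], -np.inf)          # running row maxima
        l = np.zeros(q_tile.shape[0])                  # running row normalisers
        acc = np.zeros((q_tile.shape[0], V.shape[1]))  # unnormalised outputs
        for ks in range(0, K.shape[0], k_block):       # stream key/value tiles past it
            s = q_tile @ K[ks:ks + k_block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)                  # rescale old partial results
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + k_block]
            m = m_new
        out[qs:qs + q_block] = acc / l[:, None]        # the only write of final output
    return out

def naive_attention(Q, K, V):
    """Reference that builds the full score matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
```

The largest intermediate is a q_block × k_block tile of scores, regardless of N.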

Mathematically identical to vanilla. Numerically nearly so — only floating-point reordering differs.

[Figure: IO-aware tiling keeps data in SRAM. HBM (slow): the N×N matrix is never built. SRAM (fast): load a tile, compute, accumulate, write the final output only.]

This was the moment long-context transformers became practical. Sequences of 4K, then 16K, then 64K became routine. The paper also reframed the field: attention's bottleneck was not FLOPs, it was memory bandwidth. Everyone started thinking in those terms.

Compute is still O(N²); only memory dropped to O(N). So the FLOP cost of long sequences remains. Implementation is CUDA-specific and tied to particular GPU generations.

· · ·
05

Flash Attention 2

Flash Attention again, with the parallelism rewritten. Same math, roughly twice as fast. The headline number is that it reaches around 70% of theoretical GPU peak on attention — close to what cuBLAS does for dense matrix multiplication.

The original Flash Attention parallelised across batch and heads but processed each sequence somewhat serially. Flash Attention 2 also parallelises across the sequence dimension itself. It swaps the loop order so that the outer loop runs over query blocks and the inner loop over key/value blocks, which lets each thread block finish its own output rows independently. It also reduces the number of non-matmul instructions (which run on different, slower units than matmul on modern GPUs) and partitions work between warps more cleverly.

[Figure: Flash 2 is parallel over the sequence too. Flash 1: one at a time. Flash 2: all in parallel. About 2× faster, about 70% of GPU peak.]

Most production systems today — vLLM, SGLang, HuggingFace Transformers — call into Flash Attention 2 under the hood. It is the default. The 2× speedup compounded with the existing memory savings is what made 128K-context models feasible to serve at scale.

Hopper (H100) and later get diminishing returns because they have features (TMA, warp specialisation, FP8) that Flash 2 does not use. That gap is what Flash 3 was built to close.

· · ·
06

Flash Attention 3

Flash Attention rebuilt for the Hopper architecture (H100). Asynchronous, warp-specialised, FP8-capable. Another roughly 2× speedup on H100, and FP8 brings another 2× on top.

Three changes do the work. First, the Tensor Memory Accelerator (TMA) handles data movement asynchronously, so warps doing matmul do not stall waiting for memory. Second, warps are split into producer and consumer roles — some load data, others compute — overlapping work that used to be serial. Third, FP8 is supported with a clever rescaling scheme that keeps accuracy acceptable for attention even at the lower precision.

[Figure: Flash 3, asynchronous producer and consumer warps. The producer warp loads via TMA while the consumer warp does matmul, so there are no stalls. Hopper (H100) only, FP8 optional, about 2× Flash 2.]

Frontier labs run on H100s; Flash 3 is what those clusters actually use. When Anthropic or OpenAI talks about long-context inference cost, the relevant kernel underneath is some descendant of Flash 3. It is the closest thing to a free lunch in modern deep learning hardware.

Hopper-specific. On older GPUs (A100, A6000) it offers little or nothing — those users stay on Flash 2. FP8 attention requires care: not every workload tolerates the precision loss without quality regression.

· · ·
07

Flash Attention 4

The next iteration, rebuilt for Blackwell GPUs. Public details are still partial as of early 2026, but the direction is clear: use Blackwell's fifth-generation Tensor Cores, larger SRAM, and improved FP4 / FP6 support to push attention throughput further. Another generational step rather than a paradigm shift.

Same recipe — IO-aware tiling, async data movement, warp specialisation — re-tuned for new hardware. Blackwell exposes more SRAM per SM, faster inter-SM communication, and native support for very low-precision formats. Flash 4 leans on all of these. It also targets training, where reduced precision is harder to make work cleanly.

[Figure: Flash 4, Blackwell-tuned. A B200 / B300 SM with 5th-gen Tensor Cores, larger SRAM, FP4 / FP6, and faster comms. Same recipe, new silicon: a generational step.]

Each Flash generation has roughly doubled effective attention throughput. That compounding is what makes longer contexts cheaper every year despite quadratic compute. Without Flash, a million-token context window would still be a research curiosity.

Hardware-bound — runs only on Blackwell. The vast majority of GPUs in the world are not Blackwell. For most practitioners this is something to look forward to, not something to use today.

· · ·
08

xFormers Memory-Efficient Attention

Meta's library of attention kernels. Predates Flash in some ways, ships several backends, and remains the production choice for many diffusion model pipelines. Less famous than Flash, equally pragmatic.

xFormers is less a single algorithm and more a toolbox. It includes a memory-efficient attention based on the Rabe-Staats principle (with CUDA optimisations), block-sparse attention kernels, several positional bias schemes, and utilities for attention with arbitrary masks. The library auto-selects implementations based on tensor shapes and hardware.

[Figure: xFormers as a kernel library: mem_eff, block_sparse, flash port, bias schemes, masking utils, and more. Auto-selects by shape and hardware.]

Stable Diffusion ran on xFormers for a long time. Many image and video generation pipelines still do. In the LLM world, Flash dominates; in the diffusion world, xFormers had earlier mind-share and kept it. The two libraries now overlap significantly.

Less aggressive on the latest hardware than Flash 3 and 4. The codebase is broader, which means more features but slower release cadence on the cutting edge.

· · ·
09

FlexAttention

You describe the attention pattern you want — causal, sliding window, ALiBi, document masking, anything — as a small Python function. PyTorch compiles it down to a Flash-style kernel. Custom attention without writing CUDA.

FlexAttention exposes two user-defined functions: a score_mod that transforms the attention score for any pair of positions, and a mask_mod that decides whether a pair is attended to at all. PyTorch's compiler (TorchInductor) lowers these into a fused kernel that behaves like Flash Attention but with your custom logic baked in. Sparse masks become genuinely sparse kernels — they skip masked-out blocks entirely.
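The division of labour between the two functions can be imitated densely in NumPy. To be clear, this is an emulation of the idea, not the PyTorch API: the real callbacks also receive batch and head indices, which are dropped here, and nothing below is compiled or sparse. The helper names and the slope and window values are made up for the example.

```python
import numpy as np

def emulate_flex_attention(Q, K, V, score_mod=None, mask_mod=None):
    """Dense NumPy emulation of the score_mod / mask_mod idea (not the PyTorch API)."""
    N, d = Q.shape
    q_idx = np.arange(N)[:, None]
    kv_idx = np.arange(K.shape[0])[None, :]
    scores = Q @ K.T / np.sqrt(d)
    if score_mod is not None:
        scores = score_mod(scores, q_idx, kv_idx)      # e.g. add a positional bias
    if mask_mod is not None:
        # Pairs for which mask_mod is False are not attended to at all.
        scores = np.where(mask_mod(q_idx, kv_idx), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

def alibi_like_bias(score, q_idx, kv_idx, slope=0.5):
    """Hypothetical score_mod: penalise distant pairs linearly."""
    return score - slope * np.abs(q_idx - kv_idx)

def causal_sliding_window(q_idx, kv_idx, window=3):
    """Hypothetical mask_mod: causal, and at most `window` tokens back."""
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out, w = emulate_flex_attention(Q, K, V, alibi_like_bias, causal_sliding_window)
```

Swapping in a different pattern means swapping a three-line function, which is the whole appeal; the real system additionally turns the mask into skipped blocks.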

[Figure: FlexAttention, from Python to a fused kernel. Your function, for example def mod(s, b, h, q_idx, kv_idx): return s + alibi(q_idx, kv_idx), is compiled into a fused Flash-style kernel. Describe the pattern, get the kernel. Causal, window, ALiBi, and document masks all go through one API.]

Before FlexAttention, every new attention variant required someone to write CUDA. Custom positional schemes, document-level causal masks, sliding windows with different sizes per layer — all needed bespoke kernels. FlexAttention collapses that work into a few lines of Python. It is, in effect, a compiler for attention.

Compilation is not always perfect; exotic patterns sometimes fall back to slower paths. Performance on the simplest cases is close to but slightly behind hand-tuned Flash Attention. The trade is generality for a small efficiency gap, which is usually worth it.

II
Family Two · Nine Mechanisms

Sparse and Pattern-Based

If full attention is too expensive, the obvious move is to attend less. The mechanisms in this family pick which token-pairs to compute and ignore the rest — sometimes with fixed patterns, sometimes with learned routing, always with the same gamble: that the discarded pairs did not matter.

10

Sparse Transformer

The first serious attempt to attack quadratic attention. Instead of every token attending to every other token, you allow only structured subsets — strided patterns, fixed local windows. Used to train models on 16,384-token sequences in 2019, which felt impossible at the time.

Two attention heads with different sparsity patterns. One head attends locally — token i attends to a fixed window around itself. Another head attends with a stride — token i attends to every k-th token across the whole sequence. Together they cover the sequence with O(N√N) connections instead of O(N²). The patterns are fixed; the model does not learn them.
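The two patterns and their connection count can be checked directly. This sketch assumes a causal setting and a stride of √N, and builds dense boolean masks purely for counting; a real implementation would never materialise them.

```python
import numpy as np

def sparse_transformer_masks(n, stride):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)              # head 1: the most recent tokens
    strided = causal & ((i - j) % stride == 0)     # head 2: every stride-th token
    return local, strided, causal

n = 256
stride = int(np.sqrt(n))                           # 16
local, strided, causal = sparse_transformer_masks(n, stride)
combined = local | strided

dense_pairs = int(causal.sum())                    # full causal attention
sparse_pairs = int(combined.sum())                 # stays below 2 * N * sqrt(N)

# Any earlier position is reachable in two hops: one strided, then one local.
c = combined.astype(np.float32)
two_hop = (c @ c) > 0
```

The two-hop check is what justifies the design: no pair is connected directly unless it falls on a pattern, but every causal pair is connected through one intermediate token.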

[Figure: Sparse Transformer, local plus strided pattern. Axes: Q against K.]

It established the template that every later sparse method copied: combine local with some-kind-of-global, prove you can approximate full attention. Sparse Transformer also produced Sparse Image Transformer and trained on raw pixels of CIFAR and ImageNet, foreshadowing Vision Transformers.

The patterns are heuristic. Some tasks need attention between specific distant pairs that the fixed stride happens to miss. And implementing fast sparse kernels on GPUs was, in 2019, very hard — much of the speedup theory did not translate to wall-clock improvements.

· · ·
11

Longformer

Sliding window attention plus a small set of "global" tokens that everyone attends to. Linear in sequence length for the local part. The global tokens act as information hubs, like the bulletin board in an office where anyone can post and anyone can read.

Most tokens attend only to a fixed window of their neighbours, say 512 tokens on each side. A few designated tokens — often the [CLS] token, or question tokens in a QA setup — attend to everything and are attended to by everything. The cost is O(N · w) for the local part plus O(N · g) for the g global tokens, which together remain linear in N.
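A small mask builder shows why the cost stays linear. The window size and the choice of the first two tokens as global are arbitrary values for this sketch.

```python
import numpy as np

def longformer_mask(n, window, global_idx):
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window          # local band: `window` tokens on each side
    mask[global_idx, :] = True              # global tokens attend to everything
    mask[:, global_idx] = True              # and everything attends to them
    return mask

def attended_pairs(n, window=8, n_global=2):
    """Number of (query, key) pairs that are actually computed."""
    return int(longformer_mask(n, window, list(range(n_global))).sum())

mask = longformer_mask(1000, window=8, global_idx=[0, 1])
growth = attended_pairs(2000) / attended_pairs(1000)
```

Doubling the sequence length roughly doubles the number of computed pairs, where full attention would quadruple it.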

[Figure: Longformer, window plus global pattern. Axes: Q against K.]

Longformer was the first practical long-document model. Researchers used it to process entire academic papers, long contracts, multi-page chats. The global token trick proved durable — many later sparse methods reinvent it under different names.

You must decide ahead of time which tokens are global. Get it wrong and the model cannot route the information it needs. The local window is a hyperparameter that does not adapt to content.

· · ·
12

BigBird

Longformer plus randomness. Each token attends to its local window, to a few global tokens, and to a few random tokens. The authors prove that this combination is theoretically as expressive as full attention — it can approximate any sequence-to-sequence function.

Three kinds of connections. Local: a sliding window. Global: a small set of always-on tokens. Random: for each token, sample a few other tokens uniformly at random. The random connections give the sparse graph small-world properties — short paths between any pair of nodes — which is what powers the theoretical universality result.

[Figure: BigBird, window plus global plus random pattern. Axes: Q against K.]

BigBird made sparse attention respectable to theorists. The randomness was an interesting move: a way to recover the expressivity of full attention without paying for it. In practice the random part is often less impactful than the global and local parts, but the framing influenced everything that came after.

The proofs are asymptotic. In practice, BigBird does well on long documents but is not noticeably better than well-tuned Longformer. And random patterns are hostile to GPU memory access; getting them fast required custom kernels.

· · ·
13

Reformer

Instead of attending to every key, attend only to keys that are likely to give a high score. "Likely" is determined by locality-sensitive hashing — keys that hash to the same bucket are probably similar, so they are probably the ones you would attend to anyway.

Hash each query and key with a random projection-based LSH scheme. Group together those that hash to the same bucket. Compute attention only within buckets. Repeat with several different hash functions and combine. The complexity drops to O(N log N). The chunking by hash is the heart of it; the rest is engineering.
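A single hashing round can be sketched as below. The hash is the random-projection (angular) scheme; the sketch assumes queries and keys share one matrix and leaves out the sorting, chunking, and multi-round combination that a real implementation needs, so treat it as an illustration of the bucketing idea only.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: bucket = argmax over [xR, -xR] for a random projection R."""
    R = rng.standard_normal((x.shape[1], n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)

def bucketed_attention(QK, V, buckets):
    """Softmax attention restricted to pairs that share a bucket."""
    d = QK.shape[1]
    out = np.zeros_like(V)
    for b in np.unique(buckets):
        idx = np.flatnonzero(buckets == b)
        s = QK[idx] @ QK[idx].T / np.sqrt(d)
        s = s - s.max(axis=1, keepdims=True)
        w = np.exp(s)
        out[idx] = (w / w.sum(axis=1, keepdims=True)) @ V[idx]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))
x[1] = 3.0 * x[0]                     # same direction, different length
V = rng.standard_normal((64, 8))
buckets = lsh_buckets(x, n_buckets=8, rng=rng)
out = bucketed_attention(x, V, buckets)
```

Because the hash depends only on direction, the two parallel vectors land in the same bucket; vectors pointing in unrelated directions usually do not.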

[Figure: Reformer, attention within LSH buckets. Axes: Q against K.]

Reformer was beautiful because it tied attention to a classical algorithm (LSH) with decades of theory. It also introduced reversible residual layers, which let the model train on much longer sequences by trading memory for recomputation. Both ideas filtered into later work.

The hashing is not free. Reformer often ended up similar in wall-clock time to vanilla attention at moderate lengths because the hashing and bucketing overhead ate the asymptotic gains. It remains a beautiful idea that other methods quietly outran.

· · ·
14

Routing Transformer

Like Reformer but with k-means instead of LSH. Cluster the queries and keys with online k-means; only attend within clusters. The clustering is learned during training, so the routing adapts to the data.

Maintain a small set of centroids. At each layer, assign every query and every key to its nearest centroid. Attention is computed only between queries and keys assigned to the same centroid. The centroids are updated by exponential moving average during training, drifting towards the centres of the natural clusters in the data.

[Figure: Routing Transformer, attention within k-means clusters. Axes: Q against K.]

Routing Transformer formalised an intuition: similar queries want similar keys, so route them together. The same idea reappears in mixture-of-experts, in retrieval-augmented attention, in modern sparse methods like MoBA. It was an early instance of a now-pervasive pattern.

Clustering is unstable during early training; sometimes centroids collapse onto one another. The method needs careful regularisation. And like Reformer, the wall-clock advantage was often smaller than the asymptotic story suggested.

· · ·
15

Sliding Window Attention

The simplest possible sparse pattern: each token attends only to the previous w tokens, nothing more. Mistral 7B used w = 4096 and showed that, with deep stacking, information still propagates through the whole context — just indirectly.

At layer 1, the token at position i attends to positions i - w through i. At layer 2, those tokens have themselves already attended to a window before them. So by layer L, information from position i - L · w can reach position i. The receptive field grows linearly with depth, the same way it does when ordinary (undilated) convolutions are stacked.
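The layer-by-layer propagation can be simulated with boolean reachability. This is a toy check of the arithmetic on a short sequence, not a model; the sizes are arbitrary.

```python
import numpy as np

def reachable_after(n, window, layers):
    """reach[i, j] is True if information from position j can arrive at position i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    # One layer: position i reads positions i - window through i.
    one_layer = ((j <= i) & (i - j <= window)).astype(np.int64)
    reach = np.eye(n, dtype=np.int64)
    for _ in range(layers):
        reach = (one_layer @ reach > 0).astype(np.int64)
    return reach.astype(bool)

def earliest_visible(n, window, layers):
    """Earliest position whose information reaches the last token."""
    return int(np.argmax(reachable_after(n, window, layers)[n - 1]))
```

For a 64-token sequence with w = 4, three layers let the last token (position 63) see back to position 51, which is 63 - 3 · 4. Taking the w = 4096 quoted above and assuming a 32-layer stack (the depth is an assumption of this note, not stated in the text), the same arithmetic gives 32 × 4096 = 131,072 positions.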

[Figure: Sliding window, local-only pattern. Axes: Q against K.]

Mistral 7B's strong results showed that you do not need every layer to be full-attention. Stacking is a kind of long-range communication. The KV cache also shrinks dramatically because you only keep the last w tokens — a significant deployment win.

Information has to travel layer by layer. A long-range dependency needs a deep network to be representable. Tasks that require attending precisely to a token far away in a shallow layer (some retrieval problems) suffer.

· · ·
16

StreamingLLM

Sliding window attention, but with a twist: keep the first few tokens around forever. Those "attention sinks" stabilise the softmax and let the model stream indefinitely without quality collapse. A small fix to a real practical problem.

Without sinks, pure sliding-window models degrade catastrophically once the earliest tokens are evicted from the cache. The authors identify the cause: softmax must put its probability mass somewhere, and when no token in view is semantically important, the model learns to dump that mass on the first few tokens, which every later position can see. Evict those tokens and the mass has nowhere to go, so the attention distribution shifts and quality collapses. StreamingLLM keeps the first 4 tokens permanently in the KV cache plus a sliding window of recent tokens. Softmax is happy. The model streams.
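The cache policy itself is tiny. The sketch below tracks only which token positions are retained; the window length of 8 is an arbitrary value for the example, while the 4 sink tokens follow the text.

```python
def streaming_cache_positions(n_seen, n_sink=4, window=8):
    """Token positions kept in the KV cache after n_seen tokens have arrived."""
    if n_seen <= n_sink + window:
        return list(range(n_seen))                       # nothing evicted yet
    sinks = list(range(n_sink))                          # first tokens, kept forever
    recent = list(range(n_seen - window, n_seen))        # rolling window
    return sinks + recent
```

After 100 tokens the cache holds positions 0 to 3 and 92 to 99: twelve entries, and it stays at twelve however long the stream runs.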

[Figure: StreamingLLM, attention sinks plus sliding window. Axes: Q against K.]

It explained a previously mysterious failure mode (long-stream collapse) and offered a one-line fix. Production chat systems that need to handle very long conversations adopted the idea quickly. The "attention sink" terminology entered the field's vocabulary.

You cannot use information from the middle of the stream beyond what is currently in the window. The model has effective short-term memory only. For tasks needing genuine recall over long history this is the wrong tool.

· · ·
17

Native Sparse Attention (NSA)

Sparse attention that the model learns and that is genuinely fast on hardware. Three parallel branches — coarse summary, selective fine-grained, sliding window — each contributing to a unified output. Trained from scratch, not retrofitted.

For each query, three things happen in parallel. First, a compressed branch: chunks of keys and values are pooled into block representations, and the query attends to those summaries. Second, a selective branch: based on the compressed scores, the top-k blocks are chosen for full fine-grained attention. Third, a sliding window for recent tokens. The three outputs are combined with learned gates. Crucially, the kernel is co-designed with the algorithm: blocks are sized to match GPU tile shapes, so the sparsity is real on hardware.

[Figure: NSA, three branches and one output. Compressed: block summaries. Selected: top-k blocks, fine-grained. Window: recent tokens. Combined by a gated sum; co-designed with the GPU kernel.]

NSA is the first sparse attention that is both genuinely trainable end-to-end and genuinely fast at training time. Many earlier sparse methods either were not trainable from scratch or had implementations that did not actually run faster. NSA reports both quality matching full attention and substantial wall-clock speedups, which is unusual.

The architecture has several hyperparameters (block sizes, top-k value, gating). The top-k selection introduces some non-differentiability that requires care during training. And the empirical claim that it matches dense attention is true on certain benchmarks; whether it holds for tasks requiring genuinely dense attention is an open question — likely no.

· · ·
18

MoBA — Mixture of Block Attention

Treat blocks of keys as experts. For each query, route to a small set of blocks and attend within them. Mixture-of-experts logic applied to attention rather than feedforward layers. Powers Kimi's long-context system.

Partition the key sequence into fixed-size blocks. For each query, compute a routing score per block (using a cheap summary of the block). Pick the top-k blocks and perform standard softmax attention against them. The unchosen blocks contribute nothing, but the choice is differentiable through a straight-through estimator.
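The routing step for a single query might look like the sketch below. It assumes the cheap block summary is the mean of the block's keys and ignores causality and the training-time gradient path entirely; the function name and sizes are invented for the illustration.

```python
import numpy as np

def block_routed_attention(q, K, V, block_size=4, top_k=2):
    """Route one query to its top-k key blocks, then attend within them."""
    d = q.shape[0]
    n_blocks = K.shape[0] // block_size
    K_blk = K[:n_blocks * block_size].reshape(n_blocks, block_size, d)
    V_blk = V[:n_blocks * block_size].reshape(n_blocks, block_size, -1)
    summaries = K_blk.mean(axis=1)               # assumed summary: mean key per block
    routing = summaries @ q                      # one routing score per block
    chosen = np.sort(np.argsort(routing)[-top_k:])
    k_sel = K_blk[chosen].reshape(-1, d)         # keys of the chosen blocks only
    v_sel = V_blk[chosen].reshape(-1, V.shape[1])
    s = k_sel @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ v_sel, chosen

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 5))
K[8:12] = 5.0 * q                                # make block 2 an obvious match
out, chosen = block_routed_attention(q, K, V, block_size=4, top_k=2)
```

Setting top_k to the number of blocks recovers ordinary softmax attention, which is a handy sanity check on any block-routing implementation.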

[Figure: MoBA, Mixture of Block Attention. A router sends each query q to its own top-k key blocks.]

MoBA is the visible engine behind Kimi-K1.5 and related Moonshot systems. It scales to very long contexts (millions of tokens) with workable cost, and the routing is content-aware in a way that fixed patterns are not. The framing — attention as a sparse mixture — is a clean abstraction that other groups are now adopting.

The router is a small extra cost, and block-level granularity means you may pull in tokens you did not want or miss tokens you did. Tuning block size and top-k matters more than for dense methods. Like NSA, it shines on retrieval-style tasks and is less clearly better on dense reasoning.

III
Family Three · Twelve Mechanisms

Linear Attention

A different strategy: not sparsifying, but rewriting. Replace softmax with a kernel that lets the matrix products be regrouped. The cost drops from N² to N. The price is a representational ceiling that researchers have been chipping at for five years.

19

Linear Transformer

The original linear attention. Replace the softmax in attention with a non-negative feature map applied to Q and K separately. Then exploit the associativity of matrix multiplication to compute K times V first, before involving Q. The complexity drops to O(N).

Vanilla attention computes softmax(QKᵀ) V. The softmax is what couples queries to keys non-linearly and is what forces you to materialise the N × N matrix. Replace it with a kernel φ(Q) φ(K)ᵀ where φ is, say, elu(x) + 1 applied element-wise. Now the expression becomes φ(Q) (φ(K)ᵀ V), and the parenthesised product is a d × d matrix whose size is independent of N.

softmax(QKᵀ) V   →   φ(Q) (φ(K)ᵀ V)
cost: N²         →   cost: N · d²

The right-hand side, for causal attention, also admits an exact recurrent form: maintain a running state S that accumulates φ(K_i) ⊗ V_i. Each new token updates S in constant time. The Transformer becomes an RNN.
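The equivalence between the quadratic form and the recurrent form can be verified numerically. This sketch includes the row normaliser that the formula above leaves implicit, and uses the elu(x) + 1 feature map; the function names are choices made for the example.

```python
import numpy as np

def phi(x):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_quadratic(Q, K, V):
    """Causal linear attention the slow way, through the N x N matrix."""
    A = np.tril(phi(Q) @ phi(K).T)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_attention_recurrent(Q, K, V):
    """The same result as an RNN: constant-size state, one update per token."""
    d, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d, d_v))       # running sum of phi(k_i) outer v_i
    z = np.zeros(d)              # running sum of phi(k_i), for normalisation
    out = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        fk, fq = phi(K[t]), phi(Q[t])
        S += np.outer(fk, V[t])
        z += fk
        out[t] = (fq @ S) / (fq @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 4)) for _ in range(3))
```

The state S has shape d × d_v no matter how many tokens have been read, which is exactly why inference cost per token is constant.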

[Figure: The right-product trick. Softmax order, O(N²): Q Kᵀ · V, first make the N×N matrix, then multiply by V. Linear order, O(N): φ(Q) · (φ(K)ᵀ · V), where the inner product is d×d with no N. Drop softmax, multiply right first.]

This paper rewrote what attention could be. It revealed that the softmax was not essential machinery — it was a particular choice. Drop it, and attention is a kernel method, with all the theory of kernel methods available. Every linear attention paper since 2020 is, in some sense, a refinement of this one equation.

The original elu+1 feature map is a weak approximation of softmax. Softmax is sharply peaked — it can attend almost exclusively to a single token. elu+1 cannot. So linear attention with this kernel struggles on tasks that require precise retrieval. The gap between linear and softmax attention is what motivated every later refinement in this family.

· · ·
20

Performer / FAVOR+

If linear attention's weakness is a bad feature map, build a better one. Performer uses positive random features to approximate the softmax kernel itself, with provable guarantees. The approximation is unbiased; in expectation, it is exact softmax.

The softmax can be written as a kernel exp(qᵀk). Performer constructs a feature map φ using random projections drawn from a specific distribution, such that E[φ(q)ᵀφ(k)] = exp(qᵀk). With enough random features, the approximation is tight. The method is called FAVOR+ — Fast Attention Via positive Orthogonal Random features.
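The unbiasedness claim can be checked with a Monte Carlo estimate. The sketch below uses the positive feature map φ(x) = exp(ωᵀx − ‖x‖²/2)/√m with plain i.i.d. Gaussian projections; the orthogonalisation step of FAVOR+ is deliberately left out, and the vectors and feature count are arbitrary.

```python
import numpy as np

def positive_random_features(x, omega):
    """phi(x) = exp(omega @ x - |x|^2 / 2) / sqrt(m), one feature per row of omega."""
    m = omega.shape[0]
    return np.exp(omega @ x - 0.5 * (x @ x)) / np.sqrt(m)

rng = np.random.default_rng(0)
q = np.array([0.3, -0.2, 0.1, 0.4])
k = np.array([0.2, 0.1, -0.3, 0.5])
omega = rng.standard_normal((20_000, q.shape[0]))   # i.i.d. Gaussian, not orthogonalised

phi_q = positive_random_features(q, omega)
phi_k = positive_random_features(k, omega)
estimate = phi_q @ phi_k          # Monte Carlo estimate of the softmax kernel
exact = np.exp(q @ k)
```

Every feature is strictly positive, so the estimated attention weights can never go negative. The estimate's variance grows quickly with the norm of q + k, which is the practical weakness for sharply peaked attention.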

[Figure: Performer, random features approximate softmax. Random projections map exp(qᵀk) to φ(q)ᵀφ(k), with E[φ(q)ᵀφ(k)] = exp(qᵀk): unbiased, exact in expectation.]

Performer was the first linear attention with a real theoretical argument for closing the gap to softmax. The randomness gives unbiasedness, the orthogonality controls variance, the positivity ensures stability. The paper showed strong results on Long Range Arena and on protein sequence modelling.

The approximation has variance, and for sharp attention distributions (the ones that matter most for retrieval), the variance is high. In practice, Performer underperforms softmax on language modelling at scale. The theoretical elegance did not quite translate.

· · ·
21

Linformer

Different approach: project the length-N key and value sequences down to a fixed length k, then do attention against the projection. The N × N matrix becomes N × k, which is linear in N if k is treated as a constant.

Multiply K and V on the left by learned projection matrices E and F of shape k × N. The resulting EK and FV have shape k × d. Compute softmax(Q (EK)ᵀ) (FV) as usual. Cost is O(N · k).
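The shapes are the whole story here, so a short sketch helps. Random matrices stand in for the learned projections E and F, and the usual 1/√d scaling is included even though the formula above omits it.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """softmax(Q (EK)^T / sqrt(d)) (FV), with E and F of shape (k, N)."""
    d = Q.shape[1]
    K_proj, V_proj = E @ K, F @ V             # both (k, d): sequence axis compressed
    s = Q @ K_proj.T / np.sqrt(d)             # (N, k) instead of (N, N)
    s = s - s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V_proj, w

rng = np.random.default_rng(0)
N, d, k = 128, 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
E = rng.standard_normal((k, N)) / np.sqrt(N)  # stand-ins for learned projections
F = rng.standard_normal((k, N)) / np.sqrt(N)
out, w = linformer_attention(Q, K, V, E, F)
```

Note that E and F have N baked into their shape, so a model built this way is tied to the sequence length it was trained with.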

[Figure: Linformer, projecting K and V down to k. A learned projection maps K (N × d) to EK (k × d, small). Attention matrices in trained Transformers are low-rank anyway.]

Linformer was a simple, working alternative to softmax that was easy to implement and ran fast. The conceptual argument was based on the observation that attention matrices in trained Transformers tend to be low-rank, so projecting to a low-rank subspace loses little.

The projection matrices have shape k × N, so the architecture commits to a particular sequence length at training time. Different sequence lengths need different projections, which makes Linformer awkward to use at flexible context sizes. It also struggles in autoregressive (causal) settings, because the projection mixes information across positions, including future ones; the standard formulation is most natural for encoders.

· · ·
22

cosFormer

A linear attention that bakes in locality. The feature maps are ReLU(q) and ReLU(k), each multiplied by cosine and sine terms whose phase depends on token position. The resulting cosine reweighting gives nearby tokens a stronger affinity than distant ones, which mirrors what softmax usually does in practice.

Define φ(x, i) = ReLU(x) · [cos(πi/2N), sin(πi/2N)]. The attention score between the query at position i and the key at position j then contains a factor of cos(π(i - j)/2N), which peaks when i = j and decays smoothly. Because cos(a - b) = cos a cos b + sin a sin b, that factor splits into a query-only part and a key-only part, so the output can still be computed via the right-product trick without ever forming QKᵀ.
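The decomposition can be verified by computing the same output twice: once through the explicit N × N matrix with the cosine reweighting, and once through the cos/sin split. This sketch is non-causal, and the small epsilon in the denominators is an addition of the example to guard against all-zero ReLU rows.

```python
import numpy as np

EPS = 1e-6  # guards against a query whose ReLU features are all zero

def relu(x):
    return np.maximum(x, 0.0)

def cosformer_quadratic(Q, K, V):
    """Explicit N x N scores reweighted by cos(pi (i - j) / 2N)."""
    N = Q.shape[0]
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    A = (relu(Q) @ relu(K).T) * np.cos(np.pi * (i - j) / (2 * N))
    return (A @ V) / (A.sum(axis=1, keepdims=True) + EPS)

def cosformer_linear(Q, K, V):
    """Same output via cos(a - b) = cos a cos b + sin a sin b; no N x N matrix."""
    N = Q.shape[0]
    theta = np.pi * np.arange(N)[:, None] / (2 * N)
    Qc, Qs = relu(Q) * np.cos(theta), relu(Q) * np.sin(theta)
    Kc, Ks = relu(K) * np.cos(theta), relu(K) * np.sin(theta)
    num = Qc @ (Kc.T @ V) + Qs @ (Ks.T @ V)              # two right-products
    den = Qc @ Kc.sum(axis=0) + Qs @ Ks.sum(axis=0)
    return num / (den[:, None] + EPS)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
```

The linear version costs two right-products instead of one, which is the full price of the locality prior.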

[Figure: cosFormer, cosine locality weight. The weight cos(π(i-j)/2N) falls from 1 towards 0 as |i − j| grows.]

cosFormer narrowed the gap to softmax meaningfully on language tasks by adding a structural prior that matches how language actually works (nearby words matter more). It was an influence on later refinements that combined linear attention with positional biases.

The locality bias is good for language but hurts on tasks where genuinely distant tokens carry the signal. And like all early linear methods, it still trailed softmax on the hardest retrieval tasks.

· · ·
23

Hedgehog

The problem with linear attention is that the kernel is too smooth — softmax is spiky, and smooth kernels cannot mimic spikes. Hedgehog learns the feature map directly, trained to match the spikiness of softmax. A small MLP, doing what fixed feature maps could not.

Instead of using elu+1 or random features, Hedgehog uses a small two-layer MLP as the feature map, applied to Q and K. The MLP is trained, sometimes via distillation from a softmax teacher, to make φ(q)ᵀφ(k) closely approximate exp(qᵀk) in the regime that actually occurs in trained models. The right-product trick still applies; linearity in N is preserved.

[Figure: Hedgehog, learned feature map. q passes through an MLP φ(·) to give φ(q), trained to match softmax's spikiness.]

Hedgehog explicitly diagnosed the spikiness problem and proposed a learned solution. It substantially closed the quality gap to softmax on language modelling while remaining linear. The technique also enables linearising a pre-trained softmax model post hoc, by fitting Hedgehog feature maps to its attention patterns.

An extra MLP per attention head adds parameters and a small per-token cost. The distillation step adds training complexity. And while the gap narrows, it does not always vanish — softmax remains stronger on the hardest tasks.

· · ·
24

RetNet — Retentive Network

A linear attention with a built-in exponential decay over position. The same model can be unfolded three ways — parallel for training, recurrent for inference, chunkwise for long context — and they are all mathematically equivalent. The "best of both worlds" pitch from RNNs and Transformers.

Each head has a fixed decay rate γ. The retention operation is essentially linear attention where the key-value contribution from position j to position i is weighted by γ^(i-j). This decay makes the recurrent form numerically stable and gives a natural locality bias.

parallel:    O = (Q · Kᵀ ⊙ D) · V    [training]
recurrent:   sₙ = γ · sₙ₋₁ + kₙᵀ vₙ   [inference]
chunkwise:   blocks of both           [long sequences]
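The equivalence of the first two forms can be checked numerically. This is a sketch for one head with a scalar γ, and it leaves out the normalisation and rotation details of the full RetNet layer; the function names are illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training form: O = (Q K^T * D) V with D[i, j] = gamma^(i-j) for i >= j."""
    n = Q.shape[0]
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    D = np.where(i >= j, gamma ** np.maximum(i - j, 0), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference form: s_n = gamma * s_(n-1) + k_n^T v_n, then o_n = q_n s_n."""
    state = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        state = gamma * state + np.outer(K[n], V[n])
        out[n] = Q[n] @ state
    return out
```

The recurrent form keeps only a d × d_v state per head, which is why inference memory does not grow with sequence length.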
[Figure: RetNet, one model in three forms. Parallel (training), recurrent (inference), chunkwise (long sequences); all mathematically equivalent, pick by context.]

RetNet was the first linear-attention architecture marketed as a serious successor to the Transformer. The triple-form equivalence is genuinely elegant: train it like a Transformer, deploy it like an RNN, scale it like a hybrid. It influenced the framing of many later models including Mamba-2 and GLA.

The decay rates are fixed per head, not data-dependent. This is a strong inductive bias — it works well for language, where the locality assumption is mostly correct, and worse for tasks where the relevant token might be arbitrarily far. Later models (GLA, Gated DeltaNet) replaced the fixed decay with learned, data-dependent gating.

· · ·
25

RWKV

An RNN that thinks it is a Transformer, or a Transformer that thinks it is an RNN. RWKV reformulates attention as a recurrent operation with channel-mixing and time-mixing layers. It has been trained at scale by an open community and is among the most widely deployed non-Transformer language models.

Two operations per block. Time-mixing replaces attention: each channel maintains a running state that decays, with new contributions weighted by a learnable function. Channel-mixing replaces the feedforward block. The decay terms vary across channels and across versions of RWKV; later versions (v6, v7) add data-dependent decays and richer state updates, bringing it closer to general linear attention.

[Figure: RWKV block. Time-mix (across positions) is a linear recurrence with learned decay; channel-mix (across features) replaces the FFN block. Trains in parallel, infers recurrently.]

RWKV trains in parallel like a Transformer and runs in constant memory per step like an RNN. The community around it has produced openly licensed models at meaningful scales and demonstrated that linear-recurrent architectures can hold their own. v7 in particular incorporated ideas from Mamba and DeltaNet and showed strong results on standard benchmarks.

RWKV has accumulated complexity across its versions; understanding the current one requires reading several papers. Earlier versions were weaker than softmax on retrieval-heavy tasks. As the architecture has evolved, the gap has shrunk, but it remains a moving target.

· · ·
26

Mamba and Mamba-2

Selective state space models. Mamba was originally framed as an alternative to attention; Mamba-2 then made the connection explicit: state-space duality means its core layer is, formally, a particular kind of linear attention. The model is fast, scales well, and matches softmax Transformers on many language benchmarks.

A state-space model maintains a hidden state that evolves linearly with each input token: h_t = A · h_(t-1) + B · x_t, with output y_t = C · h_t. Mamba's contribution was to make these parameters input-dependent: B, C, and the discretisation step size are computed from the current token, so the effective A varies per token as well, and the model can choose what to remember based on what it sees. Mamba-2 then showed that, with the right structure on these matrices (specifically scalar-times-identity for A), the SSM is mathematically equivalent to a form of linear attention with a particular gating scheme. The authors called this state-space duality (SSD).
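The duality can be seen in a few lines. The sketch below assumes a single channel and a scalar per-token transition a_t, passed in directly as an array (in Mamba these would be computed from the input). C plays the role of queries, B of keys, x of values; the names are illustrative.

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """h_t = a_t * h_(t-1) + B_t * x_t ;  y_t = C_t . h_t  (one channel)."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_as_attention(a, B, C, x):
    """Same map written as masked linear attention: y = (L * (C B^T)) x."""
    log_cum = np.cumsum(np.log(a))
    # L[t, s] = a_(s+1) * ... * a_t for s <= t, and 0 above the diagonal.
    L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    return (L * (C @ B.T)) @ x
```

The matrix L is the "particular gating scheme": a causal mask whose entries are products of the per-token decays.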

[Figure: Mamba selective state space. Hidden states h₁ to h₄ are linked by input-dependent transitions A(x₁) to A(x₄); the model chooses what to remember.]

Mamba reignited the entire field of linear-recurrent models. It scaled. It worked. It made non-Transformer architectures credible at frontier scales for the first time since LSTMs. Mamba-2's framing — SSD as a unifying lens — clarified that what looked like an architectural revolution was, in fact, an excellent point in the linear-attention design space.

Pure Mamba models still trail softmax on certain copying and retrieval tasks where exact recall matters. This is why many production systems use hybrid Mamba-attention stacks. The hardware-aware kernel is tied to specific GPU generations and requires careful implementation.

· · ·
27

Gated Linear Attention (GLA)

Linear attention plus a data-dependent forget gate. The model decides, per token and per channel, how much of the past state to retain. Combined with a hardware-efficient chunkwise algorithm, GLA was the strongest linear attention at language modelling for much of 2024.

The recurrent state update in linear attention becomes S_t = G_t ⊙ S_(t-1) + k_t · v_tᵀ, where G_t is a learned, input-dependent gate applied element-wise. The gate is constrained to be in (0, 1) via a sigmoid. The gating provides the selectivity that earlier linear methods lacked, and the chunkwise computation algorithm preserves linear cost while running fast on GPUs.
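A recurrent-form sketch of the gated update. It assumes the gate is a vector over key channels that is broadcast across value channels, which is one way to realise an element-wise G_t; the text above does not pin down the gate's shape, and `gate_logits` is a hypothetical input standing in for whatever projection produces it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gla_recurrent(Q, K, V, gate_logits):
    """S_t = G_t * S_(t-1) + k_t v_t^T ; o_t = q_t S_t, with G_t in (0, 1)."""
    T, d_k = K.shape
    S = np.zeros((d_k, V.shape[1]))
    out = np.empty((T, V.shape[1]))
    for t in range(T):
        g = sigmoid(gate_logits[t])            # (d_k,), each entry in (0, 1)
        S = g[:, None] * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out
```

With the gate pinned near 1 this reduces to ordinary causal linear attention; with the gate near 0 each token sees only itself. Real implementations use the chunkwise form instead of this Python loop.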

[Figure: Gated linear attention. S_t = G_t ⊙ S_(t-1) + k_t · v_tᵀ: a data-dependent forget gate on the old state plus the new key-value term. The model chooses what to forget, per token, per channel.]

GLA was the centrepiece of the Flash Linear Attention (FLA) library, which became the reference codebase for the linear attention community. It clarified what every modern linear attention needs: data-dependent gating, chunkwise execution, hardware-friendly memory access patterns. Most later models (Gated DeltaNet, Kimi Linear) are evolutions of this template.

The gate adds parameters and small overhead. The model still has a representational limit — fundamentally, it must summarise the past into a fixed-size state, so very long-range exact retrieval remains hard. Beats earlier linear methods clearly; beats softmax less clearly.

· · ·
28

DeltaNet and Gated DeltaNet

Linear attention reinterpreted through the delta rule from classical online learning. Each new token doesn't just add to the state — it updates the state to better predict the value, the way a single gradient step would. The gated version adds GLA-style forget gates on top.

Standard linear attention adds v_t · k_tᵀ to a running matrix S (written here with S mapping keys to values). DeltaNet treats S as a one-layer linear network, and applies the delta (Widrow-Hoff) rule: S_t = S_(t-1) + β_t · (v_t − S_(t-1) · k_t) · k_tᵀ. The term in parentheses is the prediction error for the current key; the update moves the state's prediction S · k_t toward the target v_t, overwriting what was stored under that key instead of piling on top of it. Gated DeltaNet then adds an input-dependent decay to S before the update.
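The difference from additive accumulation shows up as soon as the same key is written twice. A minimal sketch, assuming unit-norm keys and β = 1 so the overwrite is exact; the helper names are illustrative.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """One delta-rule write: S <- S + beta * (v - S k) k^T, with S of shape (d_v, d_k)."""
    return S + beta * np.outer(v - S @ k, k)

def additive_step(S, k, v):
    """Plain linear-attention write, for comparison: S <- S + v k^T."""
    return S + np.outer(v, k)

rng = np.random.default_rng(0)
k = rng.normal(size=4)
k /= np.linalg.norm(k)                          # unit-norm key
v_old, v_new = rng.normal(size=3), rng.normal(size=3)

S_delta = delta_step(delta_step(np.zeros((3, 4)), k, v_old, 1.0), k, v_new, 1.0)
S_add = additive_step(additive_step(np.zeros((3, 4)), k, v_old), k, v_new)

read_delta = S_delta @ k                        # recovers v_new: the old value was replaced
read_add = S_add @ k                            # returns v_old + v_new: both writes linger
```

With β below 1 or non-normalised keys the overwrite is partial, which is the usual operating regime.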

[Figure: DeltaNet online-learning update. S is treated as a one-layer net and takes a single SGD step per token on the prediction error v_t − S_(t-1) · k_t (target minus current guess).]

The delta rule gives the recurrent state a clearer interpretation: it is an online estimator that is being corrected with each observation. Empirically this helped with associative recall tasks where pure additive accumulation fails. Gated DeltaNet has been a strong contender on recent linear-attention benchmarks.

The delta update involves a matrix-vector product before the rank-one update, which is slightly more expensive than naive linear attention. Implementations need care to keep the per-step cost low. As with all linear methods, exact long-range recall remains harder than softmax.

· · ·
29

Lightning Attention

A linear attention with carefully engineered IO-aware kernels: Flash Attention's tricks transplanted to the linear setting. Lightning Attention-2 is the attention block inside MiniMax-Text-01, the first frontier-scale model to use linear attention as its primary mechanism.

Lightning Attention is built on the standard linear attention recurrence but written as a chunkwise algorithm with intra-chunk and inter-chunk passes. The intra-chunk pass computes attention inside a small block in the conventional order, (QKᵀ ⊙ mask) · V, with no softmax; the block is small enough that the quadratic cost is negligible. The inter-chunk pass uses the linear recurrence to carry a running key-value state from block to block. The kernel is heavily optimised for tensor cores and tiled to match the GPU memory hierarchy.

[Figure: Lightning Attention, chunked linear attention. Chunks 1 to 3 are each computed block-wise internally and linked by a linear recurrence between chunks.]

Lightning Attention showed that linear attention could run, at scale, faster than Flash Attention in wall-clock terms, not just asymptotically. MiniMax-Text-01 was the first widely available model with a linear-dominated architecture (a small fraction of softmax layers remain in the stack), demonstrating that linear attention can power production systems at frontier scale.

The chunkwise split (masked quadratic form within chunks, recurrence between them) is more complex than a plain recurrent implementation. MiniMax-Text-01 still includes occasional softmax layers, so pure linear attention is not yet a complete replacement. Implementation is heavily hardware-specific.
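The chunkwise idea behind this entry can be sketched for plain causal linear attention. The version below is an illustration only: it leaves out decay, normalisation, and every kernel-level detail, and simply shows that an intra-chunk pass plus an inter-chunk state reproduces the full result.

```python
import numpy as np

def linear_attention_chunkwise(Q, K, V, chunk):
    """Causal linear attention computed chunk by chunk (no decay, no normalisation)."""
    T, d_v = Q.shape[0], V.shape[1]
    state = np.zeros((K.shape[1], d_v))         # sum of k^T v over all earlier chunks
    out = np.empty((T, d_v))
    for start in range(0, T, chunk):
        q, k, v = (M[start:start + chunk] for M in (Q, K, V))
        intra = np.tril(q @ k.T) @ v            # inside the chunk: small masked quadratic
        inter = q @ state                       # earlier chunks: read the running state
        out[start:start + chunk] = intra + inter
        state = state + k.T @ v                 # fold this chunk into the state
    return out
```

The chunk size trades the quadratic intra-chunk work against the number of sequential state updates; on a GPU it is chosen to fit the on-chip memory tile.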

· · ·
30

Kimi Linear

A linear attention variant from Moonshot, the team behind the Kimi assistant. Combines lessons from GLA, Gated DeltaNet, and Lightning Attention. As of late 2025, it has been the strongest publicly disclosed linear attention on long-context benchmarks.

Kimi Linear uses a gated, delta-style update with a carefully chosen feature map and chunkwise hardware kernel. The technical contributions are largely in the engineering — kernel fusion, KV-cache layout, integration with the broader Kimi serving stack — rather than a brand-new mathematical formulation. The architectural surface looks like a refined GLA/DeltaNet variant.

[Figure: Kimi Linear, engineered for production. Components: delta rule, gating, chunked execution, tuned KV-cache layout, fused production kernels.]

Kimi Linear is the current public benchmark for linear attention quality at long context. New linear-attention papers compare against it. It also feeds into Moonshot's product stack, where it handles part of the long-context workload alongside MoBA's sparse attention.

The model is closed enough that the exact details require some inference from public material. As with all linear methods, performance on dense-reasoning tasks at very long context is the open question. Moonshot's deployment uses Kimi Linear alongside other mechanisms, suggesting it is not yet a single-architecture answer.

IV
Family Four · Six Mechanisms

Hierarchical and
Compressed

These methods do not change the attention algorithm so much as shrink what it operates on. Compress the KV cache. Pool the sequence into multi-scale summaries. Decide, at inference time, which tokens to keep around. The math is still softmax — there is just less of it.

31

Lighthouse Attention

Build a pyramid. At level zero, you have the raw token sequence. At level one, you average pairs of tokens together. At level two, pairs of pairs. And so on. Then, for each query, look at the top few cells at each level. Coarse where you can, fine where you must.

Given a sequence of N keys and values, choose a window size w. Level 0 is the original sequence. Level l is the sequence pooled into chunks of size w^l, with each chunk represented by the mean (or another pooling) of its constituent vectors. For each query, score the cells at every level by their feature-norm or by a learned criterion, and pick the top-k cells across all levels. Attend to those.
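A rough sketch of the pyramid construction and selection, under assumptions of my own: mean pooling, a sequence length divisible by the largest chunk size (any ragged tail is dropped), and the feature-norm criterion mentioned above. The function names are hypothetical.

```python
import numpy as np

def build_pyramid(X, w, levels):
    """Level l holds mean-pooled chunks of w**l consecutive rows of X."""
    pyramid = [X]
    for l in range(1, levels + 1):
        size = w ** l
        n_cells = X.shape[0] // size            # a ragged tail, if any, is dropped
        cells = X[: n_cells * size].reshape(n_cells, size, -1).mean(axis=1)
        pyramid.append(cells)
    return pyramid

def top_cells_by_norm(pyramid, top_k):
    """Return the top-k (level, index) cells across all levels by feature norm."""
    scored = [(float(np.linalg.norm(cell)), lvl, idx)
              for lvl, cells in enumerate(pyramid)
              for idx, cell in enumerate(cells)]
    scored.sort(reverse=True)
    return [(lvl, idx) for _, lvl, idx in scored[:top_k]]
```

A learned criterion would replace the norm with a scoring function of the query and the cell; the pyramid itself would not change.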

[Figure: Lighthouse attention pyramid, levels L0 to L3. The query picks top-k cells across all levels: coarse where it can, fine where it must.]

Lighthouse formalises the intuition that not all parts of a long context need the same resolution. Most of a million-token document is dilution — a coarse summary suffices. A few regions are dense with signal and need precise attention. The pyramid is a clean way to express that.

It is a form of compression, and compression is lossy. Tasks that need fine-grained interactions across the entire sequence (true dense attention) lose information at higher pyramid levels; if the signal is genuinely dense, the pyramid is the wrong shape. It also adds hyperparameters (window size, top-k, number of pyramid levels) that have to be retuned as the model scales.

· · ·
32

H2O — Heavy Hitter Oracle

An empirical observation: in any given attention layer, only a small fraction of tokens consistently get high attention scores. Call them the "heavy hitters." Keep those tokens in the KV cache and evict the rest. The model continues to perform well at a fraction of the memory.

During generation, track the cumulative attention each cached token has received. When the KV cache is full, evict the tokens with the lowest accumulated attention. The retained tokens are the "heavy hitters" plus the most recent tokens (which the model is still going to need). The eviction policy is greedy and runs continuously, with no extra training.
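The retention rule can be sketched as a one-shot selection. This is a simplification: the policy described above evicts greedily as generation proceeds, and it runs per head, whereas the hypothetical helper below takes a finished attention history for one head and picks the survivors in a single pass. It assumes `budget >= n_recent`.

```python
import numpy as np

def h2o_keep_indices(attn_history, budget, n_recent):
    """Pick which cached tokens survive: the most recent plus the heavy hitters.

    attn_history: (steps, n_cached) attention weights each past query gave
    to each cached token (zero where a token did not exist yet).
    """
    n_cached = attn_history.shape[1]
    if n_cached <= budget:
        return np.arange(n_cached)              # cache not full: nothing to evict
    cumulative = attn_history.sum(axis=0)
    recent = np.arange(n_cached - n_recent, n_cached)
    older = np.arange(n_cached - n_recent)
    n_heavy = budget - n_recent
    heavy = older[np.argsort(cumulative[older])[::-1][:n_heavy]]
    return np.sort(np.concatenate([heavy, recent]))
```

Because the attention weights are already computed during generation, the only extra state is one running sum per cached token.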

[Figure: H2O KV cache with cumulative attention per token. Heavy hitters (★ .81, .92, .75) and the most recent tokens are retained; the rest are evicted.]

H2O is a deployment-time win that requires no retraining and no architecture change. For long-context inference where memory is the bottleneck, it shrinks the KV cache dramatically without serious quality loss. It also catalysed a wave of follow-up work on smarter eviction policies.

"Heavy hitters" are determined by past attention — there is no guarantee they will remain useful in the future. Adversarial or long-tail queries can ask about an evicted token, and the model has no recourse. The method is also generation-only; it does not help prefill.

· · ·
33

TOVA

A simpler heavy-hitter variant. At each generation step, look only at the current query's attention scores to decide which token to evict. Whatever the current step found least interesting goes. No accumulated history, no oracle — just the immediate vote.

At step t, the new query q_t computes attention scores against all keys in the cache. The token with the lowest score from q_t is evicted. The cache size stays fixed at some target value k. Implementation is essentially free — the attention scores are computed anyway.
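As a sketch, the whole policy fits in one small function. This version handles a single head and recomputes the softmax for clarity; in practice the weights would be reused from the attention call itself. The function name is illustrative.

```python
import numpy as np

def tova_evict(keys, values, q_t, cache_size):
    """Drop the cached token that the current query attends to least."""
    if keys.shape[0] <= cache_size:
        return keys, values                     # still under budget
    scores = keys @ q_t / np.sqrt(q_t.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    keep = np.delete(np.arange(keys.shape[0]), np.argmin(weights))
    return keys[keep], values[keep]
```

Called once per generated token, this keeps the cache at exactly `cache_size` entries.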

[Figure: TOVA. The current query q_t scores the cached tokens (.4, .7, .05, .8, .6); the lowest-scoring token (.05) is dropped.]

TOVA showed that the simplest possible eviction policy — vote by the current query — is competitive with more elaborate schemes. The lesson is that recent context dominates: what the model is doing right now is the best signal for what it will need next.

Aggressive eviction degrades retrieval over long contexts. The method is best when generation is locally coherent — instruction following, casual chat — and worst when the model needs to recall a distant detail. The "current vote" heuristic is myopic.

· · ·
34

SnapKV

An observation-based approach. Look at the last few prompt tokens — they encode the user's question. See which tokens in the long context they attend to. Keep those, drop the rest. The "snap" is that you do this once at the end of prefill.

During prefill, compute the full attention pattern for the last few tokens of the prompt — the "observation window." Aggregate their attention over all earlier tokens, smoothed by a small max-pooling filter to capture contiguous regions of interest. Keep the top-k tokens; discard everything else from the KV cache. Generation then proceeds with the shrunken cache.
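A single-head sketch of the selection step. Keeping the observation window itself in the cache, the edge padding, and the default pool width of 3 are assumptions made for the illustration; the function name is hypothetical.

```python
import numpy as np

def snapkv_select(attn, window, top_k, pool=3):
    """attn: (prompt_len, prompt_len) attention weights from prefill.

    Returns sorted indices of the prefix tokens to keep, plus the window itself.
    """
    n = attn.shape[0]
    prefix = n - window
    votes = attn[prefix:, :prefix].sum(axis=0)  # what the observation window looks at
    padded = np.pad(votes, pool // 2, mode="edge")
    # Max-pool so that neighbours of a strongly attended token are kept too.
    pooled = np.array([padded[i:i + pool].max() for i in range(prefix)])
    chosen = np.argsort(pooled)[::-1][:top_k]
    return np.sort(np.concatenate([chosen, np.arange(prefix, n)]))
```

The pooling is what turns isolated high-attention tokens into contiguous spans, so that a selected fact arrives with a little of its surrounding context.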

[Figure: SnapKV. The observation window at the end of the prompt already encodes the question; its attention selects (★) which long-context tokens to keep.]

SnapKV exploits the structure of typical LLM use: a long context followed by a focused query. The query's attention is a near-perfect signal of what matters in the context. By collapsing the KV cache to the relevant subset at the moment of inflection, memory drops by 80–95% with minimal quality loss.

The method requires a clear "question" at the end of the prompt; conversational or multi-turn use is harder to handle. If a later turn needs information that was dropped after the first snap, it is gone. SnapKV is a one-shot decision, not an ongoing one.

· · ·
35

Quest

Query-aware KV cache selection at generation time, with bounds. For each query, estimate which cached blocks are most likely to receive high attention, and load only those into SRAM. The unloaded blocks contribute zero. Memory stays low; quality stays high because the selection is dynamic.

Partition the KV cache into blocks. For each block, maintain a compact summary, typically the per-channel min and max of its keys. For an incoming query, compute an upper bound on the attention score for each block using only the summaries. Keep the top-k blocks by bound, load their full KV into SRAM, and run attention against them. The other blocks have lower bounds and are skipped for this step.
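The bound follows from a simple observation: per channel, the largest possible product q_c · k_c over a block is reached at either the block's minimum or its maximum. A sketch, assuming the cache length is a multiple of the block size (a ragged tail is dropped) and using illustrative function names:

```python
import numpy as np

def block_upper_bounds(q, keys, block):
    """Upper-bound q . k over every key in each block from per-channel min/max."""
    n_blocks = keys.shape[0] // block
    blocks = keys[: n_blocks * block].reshape(n_blocks, block, -1)
    k_min, k_max = blocks.min(axis=1), blocks.max(axis=1)    # the stored summaries
    # Per channel, the largest possible q_c * k_c sits at one of the two extremes.
    return np.maximum(q * k_min, q * k_max).sum(axis=1)

def quest_select(q, keys, block, top_k):
    """Indices of the top-k blocks by upper bound, in cache order."""
    bounds = block_upper_bounds(q, keys, block)
    return np.sort(np.argsort(bounds)[::-1][:top_k])
```

The bound is never below the true best score in a block, but it can be well above it, which is where the approximation error comes from.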

[Figure: Quest block bounds. The KV cache is split into blocks with [min, max] summaries; per query, the top-k blocks by upper bound are loaded and the rest skipped, with no permanent eviction.]

Quest provides a principled, runtime-adaptive way to reduce attention cost without permanent eviction. Unlike H2O or SnapKV, it does not throw anything away; it just decides what to look at this step. Different queries can attend to different parts of the same cache.

The bounds are loose — sometimes the block with the highest bound is not the one with the highest true score. Quest accepts a small approximation error in exchange for the speedup. Implementation requires maintaining block summaries, which costs a small constant memory per block.

· · ·
36

InfLLM

Treat the long context as an external memory. Keep a small working window in fast memory, with the rest sitting on the side as retrievable blocks. For each generation step, pull in the most relevant external blocks. The model behaves as if it has full context while really only attending to a slice.

Split the context into blocks. The recent tokens stay in a sliding-window "working memory." Older blocks are stored with representative key vectors. For each generation step, retrieve the top-k external blocks (by similarity between query and block representatives), and attend to those plus the working window. The model is unchanged — InfLLM is purely an inference-time augmentation.

[Figure: InfLLM working and external memory. External blocks (top-k retrieved) sit alongside a sliding working window of recent tokens: unbounded input on a model that was not trained for it.]

InfLLM lets a standard short-context model handle indefinite input lengths without retraining. It is closer to retrieval-augmented generation than to traditional attention, but it operates at the token level inside the model rather than at the document level outside it. The boundary between attention and retrieval starts to blur here.

Retrieval quality bounds the model's quality. The representative vectors are coarse; sometimes the right block is missed. And because the underlying model was not trained at the new effective length, there are edge cases where position handling breaks.

V
Family Five · Three Mechanisms

Latent and
KV Compression

A short family. These mechanisms preserve full softmax attention but compress the keys and values themselves — either by sharing them across attention heads, or by projecting them into a smaller latent space. Memory bandwidth, not compute, is the bottleneck they attack.

37

Multi-Query Attention (MQA)

Standard multi-head attention gives every head its own K and V. MQA gives every head its own Q but makes them all share a single K and V. The math is unchanged; the KV cache becomes h times smaller, where h is the number of heads.

In standard multi-head attention with h heads, you maintain h separate sets of K and V matrices. In MQA, there is a single K and a single V, used by all h queries. Each head computes softmax(Q_i · Kᵀ) V with Q_i head-specific but K and V shared.

[Figure: MQA. Eight query heads h1 to h8 all read a single shared K,V.]

For inference, the KV cache dominates memory bandwidth. Shrinking it h-fold makes generation dramatically faster on memory-bound hardware. PaLM and Falcon, among others, used MQA. It is one of the highest-leverage simple changes in the history of attention.

Some quality loss. With fewer distinct K, V projections, the model has less expressivity per attention block. For most tasks the loss is small; for some it is noticeable. The trade is good but not free, which is why GQA was eventually preferred.

· · ·
38

Grouped-Query Attention (GQA)

The compromise between standard multi-head attention and MQA. Group the heads into g groups; within each group, share K and V. With g = 1 you have MQA. With g = h you have standard attention. The sweet spot is somewhere in between.

If you have h heads and choose g groups, you maintain g sets of K and V matrices. Each head's Q is mapped to one of the g groups. Heads in the same group share keys and values. The KV cache shrinks by h / g; with h = 32 and g = 8, that's a 4× reduction with much less quality loss than MQA.
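A small NumPy sketch of the head-to-group mapping and the cache arithmetic. It is non-causal and omits the projections; contiguous grouping of heads is assumed, and the function names are illustrative. Setting g = 1 gives MQA and g = h gives standard multi-head attention.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d); K, V: (g, n, d) with h divisible by g. Non-causal."""
    h, g, d = Q.shape[0], K.shape[0], Q.shape[-1]
    # Head i reads KV group i // (h // g); repeat makes that mapping explicit.
    K_full = np.repeat(K, h // g, axis=0)
    V_full = np.repeat(V, h // g, axis=0)
    scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ V_full

def kv_cache_values(n_tokens, n_kv_heads, head_dim, n_layers):
    """Number of cached values: K and V, per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens
```

The `repeat` is only for clarity; an efficient kernel would index the shared K and V directly instead of materialising copies, since the copies would undo the memory saving.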

[Figure: GQA. Eight query heads h1 to h8 in four groups, each group sharing one K,V pair (#1 to #4).]

GQA is now the default. LLaMA 2 70B, LLaMA 3, Mistral, Qwen, Gemma — almost every recent open-weight model uses GQA. It is the single best knob for trading a small amount of quality for substantially faster generation. The fact that it is now invisible in design discussions is a sign of how completely it won.

Choosing g is a hyperparameter tuning problem; too low and quality drops, too high and you lose the memory benefit. Most models settle on g around 4 or 8. The architecture is otherwise identical to standard attention, so the trade is genuinely small.

· · ·
39

Multi-head Latent Attention (MLA)

Instead of storing the full keys and values, store a compressed latent vector per token and reconstruct keys and values on the fly. The KV cache becomes dramatically smaller — typically more than ten times smaller than GQA. The model retains full multi-head expressivity at compute time.

For each token, project the input down to a small latent vector c of dimension much smaller than the full K, V dimensions. Store only c. When attention is computed, project c back up to the per-head K and V using learned matrices. By careful algebra, the up-projection can be absorbed into the query and output projections, so the actual stored quantity is genuinely just c.

store:        c_t = W_down · x_t          [small]
reconstruct:  K_t = W_K_up · c_t
              V_t = W_V_up · c_t          [at compute time]
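The absorption trick is just associativity of matrix products. The sketch below uses made-up dimensions and random weights for a single head, and ignores the RoPE complication entirely; it shows that scoring against rebuilt keys and scoring against the latents with a pre-multiplied query give the same numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 8, 16           # illustrative sizes only

W_down = rng.normal(size=(d_latent, d_model))
W_K_up = rng.normal(size=(d_head, d_latent))
W_V_up = rng.normal(size=(d_head, d_latent))

x = rng.normal(size=(5, d_model))               # five cached tokens
q = rng.normal(size=d_head)                     # one head's query at the current step

c = x @ W_down.T                                # (5, d_latent): the only thing stored

# Naive: rebuild the full keys, then score.
scores_naive = (c @ W_K_up.T) @ q
# Absorbed: fold W_K_up into the query once, then score against c directly.
scores_absorbed = c @ (W_K_up.T @ q)

weights = np.exp(scores_naive - scores_naive.max())
weights /= weights.sum()

# Same idea on the value side: mix the latents first, up-project once afterwards.
out_naive = weights @ (c @ W_V_up.T)
out_absorbed = (weights @ c) @ W_V_up.T
```

In a full model the value up-projection would be merged with the output projection, so neither K nor V is ever materialised per token.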
[Figure: MLA latent KV compression. x_t is projected by W_down to a small stored vector c_t; later, at compute time, K_t = W_K_up · c_t and V_t = W_V_up · c_t are rebuilt on demand. Cache more than 10× smaller than GQA.]

MLA, introduced in DeepSeek-V2 and carried into DeepSeek-V3, is the architecture choice that makes those models' serving efficiency possible. The KV cache compression is so aggressive that serving cost per token drops by an order of magnitude relative to comparable GQA models. And unlike many compression schemes, MLA is trained end-to-end: the model learns to make the latent representation sufficient.

The architecture is more complex; the algebraic absorption trick takes a moment to internalise. The latent dimension is a hyperparameter — too small and quality suffers, too large and you lose the memory advantage. RoPE positional encoding interacts awkwardly with MLA and required a special "decoupled" RoPE scheme in the DeepSeek implementation.

VI
Family Six · Five Mechanisms

Hybrid
Architectures

The pragmatist's family. If softmax is too expensive at long context and linear methods are imperfect, alternate them. Most layers do something cheap; a few do softmax where it matters. The result is often better than either pure approach.

40

Jamba

Alternating layers of Mamba and softmax attention, with a mixture-of-experts feedforward on top. The first major hybrid Mamba-Transformer at production scale. Released open-weight and competitive on long-context benchmarks.

The block pattern is one attention layer for every seven Mamba layers, repeated. The mixer choice (Mamba or attention) and the feedforward choice (MLP or mixture-of-experts) are made independently per layer. Mixture-of-experts replaces the MLP in every other layer to add capacity without proportional compute: each token is routed to only a few experts, so total parameters grow much faster than per-token computation.
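The layer schedule is easy to write down. In this sketch the position of the attention layer inside each block of eight and the choice of which alternate layers carry the experts are assumptions for illustration; only the ratios come from the description above.

```python
def hybrid_layout(n_layers, period=8, attention_slot=7, moe_every=2):
    """Return (mixer, ffn) per layer for a 1-attention-per-`period` hybrid."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if i % period == attention_slot else "mamba"
        ffn = "moe" if i % moe_every == 1 else "mlp"
        layout.append((mixer, ffn))
    return layout

layout = hybrid_layout(32)
n_attention = sum(mixer == "attention" for mixer, _ in layout)   # 4 of 32
n_moe = sum(ffn == "moe" for _, ffn in layout)                   # 16 of 32
```

Changing `period` is the knob other hybrids turn when they pick a different attention-to-recurrence ratio.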

[Figure: Jamba layer stack, 1 attention : 7 Mamba. Seven Mamba layers then one attention layer, repeated; softmax anchors retrieval, Mamba carries the load.]

Jamba showed that mixing attention types is not just academically tidy — it is the practical move. The few softmax layers anchor exact retrieval and copying; the many Mamba layers handle the rest cheaply. The 1:7 ratio was a finding, not a guess — the team explored several and settled there. Many later hybrids reused similar ratios.

Two architectures means two sets of optimisations and two sets of failure modes. Implementation, kernel work, and debugging all duplicate. The mixture-of-experts adds another dimension of complexity around routing balance.

· · ·
41

Zamba and Zamba2

Another Mamba-attention hybrid, this one with a shared global attention module reused across the network. The shared block compresses parameter count while keeping the cross-layer flexibility of full attention.

The model is mostly Mamba layers. Periodically, a shared softmax attention block is interleaved — but it is a single set of parameters, used many times across the network rather than duplicated. The Mamba layers compose with this shared module like residual connections to a global memory.

[Figure: Zamba shared attention block. A run of Mamba layers with one shared attention module reused at intervals across the stack.]

Zamba2 is one of the more parameter-efficient open-weight models at its scale — its long-context behaviour is good relative to its size because the Mamba layers carry most of the work cheaply. The parameter-sharing idea is unusual and worth knowing.

Shared blocks can become bottlenecks during training — gradients to the shared module are noisier. The architecture is harder to scale up than a uniformly-stacked design, since the placement of the shared blocks is itself a hyperparameter.

· · ·
42

Samba

A hybrid of Mamba and sliding-window attention. No full softmax anywhere — both components are linear in sequence length. The result is a model that scales to long sequences without the per-step cost of any quadratic operation.

Alternate Mamba blocks (for global, summarised context) with sliding-window attention blocks (for precise local interaction). The window is small — typically 2K. Mamba handles long-range information transfer cheaply through its recurrence; the windowed attention provides exact local resolution that Mamba alone struggles with.

[Figure: Samba. Sliding-window attention and Mamba layers alternate; no full-sequence softmax, both components linear in sequence length.]

Samba is a clean demonstration that the long-context game can be won without any full-sequence softmax layer in the stack; the only softmax left is inside a fixed-size window. Each component is cheap; the combination covers each component's weakness. It is also a useful design point in the bestiary: a fully linear-time stack is possible, if you accept the local-window restriction.

The windowed attention is limited; truly long-range exact retrieval (find a needle 500K tokens back, exactly) is hard. For tasks that need that, Samba is not the answer. For ordinary long-context language modelling it is more than enough.

· · ·
43

Hymba

Most hybrids alternate layers — Mamba, then attention, then Mamba. Hymba runs them in parallel within the same layer. Half the heads do attention, half do Mamba, the outputs are fused. A different way to combine the two.

Each block has parallel branches: attention heads (using sliding window) and Mamba (SSM) heads. Both branches process the input independently and their outputs are concatenated and projected. The attention provides precise recall; the SSM provides long-range summarisation; the fusion lets each contribute where it is strongest.

[Figure: Hymba parallel heads. Within one layer, window attention and Mamba (SSM) branches run side by side, then are fused and projected: both kinds of computation at every layer.]

The parallel framing is conceptually cleaner — each layer has both kinds of computation, rather than relying on the stack to mix them. Hymba reports strong results at small scales, particularly the 1B-1.5B range, where attention-only models often struggle relative to their parameter count.

Parallel branches double the kernel work per layer — though each branch is doing less. The fusion is a learned projection, which adds parameters. The design is more complex to implement and debug than serial alternation.

· · ·
44

MiniMax-Text-01

A 456B-parameter mixture-of-experts model with mostly Lightning Attention layers and occasional softmax attention. The first frontier-scale model where linear attention is the dominant mechanism. Supports 4-million-token context.

The stack is mostly Lightning Attention (linear) blocks with softmax attention inserted at a 1-in-8 ratio. Each layer uses mixture-of-experts feedforward to scale capacity. The linear attention does the long-range work cheaply; the periodic softmax layers handle exact retrieval and copying. Together the architecture supports million-token inference at a fraction of what a pure softmax model would cost.

[Figure: MiniMax-Text-01 layer stack. Lightning and softmax attention layers at roughly 7 : 1; frontier scale, mostly linear, 4M-token context.]

MiniMax-Text-01 is a proof point. Frontier capability with mostly linear attention had not been demonstrated openly until this model. It validates the hybrid template at scale and gives the rest of the field a reference architecture to point at when arguing for non-softmax stacks.

The model is huge and complex. Reproducing its quality outside of MiniMax requires both the architecture and the training recipe. As a public artefact it is more existence proof than off-the-shelf option — but the architectural lessons travel.

VII
Family Seven · Six Mechanisms · a companion

Positional
Encodings

A companion family. These are not attention mechanisms in themselves; they decide how the model knows where a token sits in the sequence. They matter here because long-context behaviour depends on positional encoding at least as much as on the attention mechanism — sometimes more. A great attention with broken positions is useless.

45

RoPE — Rotary Position Embedding

The dominant positional encoding in modern LLMs. Apply a position-dependent rotation in 2D subspaces of the query and key vectors. The dot product between query and key then depends only on their relative position, not absolute. Elegant, parameter-free, hardware-friendly.

Split each query and key vector into 2D pairs. For position m, rotate each pair by an angle m · θ_i, where θ_i is a fixed per-pair frequency. When two rotated vectors are dotted, the result depends on the difference of positions, not the positions themselves. So the attention score between position m and position n sees only m - n.
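A compact sketch of the rotation, assuming the common frequency schedule θ_i = base^(−2i/d) with base 10000 and the interleaved pairing of dimensions (0,1), (2,3), and so on; both are conventions not fixed by the description above.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive 2D pairs of x by the angle position * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_near_start = rope(q, 5) @ rope(k, 2)      # positions 5 and 2, offset 3
score_far_along = rope(q, 103) @ rope(k, 100)   # positions 103 and 100, same offset
```

The two scores agree up to floating-point error, because only the offset m − n survives the dot product. Rotations also preserve vector norms, so RoPE does not change the scale of the scores.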

[Figure: RoPE, position as rotation. Position m is rotated by angle mθ and position n by angle nθ; the dot product depends only on m − n. Parameter-free, relative positions.]

RoPE is everywhere. LLaMA, Mistral, Qwen, GPT-NeoX, Falcon, DeepSeek — all use RoPE in some form. It composes well with Flash Attention, requires no extra parameters, and gives clean relative-position semantics. Most long-context extensions (PI, NTK, YaRN) are RoPE modifications.

RoPE does not extrapolate gracefully past the training context length. The model has only seen the rotation angles that occur within its training range; positions far beyond it produce angles it never learned to interpret, and its behaviour becomes erratic. This is why context extension techniques exist as their own subfield.

· · ·
46

ALiBi — Attention with Linear Biases

No positional embedding in the input. Instead, add a linear bias to the attention scores: the score between positions i and j gets a penalty proportional to |i - j|. Closer tokens get a higher score by default; far tokens get a built-in handicap that the data has to overcome.

For each attention head, fix a slope m_h. The score between query at position i and key at position j is the usual dot product minus m_h · |i - j|. Different heads get different slopes; together they cover a range of locality preferences from strong (high slope) to weak (low slope).
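A minimal sketch of the bias, for illustration only. The slope schedule used here, a geometric sequence 2^(−8h/H) for heads h = 1..H, is an assumption brought in from the ALiBi paper's recipe for power-of-two head counts; this entry itself only says that slopes differ per head. Function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8h/H) for h = 1..H (assumed schedule, H a power of two)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, slope):
    """Matrix with entry [i][j] = -slope * |i - j|, to be added to the raw scores."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes[0], slopes[-1])  # steepest and shallowest head
print(bias[3])                # penalties seen by the query at position 3
```

In a causal model only the entries with j ≤ i are ever used; the rest are masked out before the softmax.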

[Figure: ALiBi, linear penalty on distance. Horizontal axis: |i − j|, starting at 0; one line per head (head 1, head 2, head 3). Add −m·|i − j| to the attention score.]

ALiBi's surprise was that models trained with it extrapolate to longer sequences than they were trained on, much more gracefully than RoPE without modification. The linear decay is smooth and never becomes nonsensical at any distance. Several long-context models adopted it specifically for that property.

ALiBi forces a locality prior on every layer. For tasks where the important token can be anywhere, this is a handicap. RoPE-based models with proper extension have largely surpassed ALiBi on long-context benchmarks, so its moment as the long-context choice has passed.

· · ·
47

PI — Position Interpolation

The first principled context extension for RoPE. Take a model trained at length L that you want to use at length L'. Rescale the RoPE position indices by L / L', so the new long context still fits within the rotation range the model saw. A few hundred steps of fine-tuning recover quality.

RoPE rotates pair i by angle m · θ_i at position m. Position Interpolation replaces this with m · θ_i · (L / L'), compressing the entire new range into the angular range the model was trained on. The model still receives angles it has seen during training, just spaced more finely.
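The rescaling can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the frequencies follow θ_i = base^(−2i/d) with base 10000 (a common default, not specified in this entry), and the 4K and 16K lengths and the tiny dimension are illustrative choices.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Angles (position * scale) * theta_i per 2D pair; scale=1.0 is plain RoPE."""
    return [(position * scale) * base ** (-2.0 * i / dim) for i in range(dim // 2)]


trained_len, target_len = 4096, 16384
scale = trained_len / target_len  # L / L' = 0.25

# Last position of the extended context, without and with interpolation.
plain = rope_angles(target_len - 1, dim=8)
interpolated = rope_angles(target_len - 1, dim=8, scale=scale)

# Angles at the edge of the trained range, used as the reference limit.
limit = rope_angles(trained_len, dim=8)

print(any(a > b for a, b in zip(plain, limit)))         # plain RoPE overshoots
print(all(a < b for a, b in zip(interpolated, limit)))  # PI stays inside
```

With the scale applied, position 16383 is treated as position 4095.75, which is why the model sees familiar angles at a finer spacing.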

[Figure: PI, compress the new range into the trained range. Trained angles (4K) against rescaled angles (16K, fitted into the trained range). Multiply position indices by L/L'; all rotations stay within the seen range.]

PI is the simplest and best-understood context extension. It works. The fine-tuning required is modest (a few hundred million tokens at the longer length). Many open-source long-context releases used PI before more sophisticated alternatives existed.

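Quality degrades when the extension factor is large, because uniform compression also squeezes the fast-rotating pairs and neighbouring tokens become harder to tell apart. Stretching a 4K model to 128K is fine; stretching to 1M loses fidelity. Later methods (NTK, YaRN) address this with frequency-aware scaling.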

· · ·
48

NTK-aware Scaling

An empirical refinement of Position Interpolation. Instead of rescaling all RoPE frequencies uniformly, rescale them in a frequency-dependent way: high-frequency components (fast-rotating pairs, which encode short-range info) are barely touched; low-frequency components (slow-rotating, long-range) are stretched more aggressively. Better extrapolation, less fine-tuning required.

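The name refers loosely to neural tangent kernel reasoning. Each RoPE frequency θ_i is replaced with θ_i · α^(−2i / (D − 2)), where D is the head dimension, i runs from 0 (the fastest pair) to D/2 − 1 (the slowest), and α grows with the extension factor (in the simplest form it is the factor itself). In practice this is done by multiplying the RoPE base by α^(D / (D − 2)). The effect is to leave the fastest frequencies alone (they still cycle correctly) while slowing the slowest frequency by the full factor α (it carries long-range information that the original training did not specify well).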
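A sketch of one widely used form of the rule, which rescales the RoPE base rather than each frequency directly. The assumptions are loud ones: frequencies follow θ_i = base^(−2i/d) with base 10000, the new base is base · s^(d/(d − 2)) for extension factor s, and the head dimension of 128 and factor of 4 are illustrative. Variants such as dynamic NTK choose the factor differently.

```python
def rope_frequencies(dim, base=10000.0):
    """Plain RoPE frequencies theta_i = base^(-2i/dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def ntk_scaled_frequencies(dim, factor, base=10000.0):
    """NTK-aware variant (assumed form): enlarge the base, then recompute frequencies."""
    new_base = base * factor ** (dim / (dim - 2))
    return rope_frequencies(dim, new_base)


original = rope_frequencies(128)
scaled = ntk_scaled_frequencies(128, factor=4.0)
ratios = [s / o for s, o in zip(scaled, original)]

print(ratios[0])   # fastest pair: left alone
print(ratios[-1])  # slowest pair: slowed by roughly the full factor
```

The ratio falls smoothly from 1 for the fastest pair to about 1/4 for the slowest, so only the slowest pair is treated the way Position Interpolation treats every pair.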

[Figure: NTK-aware, frequency-dependent scaling. High-frequency (short-range) pairs against low-frequency (long-range) pairs: fast pairs barely touched, slow pairs stretched hard. Often works with no fine-tuning.]

NTK-aware was the first context extension that worked with no fine-tuning at all — you could take a 4K LLaMA, apply NTK rescaling, and run it at 8K or 16K immediately with usable quality. It was the open-source community's first major contribution to long-context, and it set the stage for YaRN.

The exact rescaling formula is not derived from first principles; it was empirically tuned. Variations (dynamic NTK, NTK-by-parts) proliferated. At very large extension factors quality still degrades. The lack of theoretical grounding made it feel like a hack: effective, but a hack.

· · ·
49

YaRN — Yet another RoPE extensioN

The synthesis. Combines NTK-aware frequency-dependent scaling with PI-style interpolation and an attention temperature adjustment that compensates for the longer sequences. With modest fine-tuning, extends models cleanly to many times their original context length. The standard recipe for context extension in late 2023 and 2024.

YaRN partitions the RoPE frequencies into three regimes by wavelength relative to the training context: short, medium, long. Short wavelengths (the fast-rotating pairs) are left as-is. Long wavelengths (pairs that never complete a full turn within the training context) are interpolated, PI-style. Medium wavelengths are NTK-rescaled: a linear ramp blends the two treatments, the scheme known as NTK-by-parts. A small temperature factor on the attention logits compensates for the slightly different distribution that the rescaled positions induce. A few hundred million tokens of fine-tuning at the new length completes the job.
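The three regimes can be sketched in a few lines. This follows the ramp formulation from the YaRN paper as best understood here, where the middle band is a linear blend between the untouched and the interpolated frequency; treat it as a sketch, not the reference implementation. The ramp bounds alpha = 1 and beta = 32, the base of 10000, and the logit scale (0.1 · ln s + 1)² are values recalled from that paper and are assumptions relative to this entry. Function names are hypothetical.

```python
import math


def yarn_frequencies(dim, factor, trained_len, base=10000.0, alpha=1.0, beta=32.0):
    """Per-pair frequencies after the three-regime treatment (assumed ramp form)."""
    freqs = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        wavelength = 2.0 * math.pi / theta
        rotations = trained_len / wavelength  # full turns inside the trained context
        # gamma = 1: short wavelength, leave unchanged.
        # gamma = 0: long wavelength, interpolate fully.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1.0 - gamma) * theta / factor + gamma * theta)
    return freqs


def yarn_logit_scale(factor):
    """Multiplier for the attention logits (assumed form: (0.1 * ln(factor) + 1)^2)."""
    return (0.1 * math.log(factor) + 1.0) ** 2


freqs = yarn_frequencies(dim=128, factor=8.0, trained_len=4096)
slowest_plain = 10000.0 ** (-2.0 * 63 / 128)

print(freqs[0])                   # fastest pair, unchanged
print(freqs[-1] / slowest_plain)  # slowest pair, divided by the factor
print(yarn_logit_scale(8.0))      # logits are sharpened slightly
```

The fastest pair keeps its frequency, the slowest is divided by the full extension factor, and everything in between lands somewhere on the ramp.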

[Figure: YaRN, three regimes plus temperature. Short λ: unchanged. Medium λ: NTK-rescaled. Long λ: PI-interpolated. Plus an attention temperature adjustment. Treat each frequency by what it does; brief fine-tuning, large extension.]

YaRN is the closest thing to a standard for context extension. Qwen-1M, several Mistral variants, and many community fine-tunes use YaRN. It is the method most likely to be applied to any pre-trained RoPE model that wants to handle long contexts.

Like all extension methods, YaRN's quality at extreme extension factors (32× or more) is unreliable. The fine-tuning still costs compute. Some tasks degrade visibly past 4–8× extension, even with YaRN — a fact worth knowing when evaluating "1M-context" models that were extended rather than natively trained.

· · ·
50

NoPE — No Position Encoding

The provocation: do not use a positional encoding at all. The argument is that causal attention plus a deep stack of layers is enough on its own to give the model positional information implicitly. In some settings, the resulting model extrapolates to longer sequences better than RoPE or ALiBi.

Remove all positional embeddings — no RoPE, no ALiBi, no learned positions, nothing. The only thing the model has to distinguish positions is the causal mask, which makes position t a function of all positions before it but not after. The hypothesis is that this asymmetry, propagated through layers, is enough to encode position in the residual stream.
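One way to make the hypothesis concrete is a toy construction, offered here as an illustration and not as what trained NoPE models are known to do. Assume a head that attends uniformly over everything the causal mask lets it see, and assume the first token carries a value of 1 while every other token carries 0. The function name and the six-token example are hypothetical.

```python
def causal_uniform_attention(values):
    """Average each position's visible prefix: itself and everything before it."""
    out, running = [], 0.0
    for t, v in enumerate(values, start=1):
        running += v
        out.append(running / t)
    return out


# Value 1.0 on the first token, 0.0 on the rest (assumed marker token).
values = [1.0] + [0.0] * 5
signal = causal_uniform_attention(values)

for position, s in enumerate(signal, start=1):
    print(position, round(s, 4))
```

Position t reads out 1/t, a quantity that differs at every position even though no positional encoding was ever added; later layers could in principle build on it.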

[Figure: NoPE, the causal mask is enough. Causality alone gives position information, propagated by depth.]

NoPE's empirical results were surprising. On certain reasoning benchmarks, NoPE models extrapolated to lengths much longer than they were trained on, while RoPE models broke. The paper became a piece of evidence that explicit positional encoding might be more crutch than necessity in causal models — though the picture is complex and task-dependent.

NoPE works best in narrow settings — small models, particular kinds of synthetic tasks. At scale, the most successful models still use RoPE. NoPE is more of a useful provocation than a production option, but it teaches something important about what attention plus causality alone can do.

Fin

Fifty mechanisms, ten years of work, and a great deal more left out than included. Linear attention's gating evolves quarterly; sparse methods invent new routing tricks faster than this page can be updated; some hybrid model published next week will look obvious in retrospect.

The taxonomy in this book is a snapshot. Treat it as such. What persists is the underlying question that every entry tries to answer: how much of the past should the present be allowed to see, and at what cost?

— The Attention Atlas, Vol. I —