Speed Always Wins: A Survey on Efficient Architectures for LLMs
Paper Title: "Speed Always Wins: A Survey on Efficient Architectures for LLMs"
Absolutely beautiful and exhaustive 82-page survey on Efficient Architectures for Large Language Models.
It maps the ways to make LLMs cheaper, give them longer context, and bring them closer to real time.
Transformers compare every token with every other token, so if the text is 2x longer, the work is about 4x as much. That burns memory because past keys and values are stored for every attention head, and it drags down latency during long chats or reasoning loops.
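A rough back-of-the-envelope sketch of that scaling, with toy model sizes that are my own assumptions rather than figures from the survey:

```python
# Toy estimate of attention cost and KV-cache size as context grows.
# Layer/head counts and fp16 precision are illustrative assumptions.

def attention_cost(seq_len, n_layers=32, n_heads=32, head_dim=128):
    # Pairwise score matrix: every token attends to every other token,
    # so compute scales with seq_len ** 2 per layer.
    flops = n_layers * n_heads * (seq_len ** 2) * head_dim
    # KV cache: keys and values (2 tensors) per layer, fp16 = 2 bytes each.
    kv_bytes = n_layers * n_heads * seq_len * head_dim * 2 * 2
    return flops, kv_bytes

for L in (4_096, 8_192):
    flops, kv = attention_cost(L)
    print(f"{L:>6} tokens: ~{flops:.2e} attention FLOPs, "
          f"~{kv / 2**30:.1f} GiB KV cache")
# Doubling the context roughly quadruples the score computation,
# while the KV cache doubles.
```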
The survey groups fixes into 4 buckets. Linear sequence models redo the math so cost grows with length, not length squared.
They include linear attention, recurrent networks that carry a small state, and state space models like Mamba, which track history with a running summary, so there is no big KV cache.
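A minimal sketch of that idea, assuming a simple positive feature map (ReLU plus a small constant, chosen only for illustration): instead of a seq_len x seq_len score matrix, the model carries a fixed-size running state.

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention with a running state.

    q, k, v: arrays of shape (seq_len, d). The state S has a fixed
    (d, d) shape, so memory does not grow with sequence length.
    """
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map (assumption)
    d = q.shape[-1]
    S = np.zeros((d, d))                       # running sum of outer(phi(k), v)
    z = np.zeros(d)                            # running sum of phi(k) for normalization
    out = []
    for t in range(q.shape[0]):
        kt, vt, qt = phi(k[t]), v[t], phi(q[t])
        S += np.outer(kt, vt)
        z += kt
        out.append(qt @ S / (qt @ z + 1e-6))
    return np.stack(out)

x = np.random.randn(16, 8)
print(linear_attention(x, x, x).shape)  # (16, 8): cost grows linearly with length
```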
Sparse attention keeps the Transformer idea but only connects important pairs. Most tokens look locally, a few tokens act as global anchors, and some methods route tokens to the right places. You get large savings without throwing away core behavior.
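A small sketch of a static sparse pattern in the spirit described here: a sliding local window plus a few global anchor tokens. The window size and anchor count are made-up values, not taken from any specific method.

```python
import numpy as np

def sparse_mask(seq_len, window=4, n_global=2):
    """Boolean attention mask: True where attention is allowed."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window)
        mask[i, lo:i + 1] = True            # local causal window
    mask[:, :n_global] = True               # every token can read the global anchors
    mask[:n_global, :] = True               # anchors can read everything
    # keep it causal: no looking at future tokens
    mask &= np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return mask

m = sparse_mask(16)
print(f"{m.sum()} of {m.size} pairs kept "
      f"({100 * m.sum() / m.size:.0f}% of full attention)")
```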
Efficient full attention keeps exact attention but makes it hardware friendly. Input output aware kernels such as FlashAttention cut reads and writes, and multi-query or grouped-query attention lets many heads share 1 key-value set, cutting cache and bandwidth.
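A rough sketch of the grouped-query idea: several query heads share one key-value head, so the cache stores far fewer key-value sets. The head counts below are illustrative assumptions.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head,
    so the cache holds n_kv_heads instead of n_q_heads key/value sets."""
    group = n_q_heads // n_kv_heads
    outs = []
    for h in range(n_q_heads):
        kv = h // group                                    # which shared KV head to use
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        scores += np.triu(np.full(scores.shape, -1e9), 1)  # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        outs.append(w @ v[kv])
    return np.stack(outs)

seq, d = 16, 32
q = np.random.randn(8, seq, d)
k = np.random.randn(2, seq, d)
v = np.random.randn(2, seq, d)
print(grouped_query_attention(q, k, v).shape)  # (8, 16, 32); KV cache is 4x smaller
```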
Sparse Mixture of Experts adds conditional compute: only a few experts run per token, so capacity grows without paying the full cost at each step. On top of that, memory tricks compress, quantize, or prune the KV cache to stretch context.
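A minimal sketch of top-k expert routing (the expert count and k are toy assumptions): each token's hidden vector picks its highest-scoring experts, so only a fraction of the parameters run per token.

```python
import numpy as np

def moe_layer(x, w_router, experts, k=2):
    """x: (n_tokens, d). w_router: (d, n_experts).
    experts: list of (d, d) weight matrices, one per expert.
    Only the top-k experts run for each token."""
    logits = x @ w_router                               # router scores per token
    topk = np.argsort(logits, axis=-1)[:, -k:]          # indices of the k best experts
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)               # softmax gate weights over all experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:
            out[t] += gates[t, e] * (x[t] @ experts[e])
    return out

d, n_experts, n_tokens = 16, 8, 4
x = np.random.randn(n_tokens, d)
w_router = np.random.randn(d, n_experts)
experts = [np.random.randn(d, d) for _ in range(n_experts)]
print(moe_layer(x, w_router, experts).shape)  # (4, 16): only 2 of 8 experts used per token
```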
The theme is simple: move less data. Methods that cut memory traffic tend to win on modern GPUs, which enables longer context, faster training, and lower serving cost.
This figure is a roadmap of how to make LLMs faster and cheaper, from input tokens to output tokens.

The center shows Efficient Sequence Modeling. One path makes sequence cost scale linearly using things like linear attention, linear recurrent networks, and state space models, plus test-time-training variants and unified linear sequence models. Another path saves work by using sparse attention, so the model only looks at the most useful token pairs. A third path keeps full attention but makes it cheaper with input-output aware scheduling, grouped attention, mixtures of different attention types, and quantization.

Below that sits Sparse Mixture-of-Experts. The model grows capacity by keeping many experts but routes each token to only a few, so compute per token stays low. Different routing rules, expert designs, and conversion tricks live here.

To the right are Hybrid Architectures, which mix building blocks across layers or inside a layer to hit better speed and accuracy tradeoffs.

Next is Diffusion LLM. This family targets non-autoregressive generation so many tokens can be produced in parallel, with methods to connect back to standard autoregressive decoding and to extend into multimodal settings.

The final column highlights reach beyond text, showing where these efficiency ideas apply to vision, audio, and multimodal tasks.
How can we break through the Transformer’s efficiency ceiling? Is costly "intelligence" our only path forward? The survey maps the possible solutions.
A Comprehensive Taxonomy of Efficient Architectures for Large Language Models.
Example Patterns of Static Sparse Attention