Introduction
Transformers have dominated deep learning, but their quadratic complexity in sequence length poses challenges for long contexts and large models (FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · Hazy Research). In 2024 and 2025, researchers explored a new generation of architectures – “post-Transformer” models – that aim to overcome these limitations. These innovations range from efficient Transformer variants that approximate or restructure self-attention (e.g. Performer, linear attention, Reformer) to entirely different sequence modules like state-space models and revamped RNNs. All share a common goal: improve memory and compute efficiency (and often extend context length) without sacrificing too much performance. This report breaks down the key architectures, their principles, efficiency trade-offs, and how major AI labs are adopting these ideas in practice.
The Quadratic Bottleneck of Self-Attention
A standard Transformer’s self-attention requires each token to attend to every other token, leading to O(n²) time and memory complexity for sequence length n. This becomes prohibitive for long sequences (e.g. thousands of tokens) – GPU memory fills up and inference latency grows. Large language models use tricks like caching keys/values, but even the key-value cache grows linearly with n, slowing generation (here); the back-of-the-envelope estimate below shows how quickly this adds up. By 2024 we had a 32k-token GPT-4 and a 100k-token Anthropic Claude, but scaling context further requires fundamental changes. Two broad strategies emerged: make attention more efficient (through approximation or sparsity) or replace attention altogether with architectures that have linear or sub-quadratic complexity. Below, we explore major Transformer variants and alternatives from the past two years, focusing on their design, efficiency improvements, and real-world relevance.
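To make the memory pressure concrete, here is a rough estimate of KV-cache growth (a sketch assuming a hypothetical LLaMA-7B-like configuration – 32 layers, 32 KV heads of dimension 128, fp16 – real models differ):

```python
# Rough KV-cache size estimate for a decoder-only Transformer.
# Hypothetical 7B-class config; swap in your model's actual numbers.
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2  # fp16

# Each generated token stores one key and one value vector per layer and per KV head.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token // 1024, "KB per token")        # ~512 KB

for context in (4_096, 32_768, 131_072):
    gb = kv_bytes_per_token * context / 1024**3
    print(f"{context:>7} tokens -> ~{gb:.0f} GB of KV cache")
# ~2 GB at 4k, ~16 GB at 32k, ~64 GB at 128k -- before counting weights or activations.
```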
Next-Generation Transformer Variants (Efficient Attention)
Performer: Random-Feature Kernel Attention
Performer (Choromanski et al.) introduced a kernel-based approach to approximate softmax attention with linear complexity (Rethinking Attention with Performers | OpenReview). Instead of computing full attention weights, Performer uses FAVOR+ (Fast Attention Via Orthogonal Random features) to map queries and keys into a random feature space where dot-products approximate the softmax kernel. This yields an O(n) time and memory attention mechanism, as the expensive n×n attention matrix is never explicitly formed. Performers provide provably unbiased estimates of true attention with high probability, offering strong theoretical guarantees. In practice, a Performer can handle much longer sequences than a regular Transformer, with accuracy close to exact attention in many tasks.
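A minimal sketch of the random-feature idea (single head, non-causal, illustrative only – the real FAVOR+ additionally orthogonalizes the random projections and handles causal masking):

```python
import torch

def performer_attention(q, k, v, n_features=256):
    """Approximate softmax attention with positive random features (FAVOR+-style sketch).

    q, k, v: (seq_len, d). phi(q) @ phi(k).T approximates exp(q @ k.T / sqrt(d))
    in expectation, so the n-by-n attention matrix is never formed.
    """
    d = q.shape[-1]
    q, k = q / d**0.25, k / d**0.25            # fold the 1/sqrt(d) scaling into q and k
    w = torch.randn(n_features, d)             # random projections (orthogonalized in real FAVOR+)

    def phi(x):
        # positive random features: exp(w.x - |x|^2 / 2) / sqrt(m)
        return torch.exp(x @ w.T - (x**2).sum(-1, keepdim=True) / 2) / n_features**0.5

    q_p, k_p = phi(q), phi(k)                  # (n, m)
    kv = k_p.T @ v                             # (m, d): O(n*m*d), linear in sequence length
    normalizer = q_p @ k_p.sum(dim=0)          # (n,): approximates the softmax denominator
    return (q_p @ kv) / normalizer.unsqueeze(-1)
```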
Trade-offs: The approximation error is controlled by the number of random features used – more features improve fidelity at the cost of extra computation (though still linear in n). Performers don’t rely on fixed sparsity patterns or low-rank assumptions, making them flexible. However, they may still underperform full attention on very complex language tasks unless adequately tuned. In 2024, Performers continued to serve as a baseline for efficient attention research. For example, open-source libraries such as xFormers and FlashAttention expose efficient-attention building blocks – including Performer-style linear attention – for developers to experiment with in large models. Major industry models haven’t fully switched to Performer attention (likely due to mild accuracy loss and added complexity), but the influence is evident – many later architectures (like linear Transformers and RWKV) build on the idea of making attention kernelizable and scalable. Performer’s random feature approach remains one of the core techniques for achieving linear-time attention with negligible loss in quality.
Linear & Low-Rank Attention (Linear Transformers, Linformer, etc.)
Researchers have also attacked the quadratic bottleneck by reformulating the attention operation in ways that yield linear complexity. One line of work, often called Linear Transformers, removes the softmax non-linearity or uses a kernel trick so that attention can be computed via the associativity of matrix multiplication (Bridging the Divide: Reconsidering Softmax and Linear Attention | OpenReview). For example, Katharopoulos et al. (2020) showed that if we use an alternative attention formulation with certain activation functions, the attention scores can be factored to compute outputs in O(n). Similarly, the Linformer (2020) projected the length dimension of keys and values to a lower rank k (with k ≪ n), achieving O(n·k) complexity by effectively compressing attention span. Another variant, Nyströmformer (2021), uses Nyström matrix approximation to construct a low-rank attention matrix, also reducing complexity to near-linear.
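The associativity trick is easy to see in code. A minimal single-head, non-causal sketch, using the elu(x)+1 feature map from Katharopoulos et al. as one illustrative choice:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention sketch: regroup (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V).

    q, k, v: (seq_len, d). The d-by-d summary phi(K)^T V is built once, so the
    cost is O(n * d^2) instead of O(n^2 * d), and no n-by-n matrix is formed.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map
    kv = k.transpose(-2, -1) @ v                             # (d, d) key-value summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (n, 1) per-query normalizer
    return (q @ kv) / (z + eps)
```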
Performance vs. Softmax: A major challenge with these methods was that they often degrade model quality compared to full softmax attention – especially on language tasks (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). In 2024, significant progress was made to bridge this gap. Han et al. (NeurIPS 2024) analyzed why naive linear attention underperforms and identified two key issues: (1) lack of an injective mapping (different queries can produce identical attention weights) and (2) poor locality modeling (unlike softmax, linear attention doesn’t naturally focus on nearby tokens). By addressing these – essentially designing linear attention with unique query embeddings and adding mechanisms for local context – they showed linear attention can actually outperform softmax on vision tasks while staying linear. This narrowed the quality gap and offered a path to high-performance linear Transformers in practice.
Another breakthrough came with LoLCATs (Low-Rank Linear Conversion via Attention Transfer), a late-2024 method for linearizing pre-trained Transformers. Instead of training a linear attention model from scratch (which is costly and often subpar), LoLCATs starts with a regular Transformer LLM and replaces its attention with a linear attention module, then fine-tunes in two steps (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). First, it trains the new linear attention to mimic the original softmax attention outputs (by minimizing an MSE loss between softmax and linear outputs) (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). Second, it applies a light low-rank adaptation (LoRA) to recover any quality loss (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). This strategy dramatically improved the quality of linearized models – closing most of the perplexity gap without full retraining. Using LoLCATs, researchers produced state-of-the-art subquadratic LLMs like a linear-attention Llama-3 8B and Mistral 7B, gaining 20+ point improvements on benchmark tests (5-shot MMLU) compared to earlier linear models (LoLCATs: On Low-Rank Linearizing of Large Language Models | OpenReview). Impressively, they even created the first linearized 70B and 405B parameter LLMs, about 50× larger than prior linear models, and managed to retain over 77% of the quality of the original dense-attention versions. These results suggest that linear Transformers are becoming viable at scale, especially when leveraging knowledge distillation or clever fine-tuning to preserve performance.
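As a rough illustration of the first stage, the attention-transfer objective can be sketched as below (hypothetical module names; the actual LoLCATs code applies this per layer and per head and differs in detail):

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(hidden_states, frozen_softmax_attn, trainable_linear_attn):
    """Stage 1 of a LoLCATs-style conversion (illustrative sketch):
    train the replacement linear-attention module to reproduce the outputs
    of the original softmax-attention module on the same hidden states."""
    with torch.no_grad():
        target = frozen_softmax_attn(hidden_states)   # teacher: original attention output
    pred = trainable_linear_attn(hidden_states)       # student: linear replacement
    return F.mse_loss(pred, target)

# Stage 2 (not shown): freeze the swapped-in attention and run a light LoRA
# fine-tune of the full model to recover any remaining quality gap.
```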
Trade-offs: Linear and low-rank attention mechanisms typically reduce memory usage from O(n²) to O(n) and allow longer contexts on fixed hardware. The cost is usually some loss of modeling power or precision in attention weights. Techniques like kernelization (Performer, linear Transformer) or learned projections (Linformer) may miss some subtle long-range correlations that full softmax would capture. Moreover, some linear attention variants break certain benefits of softmax (e.g. the output is no longer order-invariant to key permutations, meaning the nice probabilistic interpretation of attention is lost). Training stability can also be an issue – many linear attention models needed careful initialization or normalization to converge on large-scale tasks. However, with 2024’s innovations (theoretical fixes and fine-tune strategies), the gap is rapidly closing. In practice, these efficient attention variants are mostly seen in research prototypes and specialized domains (like long-document QA or efficient video models), but they are gaining traction. Notably, Hazy Research (Stanford) and industry startups are actively working to linearize large models (as evidenced by LoLCATs on Llama/Mistral), aiming to deploy LLMs that require constant memory regardless of context length. As hardware trends favor models that can handle longer input, linear attention is poised to play a growing role.
Reformer: LSH Attention and Reversible Layers
Google’s Reformer (Kitaev et al.) was an early efficient Transformer (2020) that remains influential through 2024. It introduced two main ideas: Locality-Sensitive Hashing (LSH) attention and reversible layers (Day:30 Reformer: Efficient Transformer for Large Scale Models - DEV Community). Instead of attending to all n tokens, Reformer uses hashing to find buckets of similar keys: each token is hashed (via a random projection) into one of B buckets such that tokens with similar content likely end up in the same bucket. Attention is then computed within each bucket only, rather than globally. This yields a sparse attention pattern that approximates full attention but at O(n log n) time complexity (from the cost of hashing and sorting). Essentially, each token only attends to the subset of tokens in its hash bucket (plus maybe one neighboring bucket for continuity). Empirically, this captures most of the important context (since unrelated tokens are often in different buckets) while avoiding quadratic cost.
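A toy sketch of the bucketing step (angular LSH via random projections; illustrative only – it omits the sorting, chunking, multi-round merging, and causal masking of the full implementation):

```python
import torch

def lsh_buckets(x, n_buckets, n_rounds=2):
    """Assign each token vector to a hash bucket, Reformer-style (toy version).

    x: (seq_len, d). Each vector is projected onto n_buckets/2 random directions;
    the index of the largest signed projection is its bucket, so similar vectors
    tend to collide. Attention is then restricted to tokens sharing a bucket.
    """
    n, d = x.shape
    projections = torch.randn(n_rounds, d, n_buckets // 2)
    rotated = torch.einsum("nd,rdb->rnb", x, projections)
    rotated = torch.cat([rotated, -rotated], dim=-1)   # (rounds, n, n_buckets)
    return rotated.argmax(dim=-1)                      # bucket id per token, per hash round
```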
The second innovation, reversible residual layers, addresses the memory footprint during training. In a standard Transformer, each layer’s activations are stored for backpropagation, which multiplies memory usage by the number of layers. Reformer makes each layer invertible – meaning the input can be reconstructed from the output. In training, instead of storing every activation, the model recomputes the needed activations on the fly by running the layer in reverse. This dramatically cuts memory use (at the expense of some extra compute), enabling training on longer sequences with the same GPU memory. The combination of LSH attention and reversible layers allowed Reformer to handle sequences of length up to 1 million in the original paper, using only 16 GB of memory (something infeasible for a regular Transformer) (Reformer: The Efficient (and Overlooked) Transformer - Medium).
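A minimal sketch of a reversible residual block (RevNet-style, as used in Reformer), where f and g stand in for arbitrary sublayers such as attention and feed-forward:

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Reversible residual block sketch: inputs are recomputable from outputs,
    so per-layer activations need not be stored for backpropagation."""

    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g          # e.g. f = attention sublayer, g = feed-forward sublayer

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)           # recompute the inputs instead of caching them
        x1 = y1 - self.f(x2)
        return x1, x2
```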
Trade-offs: Reformer achieves big efficiency gains but adds complexity. LSH attention is probabilistic; there’s a chance that relevant tokens don’t hash into the same bucket, potentially missing important context. The model mitigates this by doing multiple hash rounds, but it can still be less accurate than full attention on certain tasks. Also, the non-differentiable nature of hashing required some care (e.g. treating the hash assignments as fixed during backprop). Training a Reformer can be finicky – the random hashing makes it harder to reproduce results exactly or debug issues. Reversible layers, while clever, impose constraints on layer design (each layer must be invertible) and can make certain architectures (like cross-attention or complex sublayers) harder to implement. In practice, Reformer did not see wide adoption in flagship industry models, but it pioneered techniques now seen elsewhere. For example, sparse attention has become a common idea – models like Longformer and BigBird use fixed local windows and global tokens to attain linear complexity for long texts, inspired by similar goals as Reformer’s LSH attention. These sparse Transformers have been used in document analysis and DNA sequence modeling. Likewise, the idea of reversibility is used in other large networks (such as some diffusion models) to save memory. In the Hugging Face Transformers library, Reformer is implemented and available to experiment with, though one might use it today mainly for research on ultra-long sequences or if memory is the primary bottleneck. The Reformer taught the community that sparsity and smart memory management can push Transformers to new lengths – lessons that 2024 models continue to leverage.
FlashAttention and Memory-Efficient Implementations
Not all innovations require changing the Transformer architecture; some come from better algorithms and implementations. A prime example is FlashAttention (Tri Dao et al. 2022, with FlashAttention-2 in 2023) – an exact attention algorithm that produces the same results as standard attention but uses memory far more efficiently. FlashAttention leverages GPU memory hierarchy (SM caches and high-bandwidth memory) by tiling the attention computation (FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · Hazy Research). Instead of computing the full n×n attention matrix and then multiplying by values (which would write a huge intermediate matrix to memory), FlashAttention breaks the sequence into blocks and computes attention for one block at a time, writing out only the final output. By never materializing the full matrix S = Q·K^T or the full softmax probabilities matrix, it reduces memory usage from quadratic to linear in n. This yields 10–20× lower memory overhead for typical sequence lengths (FlashAttention-2: Faster Attention with Better Parallelism and Work ...). The algorithm also fuses many low-level operations to reduce redundant memory reads/writes, achieving significant speedups (2–4× faster training in many cases). Importantly, this is done without any approximation – the results are identical to vanilla attention, just computed in a more GPU-friendly way.
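The core idea can be sketched in plain PyTorch as blockwise processing with an online (running-max) softmax. This is only an illustrative reference – the real FlashAttention does this tiling inside GPU SRAM with fused reads and writes:

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Exact attention computed block-by-block over keys/values.

    q, k, v: (seq_len, d). A running maximum and running softmax denominator are
    kept per query row, so the full n-by-n score matrix is never materialized.
    """
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)

    for start in range(0, n, block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        scores = (q @ k_blk.T) * scale                        # (n, block): one tile at a time
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)             # rescale previous accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum                                      # matches softmax(QK^T/sqrt(d)) V
```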
FlashAttention was quickly adopted across the industry in 2023–2024. It became part of PyTorch (via scaled_dot_product_attention, and as the default in some versions of torch.nn.MultiheadAttention), and was integrated into JAX/Flax and TensorFlow through XLA optimizations. Many companies reported using FlashAttention to train long-context models; OpenAI’s and Meta’s long-context variants (e.g. 32k-token LLMs) rely on such memory-efficient kernels to not blow past GPU limits. In late 2023, FlashAttention-2 further improved throughput (e.g. ~2× speedup by better parallelizing across GPU threads), enabling training of 2× longer context with the same time cost as before (FlashAttention-2: Faster Attention with Better Parallelism and Work...). These advances don’t alter the Transformer’s architecture, but they make long contexts and large models practical. Essentially, FlashAttention changed the mindset from “O(n²) attention is intractable for n=100k” to “O(n²) can be managed if you handle memory smartly”.
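Using the fused kernels from PyTorch is essentially a one-liner. A minimal example (assuming PyTorch 2.x and a CUDA GPU; on unsupported devices or dtypes the call falls back to a standard math implementation):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) layout expected by scaled_dot_product_attention
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to a fused, memory-efficient (FlashAttention-style) kernel when possible.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```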
Alongside, frameworks introduced high-level support for efficiency. PyTorch 2.0+ released BetterTransformer and later FlexAttention APIs, which automatically use optimized kernels and even allow defining custom attention patterns in a few lines of code that compile to efficient fused kernels (FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention | PyTorch). This lowers the barrier for researchers to test new attention variants without writing CUDA code. These tooling improvements mean that any new efficient attention (kernel-based, sparse, etc.) can often be plugged into a training pipeline and run near hardware peak speeds. The net effect is that major labs (OpenAI, DeepMind, Meta, etc.) have widely adopted these memory-efficient attention implementations in 2024 – not necessarily changing the core of the Transformer, but ensuring that even if they stick with classic attention, it runs as fast as possible on modern GPUs. For example, Meta’s LLaMA-2 (in its largest variant) and other LLMs use grouped-query attention (GQA) – a relaxation of Multi-Query Attention (MQA) in which several query heads share each K/V head, sketched below – in decoding to cut cache size by ~8× with negligible quality loss (here). This MQA idea (from Shazeer, 2019) is another efficiency tweak now standard in large models. In summary, 2024 did not see the mainstream Transformer replaced yet – but through FlashAttention and related tricks, the standard architecture became far more efficient, buying time for more radical “post-Transformer” ideas to mature.
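A minimal sketch of the MQA computation (one shared K/V head broadcast across all query heads; non-causal, illustrative only):

```python
import torch

def multi_query_attention(q, k, v):
    """Multi-Query Attention sketch: many query heads share a single K/V head.

    q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim).
    Only the single shared K/V head needs to be cached during decoding, so the
    KV cache shrinks by a factor of n_heads versus standard multi-head attention.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bokd->bhqk", q, k) * scale   # shared keys broadcast over heads
    weights = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bokd->bhqd", weights, v)
```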
Alternatives Beyond Self-Attention (RNNs and State-Space Models)
While some researchers optimized the Transformer, others asked a bolder question: do we even need attention? In 2024, there’s been a resurgence of recurrent architectures and other frameworks that seek to capture long-range dependencies without the full attention mechanism. These “post-Transformer” architectures often have linear memory and compute scaling by design, making them attractive for large-scale deployment – if they can match Transformers in capability. Here we discuss two prominent families: State Space Models (SSMs) and modern RNN variants, as well as hybrids, all of which saw significant breakthroughs in 2024.
State Space Models (SSMs) – Example: Mamba (S6)
State Space Models approach sequence modeling by learning a continuous-time dynamical system that is discretized to process the sequence. In an SSM, you maintain a hidden state vector that evolves with each new input according to learned linear dynamics plus nonlinear controls. A notable SSM line is the S4 model (2022), which used HiPPO matrices to efficiently capture long-range dependencies with convolution-like operations. However, early SSMs struggled to match Transformers on language – they were fast, but not as good at “content-based” interactions (basically, soft attention’s ability to strongly link arbitrary distant tokens).
At the end of 2023, Mamba (Gu & Dao, 2024) emerged as a game-changing SSM architecture (dubbed S6, as it’s an evolution of S4) (Linear Attention and Mamba: New Power to Old Ideas - Synthesis AI). Mamba’s key innovation is introducing “selective” state updates: instead of fixed dynamics, the state update equations have parameters that are functions of the input token (Mamba: Linear-Time Sequence Modeling with Selective State Spaces). In other words, the model can adaptively decide to propagate or reset certain information based on the content it’s processing (akin to an RNN’s gates, but formulated in the SSM continuous framework). This gives Mamba a form of content-based attention without an explicit attention matrix – it can learn to attend to important tokens by preserving their influence in the state, or forget irrelevant ones, all in a single pass through the sequence. The challenge is that making SSM dynamics input-dependent breaks the trick that allowed fast Fourier convolution in S4. Mamba addressed this by a custom parallel scan algorithm that still runs efficiently on accelerators in O(n) time. The resulting model completely eschews attention and even feed-forward layers; it’s basically a stack of SSM-based recurrent units with skip connections (hence the tongue-in-cheek line: “without attention or even MLP blocks”).
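A heavily simplified, sequential sketch of the selective recurrence (Euler-style discretization rather than the paper's exact zero-order hold; dt_proj, B_proj and C_proj are assumed to be small linear layers supplied by the caller):

```python
import torch
import torch.nn.functional as F

def selective_ssm_scan(x, A, dt_proj, B_proj, C_proj):
    """Sequential sketch of a selective state-space recurrence.

    x: (seq_len, d_in); A: (d_in, d_state) with negative entries for stability.
    Unlike S4, the step size dt and the B/C matrices depend on the current input,
    so the state can selectively keep or discard information. The real Mamba
    fuses this loop into a hardware-aware parallel scan.
    """
    h = torch.zeros(x.shape[1], A.shape[1])                  # (d_in, d_state) hidden state
    ys = []
    for xt in x:                                             # one token at a time
        dt = F.softplus(dt_proj(xt))                         # (d_in,) input-dependent step size
        B, C = B_proj(xt), C_proj(xt)                        # (d_state,) input-dependent B and C
        A_bar = torch.exp(dt.unsqueeze(-1) * A)              # discretized state transition
        h = A_bar * h + (dt * xt).unsqueeze(-1) * B          # selective state update
        ys.append(h @ C)                                     # (d_in,) readout for this token
    return torch.stack(ys)                                   # (seq_len, d_in)
```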
Performance: Mamba showed for the first time that an attention-free model can reach Transformer-level results on major tasks. It achieved state-of-the-art across modalities including language, audio, and genomics. Notably, a 3-billion-parameter Mamba model outperformed a Transformer of the same size and even matched the performance of a Transformer twice its size on language modeling benchmarks. It also excelled at extremely long sequences: on tasks with inputs up to one million tokens, Mamba’s performance kept improving (while Transformers couldn’t even be run in those regimes). Importantly, Mamba runs in linear time and memory – it has 5× higher inference throughput than Transformers for long sequences. These are remarkable feats: efficiency without sacrifice, exactly the post-Transformer promise.
Trade-offs and Adoption: As a new architecture, Mamba is complex to implement and has many components (control theory meets deep learning). Training large SSMs requires expertise (e.g. careful initialization to maintain stability). But the success of Mamba has sparked tremendous optimism. Its design essentially gives a blueprint for attention-like capabilities through continuous state evolution. Already, we’ve seen follow-ups: researchers extended Mamba to images (Vision Mamba, an SSM analogue of vision Transformers) (Linear Attention and Mamba: New Power to Old Ideas - Synthesis AI), and explored variants with different state parametrizations (e.g. “Mamba-2” with a simplified structure (State Space Duality (Mamba-2) Part I - The Model | Goomba Lab)). The Gradient and other AI blogs in 2024 ran detailed explainers to disseminate Mamba’s ideas (Mamba Explained - The Gradient). As of late 2024, Mamba is still a research prototype, but organizations interested in ultra-long context (e.g. analysis of whole genomes, long videos, etc.) are experimenting with it. Its code is open-source (state-spaces/mamba: Mamba SSM architecture - GitHub), and we might see it or similar SSMs in production for specialized tasks soon. The big takeaway is that state-space models are now a viable alternative: they deliver linear scaling and have caught up in accuracy (Mamba: Linear-Time Sequence Modeling with Selective State Spaces), after years of lagging behind attention.
Recurrent Revival: Modern RNNs (RWKV, xLSTM)
Before Transformers, RNNs (LSTMs, GRUs) were the workhorse of NLP. They fell out of favor because Transformers parallelize much better (an RNN processes tokens sequentially, hindering GPU utilization). In 2024, researchers revisited RNNs with fresh perspectives, aiming to combine the sequential efficiency of RNNs (constant memory, infinite context via hidden state) with tricks to allow parallelism during training. Two notable results are RWKV and xLSTM.
RWKV (Receptance Weighted Key-Value) is an architecture developed by the open-source community (led by Bo Peng et al.) that essentially blends a Transformer block into an RNN (RWKV: Reinventing RNNs for the Transformer Era). The idea is to have the recurrence (hidden state passing) of an RNN, but design the recurrence update in a way that looks like a Transformer’s attention + feedforward computation unrolled over time. In practice, RWKV consists of time-decay terms and gating that mimic the effect of attention, but without explicit pairwise attention – each new token updates a hidden state that carries summarized information of past tokens, similar to how an LSTM carries a cell state. Crucially, RWKV was built so that it can be trained with parallelizable computations (like a Transformer) but then used as a pure RNN at inference. The authors describe it as “either a Transformer or an RNN” depending on perspective. Technically, they use a form of linear attention internally, which allows the model’s forward pass to be rearranged for parallel training. The payoff: at inference, RWKV streams like an RNN, with constant memory and compute per token (no growing context overhead), and during training it can process sequences in batches like a Transformer.
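A simplified sketch of the recurrent “WKV” update at inference time (per-channel operations on vectors of hidden size d; the real RWKV kernel also carries a running-maximum state for numerical stability):

```python
import torch

def rwkv_wkv_step(a, b, k_t, v_t, w, u):
    """One recurrent step of a simplified RWKV-style WKV mixer.

    a, b: running (decayed) sums of exp(k)*v and exp(k) -- the hidden state.
    w: learned per-channel time decay; u: bonus weight for the current token.
    The output is an attention-like weighted average over the whole past,
    computed without storing any previous keys or values.
    """
    out = (a + torch.exp(u + k_t) * v_t) / (b + torch.exp(u + k_t))
    a = torch.exp(-w) * a + torch.exp(k_t) * v_t   # decay old contributions, add the new one
    b = torch.exp(-w) * b + torch.exp(k_t)
    return out, a, b
```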
Performance: In late 2023, RWKV was scaled up to 14 billion parameters, making it the largest RNN ever trained. Remarkably, it was shown to perform on par with similarly sized Transformers on language tasks. This result, presented at NeurIPS 2023/2024, suggests that RNNs can match Transformer LLMs in quality when properly scaled and trained. RWKV models in the 1.5B to 7B range were also compared to GPT-style models and found to have comparable perplexity. The open-source AI community embraced RWKV in 2024: you can find RWKV-4 and RWKV-5 models on Hugging Face, and enthusiasts tout that RWKV is much faster for long text inference since it doesn’t need to carry around a huge attention cache. It’s also more memory efficient at inference – you just maintain a hidden state (maybe a few thousand floats) instead of tens of thousands of key/value vectors. These benefits make RWKV attractive for deploying on CPU or mobile devices, or anytime streaming throughput is critical.
Trade-offs: As an RNN at heart, RWKV is still non-parallel in token processing at inference – you can’t easily get the next token without the previous, which complicates certain batching scenarios. Training RWKV to billions of parameters required bespoke optimization (the community leveraged tricks such as segment-wise parallel training). Also, while RWKV matches Transformer performance on perplexity, there’s ongoing analysis on whether it captures nuances like zero-shot generalization or complex reasoning to the same degree (some early reports suggest it may be slightly behind on some benchmarks, but improving). Importantly, RWKV demonstrates that recurrent networks with modern designs can scale and even excel in the LLM setting, something many had doubted. It has not yet been deployed by a major company in a flagship product (most big labs still use Transformers for their GPT-4, PaLM, etc.), but RWKV’s success influenced the research directions of 2024: it was a proof-of-concept that the reign of Transformers might not be absolute. In fact, Google DeepMind’s later work (like Griffin, below) cites RWKV as inspiration for blending RNN efficiency with Transformer performance.
xLSTM (Extended LSTM) is another 2024 entrant, coming from perhaps the most authoritative source on RNNs: the lab of Sepp Hochreiter, co-inventor of the LSTM. At NeurIPS 2024, Hochreiter and colleagues unveiled xLSTM, a revamped LSTM architecture designed to be scaled up to billions of parameters and compete with Transformers (xLSTM: Extended Long Short-Term Memory). They revisited the LSTM’s core design – gating and memory – and introduced two main modifications: (1) exponential gating (replacing sigmoid gates with exponential functions plus normalization and stabilization) and (2) new memory structures. In fact, xLSTM has two variants of memory: sLSTM (which uses a scalar cell state per unit with refined update rules) and mLSTM (which uses a matrix-valued cell state, enabling more information storage and some parallelism). By integrating these into a deep network with residual connections (very much like how Transformers are structured in blocks), they create xLSTM blocks that can be stacked dozens of layers high. Essentially, xLSTM is an LSTM redesigned for the scale and depth that Transformers operate at.
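A heavily simplified sketch of the exponential-gating update (sLSTM-flavored; the published architecture adds a stabilizer state for the exponentials, block-structured projections, and the matrix-memory mLSTM variant):

```python
import torch

def slstm_step(c, n, gate_preacts):
    """One step of a simplified sLSTM-style cell with exponential gating.

    c: cell state, n: normalizer state; gate_preacts = (i, f, z, o) pre-activations
    for the current token (all tensors of the same hidden size).
    """
    i_pre, f_pre, z_pre, o_pre = gate_preacts
    i = torch.exp(i_pre)           # exponential input gate
    f = torch.exp(f_pre)           # exponential forget gate (a sigmoid is also allowed)
    z = torch.tanh(z_pre)          # candidate cell input
    o = torch.sigmoid(o_pre)       # output gate
    c = f * c + i * z              # gated memory update
    n = f * n + i                  # normalizer tracks the accumulated gate mass
    h = o * (c / n)                # normalized hidden state
    return h, c, n
```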
Performance: The authors reported that xLSTM matches state-of-the-art Transformers and SSMs in language modeling performance, and shows excellent scaling behavior. In their experiments, xLSTM models in the billions of parameters were able to reach parity with Transformer LLMs on several benchmarks. Moreover, xLSTM retains the desirable features of RNNs: it has a fixed-size hidden state that summarizes the past, so it can in principle handle arbitrarily long sequences (no fixed context window). There’s also the promise of fast inference – an xLSTM generates tokens step-by-step without large memory growth, similar to RWKV. The mLSTM variant is noted to be “fully parallelizable,” which suggests that some aspect of its computation can be parallelized across time steps (possibly by treating the state matrix evolution with linear algebra tricks). This could mean easier training or maybe processing multiple tokens per step in some way, though details are beyond our scope here.
Implications: xLSTM is significant not just technically but also symbolically – it’s the original LSTM concept, 25 years later, coming back to challenge Transformers on their own turf (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). The work hints that recurrent networks may ultimately be more powerful sequence models than transformers, given enough innovation (echoing some theoretical results that RNNs can simulate transformers but not vice-versa). In practice, xLSTM’s code (by the NX-AI research group) is available, and there is talk of them building a European LLM based on xLSTM (Sepp Hochreiter on X: "Again xLSTM excels in time series prediction ...). While it’s early, xLSTM could form the backbone of new LLMs that need efficiency – for instance, a chatbot that can keep a conversation history indefinitely without a context length limit. Training stability was a focus in xLSTM’s design; by using normalized exponential gates and robust memory updates, they avoided issues like gradient explosion that plagued old RNNs. The result is a highly stable deep RNN that can be trained similarly to a Transformer. As of 2025, xLSTM remains in the research/demo phase, but it reinforces the trend: recurrent architectures are staging a comeback, armed with modern tricks to finally scale to the needs of giant models.
Hybrid Models: Mixing Recurrence and Attention (DeepMind’s Griffin)
Given the respective strengths of Transformers and RNNs, a compelling idea is to combine them – get the best of both worlds. In 2024, Google DeepMind did exactly that with a model called Griffin (Soham De et al., 2024), which mixes a linear recurrence with local attention (here). Griffin’s architecture consists of repeating blocks in which two recurrent layers are followed by one local attention layer, all within residual blocks (Transformer alternatives in 2024). The recurrent part (used on its own in the pure-RNN “Hawk” model) is a gated linear recurrent unit similar to RWKV’s recurrence – essentially an RNN that can be unrolled indefinitely, with gating mechanisms to regulate information. The attention part is a Multi-Query local attention that operates on a fixed window of recent tokens with shared key/value projections. By alternating these, Griffin can handle long sequences through its RNN aspect while still benefiting from the high-resolution local pattern learning that attention excels at.
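The local-attention half of such a block is easy to sketch. A minimal single-head version with a hypothetical 128-token window (a real implementation computes only the in-window scores rather than masking a full matrix):

```python
import torch

def local_causal_attention(q, k, v, window=128):
    """Sliding-window causal attention sketch.

    q, k, v: (seq_len, d). Each query attends only to itself and the previous
    window-1 tokens, so per-token cost and cache size stay constant no matter
    how long the sequence grows -- the attention half of a Griffin-style block.
    """
    n = q.shape[0]
    pos = torch.arange(n)
    mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```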
Results: Griffin is notable because it was scaled up and directly compared against top-tier Transformer LLMs. A Griffin model with 7B parameters and another with 14B were trained on a large corpus (though using fewer tokens than a typical Transformer). The 14B Griffin matched the performance of Meta’s Llama-2 13B on evaluation tasks, despite being trained on 6× fewer tokens. This sample efficiency indicates the inductive bias of having recurrence + attention may help the model generalize from less data. Moreover, Hawk (the pure RNN part) by itself outperformed Mamba of similar size on downstream tasks, showing that DeepMind’s optimized RNN can even beat the state-of-the-art SSM. Griffin also demonstrated an ability to extrapolate to much longer sequences than seen in training. For example, if trained on sequences up to 1K tokens, it can handle inputs of 4K or more at test time with graceful degradation, thanks to the RNN component (which inherently isn’t limited by position embeddings). Hardware-wise, Griffin was designed to train as efficiently as a Transformer (they even implemented the recurrence using a custom kernel in JAX/TPU that parallelizes it). At inference, Griffin has lower latency and higher throughput than a Transformer of the same size, because the RNN can continue stepping forward without accumulating a larger and larger history. The authors specifically note that sampling long sequences is faster with Griffin than with a Transformer baseline.
Significance: Griffin essentially validates the approaches of RWKV and others in a hybrid form: you can retain a bit of attention for local detail and use RNNs for long-range, and achieve performance equal to the best Transformer with big efficiency gains. DeepMind’s involvement means this idea got a lot of eyes – it won a spotlight at NeurIPS 2024 and is discussed in context of building the next generation of LLMs. A telling point: Griffin-14B matched Llama-2 quality (here). One can imagine future models (Llama-3 or others) adopting similar hybrid layers to push context lengths to, say, 1 million tokens while keeping compute manageable. The hybrid approach also highlights a trade-off spectrum: pure attention vs pure recurrence is not binary – you can tune how much attention (how large the local window, how frequent the attention layers) to balance quality vs. efficiency. In industry, a hybrid may be easier to adopt than an entirely new architecture because part of it is still a familiar Transformer. We might see, for example, a production translation model or speech model that uses a Griffin-like block to get streaming capability and fine detail alignment.
DeepMind’s Griffin and related experiments indicate that the frontier of sequence modeling in 2024 is about combining techniques: attention is fantastic for pattern matching and was the core of the 2017–2023 models, but adding recurrence can bring fundamental advantages in memory and generalization. As training datasets plateau in size (we can’t keep throwing trillions of tokens), more sample-efficient architectures like hybrids could become very appealing to industry.
(Aside: Another line of hybrids involves convolution or spectral methods – e.g. Stanford’s Hyena (2023) uses long convolutions to replace attention. 2024 saw some follow-ups (like HyenaDNA for genomics and SE(3)-Hyena for 3D data) that combine Hyena’s long convolutional filters with specialized processing (Hyena architecture enables fast and efficient protein language ...). These indicate that convolution-based “attention” is also being explored, but as of 2024, RNN/SSM approaches have gained more traction in language tasks than pure convolution.)
Industry Impact and Tooling Support
The flurry of post-Transformer research in 2024 is not happening in isolation – major AI labs and companies are both inspiring it and adopting its findings:
OpenAI & Others (Attention Optimization): Big proprietary LLMs like GPT-4 have (to public knowledge) stuck with the Transformer architecture, but they heavily rely on efficiency improvements such as multi-query attention and FlashAttention. OpenAI’s research team contributed to techniques like model parallelism and memory optimization that complement these architectures. For instance, OpenAI’s Triton library was instrumental for kernels like FlashAttention. While OpenAI hasn’t announced using a Performer or RWKV-like model in production, they certainly benefit from the tooling that came out of this research – enabling longer contexts (32k in GPT-4, possibly more in future) with tolerable costs (FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · Hazy Research). OpenAI’s decision to offer a 128k-context version of GPT-4 (via its API and Azure OpenAI Service) likely leverages some of the sparse or chunked-attention ideas (maybe sliding window attention with retrieval) that trace back to Reformer/Longformer concepts.
Google DeepMind: Google has been at the center of efficient Transformer research. Reformer and Performer came from Google researchers. In 2024, DeepMind’s focus on retention and recurrence is evident: they created Griffin, and explored related ideas like Retentive Networks (RetNet) – a 2023 model from Microsoft Research where each token’s influence decays exponentially over time, effectively limiting context length dynamically (this can be seen as a form of learned recurrence). Broader work on combining SSMs with attention – and the movement of key people into industry (Mamba co-author Tri Dao is now at Together AI) – shows a concerted effort to find what comes after pure Transformers. It’s telling that DeepMind could match Llama-2 with an RNN-heavy model (here) – this surely influences their future internal projects (like perhaps a version of Gemini, Google’s next-gen model, might incorporate recurrence for longer contexts). On the tooling side, JAX-based libraries like Trax and Google’s TensorFlow Research frameworks have added support for many of these layers (e.g. Trax has fast state-space layers from S4). Google’s TPU compiler also incorporates optimizations for long sequence RNN/SSM processing (Linear Attention and Mamba: New Power to Old Ideas - Synthesis AI) (as mentioned in the Griffin paper, they wrote a custom kernel for their RG-LRU recurrent unit).
Meta (Facebook): Meta has been slightly quieter on radically new architectures in 2024, but they did adopt grouped-query attention (a multi-query-style scheme) in LLaMA-2’s largest model (reducing memory and compute for decoding) and have an open-source library xFormers that implements many efficient attention variants (including Performer, Linformer, sparse attention, etc.). This library is used in projects like Metaseq and fairseq to allow researchers to plug in alternatives easily. Meta’s AI research has also delved into Mixture-of-Experts (MoE) for scaling and efficient routing, which is orthogonal but related (MoE reduces compute per token by activating subsets of the model – a different kind of efficiency). In 2024, with the open release of LLaMA 2 Long (supporting 32k tokens via positional interpolation and segmented attention), Meta showed interest in long contexts but achieved it by scaling existing mechanisms rather than a new architecture. However, they’re certainly tracking external progress – Meta researchers have participated in long-range arena benchmarks, and one can imagine future LLaMA versions considering an architecture like a hybrid if it proves significantly more efficient.
Microsoft: Microsoft’s research introduced LongNet (2023), which used dilated attention to extend context length to 1 billion tokens (effectively chunking the sequence hierarchically). They also developed RetNet as mentioned, which is a form of RNN (each layer maintains a state that decays) incorporated into a Transformer block. These were pre-2024 developments, but in 2024 Microsoft integrated such ideas into their training toolkit DeepSpeed – e.g. DeepSpeed offers a Sparse Attention library and support for blockwise parallel evaluation of RNNs. Azure’s hardware for AI benefited from FlashAttention (which was implemented in DeepSpeed as well). Microsoft’s backing of OpenAI means they’re likely testing some of these efficient architectures on the side – for instance, a 14B RWKV model was trained and released by a community member with support from Azure credits, hinting at interest in the approach. We might see Microsoft include a state-space or recurrent module in an Azure service for, say, real-time processing of long sequences (where transformers struggle). On frameworks, Microsoft’s ONNX Runtime and NNFusion have been adding kernels for variants like linear attention and sparse ops, so that models exported to ONNX can still benefit from custom attention implementations.
Emerging AI Startups: Perhaps the quickest to jump on post-Transformer architectures are smaller AI startups and research collectives. For example, Together AI (the team of Dan Fu and others) has been championing RWKV and state-space models in 2024. They gave talks (like the Latent Space Live podcast) summarizing these developments, and even trained their own models with RWKV and linear attention conversion (LoLCATs) (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). Hazy Research (at Stanford) likewise pushed LoLCATs for democratizing LLMs (so smaller orgs can train subquadratic models without a trillion-token budget) (2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]). Startups like MosaicML (now part of Databricks) have integrated long-context and efficient attention into their platforms; MosaicML’s MPT-7B model had a 65k context version using sliding window attention and gating, which is an idea derived from earlier efficient models. xAI (Elon Musk’s AI venture) has hired researchers aware of these trends, and while it’s not public what they are building, there is speculation they might attempt something novel to differentiate from just another Transformer – possibly leveraging these 2024 advances.
Library and Framework Support (2024/25): By now, many of the post-transformer innovations are accessible to engineers. Hugging Face’s Transformers library includes implementations for Reformer, Longformer, BigBird, and has community-supported models for RWKV (there is an official RWKV model class and checkpoints). It’s increasingly easy to try a Performer or Linformer attention in your own model via libraries like xFormers or PyTorch’s attention_ops. JAX’s Flax library has modules for state-space layers (e.g. in the fast_transformers or nn.state_space modules). We also saw new libraries specialized for long sequence models: scaling Transformers (a project by EleutherAI) provides efficient attention alternatives; MegaBlocks by MosaicML allows training with block-sparse attention for long contexts at lower cost. All major training frameworks (PyTorch, JAX, TensorFlow) now have some form of XLA/GPU-accelerated support for custom attention ops, meaning that if you devise a new attention pattern (sparse, linear, etc.), you can implement it and get good performance with far less effort than before. This convergence of ideas and tools means that post-Transformer research can transition to real-world use faster. Indeed, we already see it: novel architectures like Mamba and xLSTM had reference implementations released immediately, and others can reproduce or fine-tune them. As companies aim to deploy models on edge devices or process ever-growing inputs (think multi-hour podcasts, entire codebases, etc.), the motivation to adopt these efficient architectures in production is high.
Conclusion
The Transformer isn’t dead, but it’s no longer the only game in town. 2024 marked a tipping point where efficient Transformer variants proved their worth and alternative architectures (RNN/SSM hybrids) matched baseline Transformers in key metrics. The driving force is efficiency – in memory, computation, and the ability to handle long contexts needed for real-world tasks. Major AI labs are actively exploring these innovations: some are being quietly incorporated into products (long context via efficient attention kernels), others are still experimental but highly promising (recurrent hybrids potentially in next-gen models). We now have a spectrum of architectures to choose from, depending on the application’s needs: from approximate attention that’s drop-in for existing models to fundamentally new designs that break the quadratic barrier. The trend in 2025 is likely a combination of these: hybrid models that use attention where it’s needed and recurrence or convolution elsewhere to scale gracefully. In terms of tooling, the gap between cutting-edge research and practical implementation has narrowed, thanks to libraries and system support for these new layers.
In summary, the post-Transformer era is about making sequence models more efficient and scalable. Whether it’s through clever re-imagining of attention (Performer, linear, Reformer) or revisiting old ideas with fresh eyes (SSMs, RNNs), the field is moving toward models that can learn like Transformers but run like RNNs – combining the best of both. The table below compares the main architectures discussed, highlighting their computational complexity, training stability, and adoption as of this writing.