Meta released Llama 4: A Huge 10M Context Window, Runs on a Single H100 & Natively Multimodal
Read time: 11 mins
📚 Browse past editions here.
(I publish this newsletter very frequently. Noise-free, actionable, applied-AI developments only.)
Meta just announced Llama 4, and it has an incredible, almost infinite 10 million token context window 🤯 while being designed to run on a single GPU.
It is powered by Meta’s new "iRoPE" architecture, a significant advance beyond the previous Llama 3’s 128K-token capacity. 🔥
A 10M context window means 20+ hours of video, as it’s a natively multimodal model. For text, imagine a standard novel is around 80,000 tokens. So a 10M context window could, in theory, process over 125 such novels simultaneously.
📌 Three new models are announced today:
Scout (17B active parameters, 16 experts). It’s crazy fast, natively multimodal, and very smart. It achieves an industry-leading 10M+ token context window and can also run on a single GPU!
Maverick (17B active parameters, 128 experts, 1M-token context window) is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding, at less than half the active parameters. It offers a best-in-class performance-to-cost ratio, with an experimental chat version scoring an ELO of 1417 on LMArena. It can also run on a single host!
Behemoth (288B active parameters, 16 experts). This model is still training. It’s their most powerful model yet and among the world’s smartest LLMs. Llama 4 Behemoth outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on several STEM benchmarks.
So only Scout and Maverick are released today.
📌 Key Highlights
→ Natively multimodal architecture. All models use early fusion to seamlessly integrate text, image, and video tokens into a unified model backbone.
→ These models mark Meta's entry into native multimodal AI and the adoption of a mixture-of-experts (MoE) architecture.
→ Both MoE models were trained on up to 40 trillion tokens, pre-trained on 200 languages, and significantly outperform their predecessor, Llama 3.1 405B.
→ The Llama series has been re-designed around a state-of-the-art mixture-of-experts (MoE) architecture. The MoE design cuts inference costs while raising quality.
→ Llama 4 Scout has a 10 million token context window.
→ Codistilled from a 2T-parameter teacher model for stronger reasoning.
→ Reduced political bias and more balanced refusal rates.
→ Training incorporated new strategies, including the "MetaP" method for optimizing hyperparameters, real-time reinforcement learning enhanced by adaptive filtering, and co-distillation from the larger Behemoth model.
→ Llama 4 Scout can run on a single Nvidia H100 GPU, while Maverick requires an Nvidia H100 DGX system or equivalent, according to Meta’s calculations.
⚡ Performance Benchmarks
Llama 4 Scout (17B active parameters, 10M context) outperforms previous Llama models on coding, reasoning, and long-text tasks. It also rivals larger models on image comprehension. Llama 4 Maverick (17B active parameters, 400B total) beats GPT‑4o and Gemini 2.0 in multilingual understanding, coding benchmarks, and visual reasoning. Both models benefit from codistillation with the not-yet-released Llama 4 Behemoth (288B active parameters) to achieve higher scores on STEM‑focused evaluations.
LMArena ELO score vs. cost:
"To deliver a user experience with a decode latency of 30ms for each token after a one-time 350ms prefill latency, we estimate that the model can be served within a range of $0.19-$0.49 per million tokens (3:1 blend)" - Meta
Llama 4 Maverick sits at 2nd position overall, becoming the 4th org to break 1400+ on Arena! 🔥
🦙Availability
Llama 4 Maverick and Llama 4 Scout are available to download on llama.com and Hugging Face, with availability across the most widely used cloud and data platforms, edge silicon, and global service integrators to follow shortly.
🖥️ The Llama 4 “Behemoth” Teacher
A separate Llama 4 Behemoth model, with 288B active parameters and 2T total, was trained using the same mixture-of-experts approach. Llama 4 Maverick was codistilled from it to gain better performance on math and reasoning tasks.
⚖️ License and Commercial Usage
A custom commercial license, the Llama 4 Community License Agreement, is available at: github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE
The key takeaway is that while you get broad, free rights to use, modify, and distribute Llama 4, there’s a critical caveat for large-scale corporate use.
For personal use, you must adhere to the attribution and usage requirements (for example, displaying “Built with Llama” and including the copyright notice).
However, if you’re using the model commercially and your product or service reaches more than 700 million monthly active users, you are required to request and obtain a separate license from Meta before you can legally continue using the model.
🏗️ iRoPE (interleaved Rotary Positional Embedding) is Meta’s approach for long-context support in Llama 4 Scout
They replaced standard positional embeddings with interleaved attention layers: the model’s attention blocks are interleaved, allowing it to process segments without fixed positional encodings.
This architecture is particularly suited for tasks requiring extensive textual data, such as summarizing multiple documents or reasoning over large codebases.
Rotary position embeddings (RoPE) are used in most layers. These embeddings handle positional information more flexibly than traditional methods.
Inference-time temperature scaling of attention helps the model generalize to extremely long input sequences.
This design extends the context window to 10 million tokens. It avoids position-specific constraints when dealing with large or streaming inputs. The reference to “i” in iRoPE stands for interleaved, hinting at the long-term goal of supporting infinite context length in future iterations of Llama.
The iRoPE architecture, as described in the Meta AI blog Llama 4 Multimodal Intelligence, involves several critical components:
Interleaved Attention Layers: The architecture employs a configuration where attention layers are interleaved, potentially alternating between different types to enhance the capture of both local and global dependencies. While direct details are scarce, this interleaving likely contributes to the model's ability to handle long sequences by balancing computational efficiency and contextual understanding.
Absence of Traditional Positional Embeddings: iRoPE does not use traditional positional embeddings, aligning with research findings from The Impact of Positional Encoding on Length Generalization in Transformers, which suggests that models without explicit positional encodings (NoPE) can outperform those with explicit methods like ALiBi or Rotary in length generalization tasks. This paper found that NoPE, when trained with SGD, resembles T5’s Relative PE attention patterns, indicating that the model can learn positional information implicitly.
Rotary Position Embeddings (RoPE): iRoPE incorporates RoPE, a method detailed in RoFormer: Enhanced Transformer with Rotary Position Embedding, which encodes positional information by applying rotation matrices to query and key vectors in the attention mechanism. This allows for effective capture of relative positional relationships, crucial for understanding long sequences, and is known for its flexibility in sequence length and decaying inter-token dependency with distance.
Inference Time Temperature Scaling: The architecture includes a mechanism to scale attention scores by a temperature parameter during inference, as noted in the Meta AI blog. This adjustment likely controls the sharpness of the attention distribution, enabling the model to focus more on relevant parts of the context, especially in very long sequences, enhancing performance on tasks like multi-document summarization.
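To make the interleaving concrete, here is a minimal sketch of how an iRoPE-style decoder stack could alternate chunked local RoPE attention with position-embedding-free global attention. The chunk size, the 1-global-per-4-layers ratio, and the module names are my own illustrative assumptions, not Meta's published implementation.

```python
# Hypothetical sketch of an iRoPE-style layer stack (not Meta's code).
# Assumption: one global NoPE layer for every 4 layers, 8K-token local chunks.
import torch.nn as nn

class LocalRoPEAttention(nn.Module):
    """Chunked self-attention; real implementations apply RoPE to q/k within each 8K chunk."""
    def __init__(self, dim, chunk_size=8192):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x):
        # Placeholder: attends over the (chunked) sequence with rotary embeddings applied upstream.
        return self.attn(x, x, x, need_weights=False)[0]

class GlobalNoPEAttention(nn.Module):
    """Full-sequence attention with no explicit positional embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

def build_irope_stack(dim=4096, n_layers=48, global_every=4):
    """Interleave local RoPE layers with occasional global NoPE layers."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % global_every == 0:
            layers.append(GlobalNoPEAttention(dim))   # sees the whole context, no position embeddings
        else:
            layers.append(LocalRoPEAttention(dim))    # 8K-token chunked attention with RoPE
    return nn.ModuleList(layers)
```

The design intuition: the cheap, parallelizable local layers carry most of the positional signal via RoPE, while the sparse global layers stitch chunks together and are the ones that must extrapolate to unseen lengths.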
Aston Zhang, the long-context lead for Llama, talks about the iRoPE architecture.
🏗️ iRoPE Explained
Goal: Train on shorter contexts and generalize to extremely long sequences (256K or more) without explicit position embeddings everywhere. It is called “iRoPE” because it uses interleaved layers (the “i”) and rotary position embeddings (RoPE). The idea is to make context length theoretically unbounded by carefully combining local and global attention.
Local Parallelizable Chunked Attention with RoPE
They use RoPE on local attention blocks. Each of these blocks handles a shorter context window (for example, 8K tokens). Training on smaller sequences is more memory-efficient and still captures local dependencies. These short-context attention layers are fully parallelizable.
Global Attention without Position Embeddings
Certain layers serve as “global” attention layers that see beyond 8K tokens. They omit fixed positional embeddings in these layers to improve length extrapolation. The goal is to let the model handle sequences far longer than it has explicitly seen in training.
Maximum Training Length: 256K
Even though local and global attention are part of the same model, iRoPE only trains up to 256K tokens. Beyond that, it relies on the model’s ability to extrapolate rather than matching exact training patterns.
Flattening Attention Weights at Extreme Lengths
At very large positions (e.g., hundreds of thousands of tokens), attention weights tend to flatten. This hurts the model’s ability to focus on relevant tokens.
Inference-Time Temperature Scaling
To counteract flattened attention, iRoPE multiplies the query vectors in the global attention layers by a scaling factor:
xq *= 1 + log(floor(i / α) + 1) * β
where:
i = position index
α = threshold (e.g., 8K)
β = scale factor
This gives extra weight to tokens that appear later in the context, helping the model maintain more meaningful attention signals over extreme lengths. It preserves short‑range performance (below α) while boosting long‑range reasoning.
This formula modifies the query vector (xq) in the model’s attention mechanism so that tokens at very large positions do not get ignored. It multiplies xq by a factor that grows once the position index i goes past a certain threshold α.
How does the multiplication help?
In attention mechanisms, each token’s query vector (xq) is compared to other tokens’ key vectors (xk) via a dot product to determine how much “attention” should be given to each token. By multiplying xq by a factor larger than 1, we amplify those dot-product scores for tokens at large positions.
Concretely, the attention weight att between query q and key k is often computed as:
att = (q ⋅ k) / sqrt(d)
where d is the dimensionality. If we scale q (tokens deep in the sequence) by a growing factor, then q ⋅ k becomes larger. This makes the attention mechanism pay more heed to those distant tokens, preventing them from being overshadowed by tokens at earlier positions.
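A tiny NumPy example with made-up toy vectors shows the effect: boosting the query by a factor greater than 1 raises the pre-softmax score, which in turn raises the attention weight that token can assign.

```python
import numpy as np

d = 4                                   # toy head dimension
q = np.array([0.5, -0.2, 0.1, 0.7])     # query for a token deep in the sequence
k = np.array([0.4,  0.1, 0.3, 0.6])     # key of some earlier token

score_plain  = q @ k / np.sqrt(d)           # standard scaled dot-product score
score_scaled = (1.1 * q) @ k / np.sqrt(d)   # same query boosted by a factor of 1.1

print(score_plain, score_scaled)  # the boosted score is 10% larger pre-softmax
```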
Looking at the formula again:
xq *= 1 + log(floor(i / α) + 1) * β
Position Index (i): Tells where a token is in the input sequence.
Threshold (α): A cutoff (e.g., 8K) after which the model starts increasing attention scaling.
Scale Factor (β): Determines how strongly the model boosts the attention signal.
Position Index (i)
This is the token’s index in the input sequence, starting from 1 (or 0, depending on the implementation). The bigger i gets, the further you are into the input.
Threshold (α)
This defines where scaling begins. For example, if α = 8000 tokens, positions less than 8000 will see almost no scaling effect. Once i exceeds α, the expression (floor(i / α) + 1) grows larger.
floor(i / α)
Dividing i by α and then taking the integer part. This measures how many “chunks” of length α have passed. For example, if i is 16,000 and α is 8,000, floor(16000 / 8000) = 2.
log(...)
The log function (commonly natural log) increases slowly. By taking log(floor(i / α) + 1), the growth is gradual. It starts at 0 when i is below α (log(1) = 0) and rises bit by bit as i gets bigger.
Scale Factor (β)
This is multiplied by the log(...) term. A higher β makes the increase stronger. If β is small, you get a gentler slope.
1 + [ ... ]
If log(floor(i / α) + 1) is 0 (meaning i < α), then this factor remains 1 and does nothing. Only when i crosses α does the factor exceed 1, scaling up xq and effectively boosting the attention strength at longer contexts.
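Plugging in some illustrative numbers makes the slow growth visible. Note that α = 8192 and β = 0.1 here are assumed values for the sake of the example, not Meta's published hyperparameters:

```python
import math

alpha, beta = 8192, 0.1   # assumed threshold and scale factor (illustrative only)

def query_scale(i):
    """Factor applied to the query vector at position i."""
    return 1 + math.log(math.floor(i / alpha) + 1) * beta

for i in [4_000, 16_000, 100_000, 1_000_000, 10_000_000]:
    print(f"position {i:>10,}: factor = {query_scale(i):.3f}")

# position      4,000: factor ≈ 1.000   (below alpha, attention untouched)
# position     16,000: factor ≈ 1.069
# position    100,000: factor ≈ 1.256
# position  1,000,000: factor ≈ 1.481
# position 10,000,000: factor ≈ 1.711
```

Even at 10 million tokens the boost stays under 2x, which is why short-context behavior is barely disturbed while very distant tokens still get a meaningful lift.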
How it helps
When the model sees extremely long sequences, attention weights tend to “flatten out” because the model struggles to highlight which specific tokens are more important. This formula “turns up the volume” on query vectors at positions beyond α, preserving relevant information across far larger contexts. Meanwhile, for short sequences (below α), the factor stays near 1, so it does not interfere with the usual attention behavior.
This technique lets Llama 4 maintain strong performance for short contexts and extend its capability to contexts of hundreds of thousands—or even millions—of tokens.
🤖 Post-Training Pipeline
This 2 trillion total parameter model (Behemoth) was a huge challenge for them in post-training. They had to revamp their underlying RL infrastructure due to the scale.
Their post-training pipeline, in short: lightweight SFT → online RL → lightweight DPO. Overuse of SFT/DPO can over-constrain the model and limit exploration during online RL, so they keep those stages light.
They first apply lightweight supervised fine-tuning (SFT) on a carefully curated subset of the data. They remove more than half of the “easy” prompts (identified by Llama-based judges) to emphasize tougher problems. This step raises the model’s baseline performance without overconstraining it.
They then switch to continuous online RL with adaptive data filtering. The model generates responses on medium-to-hard prompts, and those prompts with zero advantage or negligible difficulty are filtered out. By cycling through training and filtering, it focuses on challenging examples and builds stronger capabilities in math, coding, and reasoning.
Finally, they perform direct preference optimization (DPO) to manage fine-grained quality issues. They use a lighter DPO stage so it does not degrade performance in more complex tasks. This pipeline ensures a balanced model that handles multimodal inputs effectively, maintains creativity, and still handles high-difficulty prompts reliably.
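To make the "generate, filter, retrain" loop concrete, here is a minimal sketch under my own assumptions: the helper functions (generate, estimate_advantage, rl_update) and the zero-advantage cutoff are hypothetical placeholders paraphrasing the description above, not Meta's actual pipeline.

```python
# Hypothetical sketch of continuous online RL with adaptive prompt filtering.
# generate(), estimate_advantage(), and rl_update() are assumed to exist elsewhere;
# this only illustrates the filtering loop described in the text.

def online_rl_with_filtering(model, prompts, n_rounds=10):
    pool = list(prompts)                      # medium-to-hard prompts
    for _ in range(n_rounds):
        kept = []
        for prompt in pool:
            responses = generate(model, prompt)                    # sample responses
            adv = estimate_advantage(model, prompt, responses)     # how much signal is left?
            if abs(adv) > 1e-3:               # drop zero-advantage / trivial prompts
                rl_update(model, prompt, responses, adv)           # policy update
                kept.append(prompt)
        pool = kept                           # next round keeps only prompts that still challenge the model
    return model
```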
Codistillation from Llama 4 Behemoth refines both of the smaller Llama 4 models, transferring advanced reasoning skills with fewer parameters activated, which further boosts the post‑training results.
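For reference, a generic codistillation objective blends a soft cross-entropy against the teacher's output distribution with the usual hard-label loss. The weighting alpha and temperature T below are placeholder values; Meta has not published the exact loss it used.

```python
import torch.nn.functional as F

def codistillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Generic distillation sketch: blend soft teacher targets with hard labels.
    alpha and T are placeholder hyperparameters, not Meta's published values."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                               # standard temperature correction
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```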
Meta says that it tuned all of its Llama 4 models to refuse to answer “contentious” questions less often. According to the company, Llama 4 responds to “debated” political and social topics that the previous crop of Llama models wouldn’t. Their official blog also mentions that Llama 4 is “dramatically more balanced” in terms of which prompts it flat-out won’t entertain.
That’s a wrap for today, see you very soon next time.