Paper Explained - Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
This paper introduces Cache-Augmented Generation (CAG) as an alternative to Retrieval-Augmented Generation (RAG) for knowledge-intensive tasks using long-context LLMs.
Original Problem: 🤔
→ Retrieval-Augmented Generation (RAG) integrates external knowledge into Large Language Models (LLMs) to enhance their responses.
→ However, RAG introduces issues like retrieval latency, potential errors in document selection, and increased system complexity.
Cache-Augmented Generation (CAG): A Simple Approach to Rethink Retrieval in LLMs
Retrieval-Augmented Generation (RAG) is a popular method that enhances LLMs by fetching external documents and injecting them into the model during inference. But this process comes with some significant drawbacks—real-time retrieval slows things down, introduces errors when the wrong documents are retrieved, and increases system complexity by adding multiple components. The paper presents a streamlined solution: Cache-Augmented Generation (CAG).
CAG skips real-time retrieval entirely by preloading relevant documents into the LLM’s context window before the query is processed. This "preloading" creates a cached memory of the key-value (KV) pairs that represent the model's state when these documents are processed. When a user sends a query, the model reuses this cached state, avoiding the need to repeatedly parse the same knowledge.
How CAG Works: Breaking It Down
Core Idea: Instead of retrieving information each time a question is asked (like in RAG), CAG preloads all the necessary information into the LLM beforehand.
The CAG process has three phases: external knowledge preloading, inference, and cache reset.
External Knowledge Preloading
→ Prepare the Documents: First, you gather all the documents relevant to the task. This collection is your external knowledge base. Let's say these documents are D = {d1, d2, ..., dn}.
→ Format and Process: These documents are then formatted to fit within the model's context window. The LLM (M), with its parameters (θ), processes these documents.
→ Create the KV Cache: This processing step generates what's called a Key-Value (KV) cache. You can think of the KV cache as the model's internal reading of the documents: the attention keys and values it produced while processing them, organized so they can be reused directly for later queries. This is represented as CKV = KV-Encode(D).
→ Store the Cache: This KV cache (CKV) is stored either on disk or in memory. It's like creating a cheat sheet that the model can quickly reference later. The important thing is that this computationally intensive step is done only once.
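To make the preloading phase concrete, here is a minimal sketch using Hugging Face transformers. The model name, the document wrapper text, and the kv_encode helper are illustrative assumptions, not the paper's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any long-context causal LLM works the same way.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def kv_encode(documents):
    """C_KV = KV-Encode(D): one forward pass over the documents, keeping the KV cache."""
    doc_prompt = ("You answer questions using only the documents below.\n\n"
                  + "\n\n".join(documents))
    doc_ids = tokenizer(doc_prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(doc_ids, use_cache=True)       # the computationally heavy step, done once
    return out.past_key_values, doc_ids.shape[1]   # the cache plus its length (needed for resets)

documents = ["First reference document ...", "Second reference document ..."]
kv_cache, doc_len = kv_encode(documents)
# The cache can now be kept in memory (or serialized, e.g. with torch.save) and reused for every query.
```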
Inference
→ Load the Cache: When a user asks a question (Q), the precomputed KV cache (CKV) is loaded along with the query.
→ Generate the Response: The LLM then uses this cached context to generate a response (R). This is represented as R = M(Q | CKV).
→ No Retrieval Needed: Because the information is already in the cache, there's no need to retrieve anything during this phase. The model directly uses the cached information to answer the question. This is the big advantage over RAG. It is like having all the answers at your fingertips.
→ Unified Context: The combination of the preloaded documents (D) and the query (Q) ensures that the model has a complete understanding of both the external knowledge and the user's specific question.
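Continuing the sketch above, inference feeds only the query tokens while reusing the cached document state. This is a simplified greedy-decoding loop that glosses over details such as attention masks and sampling; it is a sketch under those assumptions, not the paper's exact implementation.

```python
def generate_with_cache(query, kv_cache, max_new_tokens=128):
    """R = M(Q | C_KV): only the query (and the answer being generated) passes through the model."""
    q_ids = tokenizer("\n\nQuestion: " + query + "\nAnswer:",
                      return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    ids, answer_ids = q_ids, []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=kv_cache, use_cache=True)  # documents are never re-encoded
            kv_cache = out.past_key_values
            next_id = out.logits[:, -1:].argmax(dim=-1)                 # greedy decoding, for simplicity
            if next_id.item() == tokenizer.eos_token_id:
                break
            answer_ids.append(next_id)
            ids = next_id                                               # feed back only the new token
    if not answer_ids:
        return "", kv_cache
    return tokenizer.decode(torch.cat(answer_ids, dim=-1)[0]), kv_cache

response, kv_cache = generate_with_cache("What does the first document say?", kv_cache)
print(response)
```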
Cache Reset
→ Maintaining Efficiency: To keep the system running smoothly over many interactions, the KV cache can be reset efficiently.
→ Append-Only Nature: The KV cache grows in an append-only manner as new tokens (the query and the generated answer) are added during inference. Let's say these new tokens are t1, t2, ..., tk.
→ Truncation for Reset: Resetting involves truncating (or removing) these new tokens from the cache. This is represented as Creset = Truncate(CKV, t1, t2, ..., tk).
→ Avoid Reloading: This allows for quick reinitialization without having to reload the entire cache from storage. It ensures the system remains fast and responsive.
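A hedged sketch of the reset step, continuing the same hypothetical helpers: it assumes either a recent transformers version whose DynamicCache provides a crop() method, or the older tuple-of-tensors cache format.

```python
from transformers import DynamicCache

def reset_cache(kv_cache, doc_len):
    """C_reset = Truncate(C_KV, t1, ..., tk): drop the tokens appended during inference,
    keeping only the document prefix, so the next query starts from a clean cache."""
    if isinstance(kv_cache, DynamicCache):
        kv_cache.crop(doc_len)   # assumption: crop() is available in this transformers version
        return kv_cache
    # Legacy tuple format: slice every layer's keys and values along the sequence axis.
    return tuple((k[:, :, :doc_len, :], v[:, :, :doc_len, :]) for k, v in kv_cache)

kv_cache = reset_cache(kv_cache, doc_len)   # ready for the next, unrelated question
```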
→ Advantages of CAG:
→ Reduced Inference Time: No real-time retrieval means faster responses to user queries.
→ Unified Context: The entire knowledge collection is preloaded, leading to better understanding and more consistent responses.
→ Simplified Architecture: No need to integrate separate retrieval and generation components, making the system easier to manage.
→ When is CAG Most Effective?
→ CAG shines when dealing with a relatively fixed and manageable set of documents. Imagine scenarios like answering questions based on a specific textbook or a company's internal documents. In such cases, preloading all this information makes a lot of sense.
→ CAG eliminates the main problems associated with RAG: retrieval latency, potential errors in document selection, and overall system complexity. It's a streamlined approach that leverages the capabilities of long-context LLMs to deliver fast and accurate responses in knowledge-intensive tasks.
Cache-Augmented Generation (CAG) vs. Traditional Whole Knowledge Base Inclusion in the Prompt
The key difference lies in how the knowledge is accessed and utilized during inference.
Precomputed Key-Value (KV) Cache: CAG does more than simply load all references at the beginning of the prompt and cache that prompt. It precomputes the key-value (KV) cache from the reference documents, then reuses that precomputed internal representation for every subsequent query. That means the model does not re-encode the same text for each user prompt, which is exactly what happens when you simply put all documents into a single prompt.
So the core of CAG lies in creating this specialized cache, not just a prompt cache. This cache is generated by encoding the entire set of documents (knowledge base) beforehand into a key-value (KV) representation. Think of it like indexing the information in a way that's highly optimized for the LLM's internal workings. This encoding is a computationally intensive process, but it only happens once during the preloading phase.
In the standard method, the entire knowledge base is included directly within the prompt. This means that the LLM has to process the entire knowledge base every time it generates a response, even if only a small portion of the knowledge is relevant to the specific query. This can lead to increased inference time and computational overhead.
In that setup, the hidden states for the knowledge base are regenerated every time a new question is asked, leading to slower responses.
In contrast, CAG preloads all relevant passages and precomputes the key-value (KV) cache, so the reference text never has to be passed as input at inference time. This removes both retrieval latency and retrieval errors and allows for a more streamlined system.
So, when a query comes in, the LLM doesn't need to re-process the original documents. Instead, it uses this precomputed KV cache. The LLM's attention mechanism can efficiently access relevant information within this cache to generate the response. The KV cache is specifically structured to facilitate how the LLM processes information.
→ Another distinction is how you can reset or truncate parts of the cache. CAG can remove only the newly added query tokens while preserving the main document encodings. That eliminates the overhead of re-encoding the documents each time, which is not achievable with static prompt caching alone.
→ Precomputing a KV cache also changes what the model has to compute at inference time. It streamlines queries because the model references an already established hidden-state structure. A raw prompt with all documents merely frontloads them as text; CAG frontloads them into the model's own internal hidden states, enabling a more efficient mechanism for knowledge reuse.
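To make the contrast concrete, the sketch below (reusing the hypothetical helpers from the earlier snippets) answers several different questions from one precomputed cache; a prompt-stuffing baseline would instead re-encode the full documents-plus-question prompt for every call.

```python
questions = [
    "What does the first document say?",
    "How do the two documents differ?",
    "Summarize the second document in one sentence.",
]

# CAG-style loop: the document KV entries were computed once (in kv_encode) and are reused here.
for q in questions:
    answer, kv_cache = generate_with_cache(q, kv_cache)
    kv_cache = reset_cache(kv_cache, doc_len)   # truncate back to the document-only prefix
    print(q, "->", answer)

# A prompt-stuffing baseline would rebuild the full documents-plus-question prompt for each call,
# so the cost of processing the documents is paid once per question instead of once overall.
```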
Cache-Augmented Generation (CAG) vs. Prompt Caching
Prompt Caching (If Used): Prompt caching, in this context, would typically mean storing the entire prompt (including the documents and the query) and its corresponding response. If a user asks the exact same query again, the cached response can be retrieved.
Re-processing at Inference: Even with prompt caching, if the query is slightly different, the LLM needs to process the entire document set again because the documents are part of the prompt itself. The LLM will have to perform the computationally heavy attention operations over the whole input every time.
Limited by Prompt Cache Hit Rate: The effectiveness of prompt caching heavily depends on the likelihood of exact query repetition. It's not very helpful for variations or paraphrases of the same question.
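A response-level prompt cache in this sense can be pictured as a lookup table keyed on the exact prompt string. The toy sketch below (purely illustrative, with a hypothetical generate_fn) shows why even a light paraphrase misses the cache and pays the full processing cost again.

```python
response_cache = {}  # maps the full prompt string to a previously generated answer

def cached_answer(prompt, generate_fn):
    # Hit only on an exact string match; any paraphrase or reordering is a miss.
    if prompt not in response_cache:
        response_cache[prompt] = generate_fn(prompt)  # full document re-processing happens here
    return response_cache[prompt]

# "Who wrote the report?" and "Who is the report's author?" ask the same thing,
# but they are different strings, so the second one misses the cache entirely.
```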
Why CAG Outperforms RAG: Performance Boost Explained
When the knowledge base is small enough to fit within the LLM's extended context window (like 128k tokens in their experiments), CAG shows clear advantages:
Speed: No real-time retrieval means responses are generated much faster since the KV cache is already in memory.
Consistency: Since all documents are preloaded, the model avoids mistakes where irrelevant or incorrect documents might be retrieved.
Simplicity: You remove the need for a separate retriever module, making the architecture cleaner and easier to maintain.
Experiments and Key Findings
The authors tested CAG on the SQuAD and HotPotQA datasets, two standard question-answering benchmarks. Both datasets have different demands—SQuAD focuses on single-context comprehension, while HotPotQA requires multi-hop reasoning across multiple documents. They created multiple configurations (small, medium, large) by varying the number of documents to simulate increasing complexity.
The baseline RAG systems used two types of retrieval:
Sparse Retrieval (BM25): Ranks documents based on keyword matching and frequency.
Dense Retrieval (OpenAI Embeddings): Uses vector representations for semantic similarity matching.
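For readers unfamiliar with the baselines, here is a minimal sparse-retrieval sketch using the rank_bm25 package; the package choice and the toy corpus are my assumptions, not necessarily the paper's setup. Dense retrieval would instead embed the query and documents as vectors and rank them by similarity.

```python
from rank_bm25 import BM25Okapi   # pip install rank-bm25

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "The Great Wall of China stretches across northern China.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "where is the eiffel tower".split()
scores = bm25.get_scores(query)            # keyword- and frequency-based relevance scores
top_doc = corpus[int(scores.argmax())]     # the passage a RAG pipeline would hand to the generator
print(top_doc)
```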
In contrast, CAG directly loaded all the documents into the LLM’s context. The results showed that CAG had higher BERTScores in most test cases—especially in tasks requiring reasoning over multiple documents—since it wasn’t limited by the retrieval step's accuracy.
Performance Metrics: Generation Time and BERTScore
The most impressive gains were in generation time. With CAG, generation time was reduced drastically because there was no need to fetch and encode documents on the fly:
For HotPotQA (large configuration), CAG took 2.32 seconds compared to 94.34 seconds for a RAG system.
For SQuAD (medium configuration), CAG generated responses in 1.73 seconds versus 13.35 seconds with RAG.
By preloading everything, the model processes the query both faster and more holistically. Instead of stitching together results from scattered documents retrieved mid-inference, CAG forms a unified picture from the start.
Conclusion: When Simpler is Smarter
CAG shines when your knowledge base is well-defined and manageable. It avoids the complexity and pitfalls of retrieval errors by shifting the heavy lifting to the preloading phase. This approach fits particularly well for tasks like customer support, medical FAQs, or document-based Q&A—essentially any scenario where the knowledge is structured and doesn’t need constant updating.
By leveraging long-context LLMs and KV caching, CAG simplifies the entire workflow without sacrificing performance. It’s a practical alternative for systems where retrieval isn’t a must but fast, contextually-rich generation is essential.
So overall, CAG is particularly advantageous when:
The knowledge base is relatively static (doesn't change very frequently).
The knowledge base is large but still manageable to encode into a KV cache.
Fast response times are crucial.
Users are likely to ask a variety of questions related to the same underlying knowledge.