SCBench, proposed in this paper, reveals how LLMs actually perform when sharing context across multiple real-world requests
KV cache reuse patterns expose the true efficiency limits of long-context LLM methods
SCBench introduces a comprehensive benchmark for evaluating long-context methods with KV cache reuse across multiple domains, addressing real-world application scenarios often overlooked in existing evaluations.
-----
https://arxiv.org/abs/2412.10319
🤔 Original Problem:
Existing benchmarks evaluate LLMs only in single-request scenarios, ignoring how the KV cache is reused across multiple requests in real applications. This creates a gap between benchmark performance and actual deployment effectiveness.
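A back-of-envelope sketch of why this matters (illustrative numbers, not from the paper): re-encoding a 100K-token context for every request dominates serving cost, so real deployments prefill once and reuse the cache, which single-request benchmarks never exercise.

```python
# Illustrative cost comparison; token counts are made up for the example.
CTX_TOKENS = 100_000      # shared long context
QUERY_TOKENS = 50         # each follow-up request
REQUESTS = 5

no_reuse = REQUESTS * (CTX_TOKENS + QUERY_TOKENS)      # re-prefill every time
with_reuse = CTX_TOKENS + REQUESTS * QUERY_TOKENS      # prefill once, then decode

print(f"tokens encoded without cache reuse: {no_reuse:,}")   # 500,250
print(f"tokens encoded with cache reuse:    {with_reuse:,}") # 100,250
```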
-----
🔧 Solution in this Paper:
→ SCBench evaluates long-context methods through a KV cache-centric framework with 4 stages: generation, compression, retrieval, and loading
→ Tests span 12 tasks covering string retrieval, semantic retrieval, global information processing, and multi-tasking capabilities
→ Implements two shared-context modes: multi-turn for single-session caching and multi-request for cross-session caching (see the sketch after this list)
→ Evaluates 13 methods across 8 categories on 8 state-of-the-art LLMs
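A minimal Python sketch of the two shared-context modes; the function and variable names are illustrative stand-ins, not SCBench's code. Both modes reuse one prefilled KV cache over the shared context; they differ in whether reuse stays inside a single session.

```python
def prefill(context: str) -> dict:
    # Stand-in for building the attention KV tensors over the shared context.
    return {"ctx_tokens": len(context.split())}

def decode(kv_cache: dict, query: str) -> str:
    # Stand-in for generating an answer conditioned on the cached context.
    return f"answer('{query}') over a {kv_cache['ctx_tokens']}-token cache"

shared_context = "tok " * 120_000            # proxy for a ~120K-token document

# Mode 1: multi-turn -- one session prefills the context once, then every
# later turn in the same conversation reuses that cache.
cache = prefill(shared_context)
turn_answers = [decode(cache, q) for q in ["turn 1", "turn 2", "turn 3"]]

# Mode 2: multi-request -- the cache is stored and re-loaded by independent
# requests (e.g. different users querying the same document).
cache_store = {"doc-1": prefill(shared_context)}
request_answers = [decode(cache_store["doc-1"], q) for q in ["request A", "request B"]]

print(turn_answers[0])
print(request_answers[0])
```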
-----
💡 Key Insights:
→ Sub-O(n) memory methods perform well in single-turn settings but fail in multi-turn scenarios (toy illustration after this list)
→ Sparse encoding with O(n) memory shows robust performance across multiple requests
→ Dynamic sparsity produces more expressive KV caches than static patterns
→ Layer-level sparsity in hybrid architectures reduces memory while maintaining performance
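To make the first insight concrete, here is a toy illustration, not any particular method from the paper: when a sub-O(n) method evicts KV entries based on relevance to the current query, it can discard exactly the evidence a later turn needs.

```python
# Token position -> content; a stand-in for the KV entries of a long context.
context = {i: f"fact-{i}" for i in range(10_000)}

def compress(cache: dict, relevant_now: set, budget: int) -> dict:
    # Keep a fixed budget of entries, ranked by relevance to the *current* query.
    keep = sorted(cache, key=lambda i: (i not in relevant_now, i))[:budget]
    return {i: cache[i] for i in keep}

turn1_needs, turn2_needs = {42, 777}, {5_321, 9_004}

# The cache is compressed while answering turn 1, keeping 256 of 10,000 entries.
compressed = compress(context, relevant_now=turn1_needs, budget=256)

print(all(i in compressed for i in turn1_needs))   # True  -> turn 1 is fine
print(all(i in compressed for i in turn2_needs))   # False -> turn 2's evidence is gone
```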
-----
📊 Results:
→ Methods with O(n) memory cost show improving performance as requests increase
→ Sub-O(n) KV cache methods perform well only in the first request
→ All methods show some loss in retrieval capability while maintaining global information processing
------
Are you into AI and LLMs❓ Join my daily AI newsletter. I will send you 7 emails a week analyzing the highest signal AI developments. ↓↓
🎉 https://rohanpaul.substack.com/