"Marconi: Prefix Caching for the Era of Hybrid LLMs"

The podcast on this paper is generated with Google's Illuminate.

Marconi speeds up Hybrid LLM serving by caching computation states based on how much reuse value they actually offer: the compute they save relative to the memory they occupy.

Marconi introduces the first prefix caching system for Hybrid LLMs, which combine Attention and State Space Model (SSM) layers. It tackles the challenge of SSM layers' in-place state updates while maximizing cache efficiency through judicious admission and eviction policies.

-----

https://arxiv.org/abs/2411.19379

🔍 Original Problem:

Hybrid LLMs use SSM layers alongside Attention for efficient long-context processing. However, an SSM state is a fixed-size summary of the entire prefix, updated in place and impossible to roll back, so a cached state can only serve requests whose prefix matches it exactly; traditional prefix caching, which relies on truncating per-token KV entries, breaks down. Checkpointing SSM states at fine granularity to compensate creates many large cache entries with limited reuse, leading to cache thrashing.
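To make the contrast concrete, here is a toy sketch (illustrative only, not the paper's code): the scalar recurrence stands in for an SSM layer, and the `kv_cache`/`ssm_state` helpers are made-up names used purely for illustration.

```python
# Toy illustration (not Marconi's code): per-token KV entries can be sliced to
# reuse a shorter prefix, but an in-place SSM state cannot be rolled back.

def kv_cache(tokens):
    # Attention: one (K, V) entry per token (dummy values here).
    return [(float(t), float(2 * t)) for t in tokens]

def ssm_state(tokens, a=0.9, b=0.1):
    # Stand-in for an SSM layer: a single fixed-size state h updated in place,
    # h_t = a * h_{t-1} + b * x_t, summarizing the entire prefix.
    h = 0.0
    for t in tokens:
        h = a * h + b * t
    return h

full, shorter = [1, 2, 3, 4, 5], [1, 2, 3]

# KV reuse: truncating the longer cache reproduces the shorter prefix's cache.
assert kv_cache(full)[: len(shorter)] == kv_cache(shorter)

# SSM: the state after 5 tokens tells you nothing recoverable about the state
# after 3 tokens, so only exact-length prefix matches can reuse it.
assert ssm_state(full) != ssm_state(shorter)
```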

-----

⚡ Solution in this Paper:

→ Marconi implements a radix tree structure to track common prefixes across requests and identify high-value caching opportunities (see the sketch after this list)

→ For input-only prefixes, it speculatively inserts the token sequence into the tree to gauge potential reuse before committing memory to their SSM states

→ For input-output sequences, it selectively caches states at the last decoded token, since that is where conversations typically resume

→ It introduces a FLOP-aware eviction policy that considers both recency and compute-to-memory ratios when selecting entries to remove
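
Below is a minimal sketch of the radix-tree bookkeeping described above (illustrative only; `PrefixTree`, `Node`, and the `has_state` flag are assumed names, and Marconi's actual node splitting, speculative admission, and state management are more involved). The point it shows: when two requests diverge, the split creates a branch node for their shared prefix, which is exactly the kind of location where caching SSM and KV states pays off.

```python
# Minimal radix-tree sketch (illustrative; not Marconi's implementation).
from dataclasses import dataclass, field

@dataclass
class Node:
    edge: tuple = ()                               # token span on the edge into this node
    children: dict = field(default_factory=dict)   # first token of child edge -> child Node
    has_state: bool = False                        # True once SSM + KV states are checkpointed here

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Insert a token sequence, splitting edges where requests diverge.
        Returns the node at which the sequence ends (a checkpoint candidate)."""
        node, i = self.root, 0
        tokens = tuple(tokens)
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                leaf = Node(edge=tokens[i:])
                node.children[tokens[i]] = leaf
                return leaf
            # Match as much of the child's edge as possible.
            j = 0
            while j < len(child.edge) and i + j < len(tokens) and child.edge[j] == tokens[i + j]:
                j += 1
            if j < len(child.edge):
                # Split the edge: the shared span becomes a branch node, which is
                # exactly the kind of high-value location worth checkpointing.
                branch = Node(edge=child.edge[:j], children={child.edge[j]: child})
                child.edge = child.edge[j:]
                node.children[tokens[i]] = branch
                child = branch
            node, i = child, i + j
        return node

tree = PrefixTree()
tree.insert((1, 2, 3, 4))          # first request: speculative insert, no states cached yet
tree.insert((1, 2, 3, 9))          # second request diverges after (1, 2, 3) -> branch node created
shared = tree.insert((1, 2, 3))    # lands exactly on the shared-prefix branch node
shared.has_state = True            # reuse observed: now worth caching SSM + KV states here
```

In a fuller version, a node flagged with `has_state` would also hold references to the cached SSM states and KV blocks for its prefix.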

-----

💡 Key Insights:

→ SSM states must be cached holistically with KV cache entries to enable prefix reuse

→ Traditional recency-based caching is insufficient for Hybrid LLMs due to SSM state properties

→ Compute savings per unit of memory footprint is a better eviction signal than recency alone (see the scoring sketch below)
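
Here is a hedged sketch of what such a FLOP-aware eviction score could look like (the blend weight `alpha`, the recency `horizon`, the normalization, and the `CacheEntry`/`pick_victim` names are all assumptions; the paper's exact formula may differ): blend how recently an entry was hit with the FLOPs a hit would save per byte of cached state, and evict the entry with the lowest combined utility.

```python
# Hedged sketch of a FLOP-aware eviction score (alpha, horizon, and the
# normalization are assumptions; the paper's exact formula may differ).
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    last_access: float    # timestamp from time.monotonic()
    flops_saved: float    # estimated FLOPs a cache hit would avoid recomputing
    bytes_used: int       # size of the cached SSM states + KV entries

def pick_victim(entries, alpha=0.5, horizon=300.0, now=None):
    """Evict the entry with the lowest blend of recency and FLOPs-saved-per-byte."""
    now = time.monotonic() if now is None else now
    # Normalize both signals to [0, 1] over the candidates so neither dominates.
    recency = [max(0.0, 1.0 - (now - e.last_access) / horizon) for e in entries]
    efficiency = [e.flops_saved / max(e.bytes_used, 1) for e in entries]
    max_eff = max(efficiency) or 1.0
    scores = [alpha * r + (1 - alpha) * (eff / max_eff)
              for r, eff in zip(recency, efficiency)]
    return entries[scores.index(min(scores))]   # lowest combined utility goes first

candidates = [
    CacheEntry(last_access=time.monotonic() - 10, flops_saved=5e9, bytes_used=2_000_000),
    CacheEntry(last_access=time.monotonic() - 200, flops_saved=5e12, bytes_used=2_000_000),
]
# The second entry is older but saves far more compute per byte, so the
# recently used but low-value first entry is the one evicted here.
print(pick_victim(candidates) is candidates[0])   # True under these toy numbers
```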

-----

📊 Results:

→ Up to 34.4x higher token hit rates (71.1% in absolute terms) compared to state-of-the-art prefix caching systems

→ Up to 617 ms lower time-to-first-token (TTFT) latency

→ Gains grow with longer contexts and higher ratios of SSM layers to Attention layers
