
"InstCache: A Predictive Cache for LLM Serving"

The podcast on this paper is generated with Google's Illuminate.

A cache that knows your next question before you do

InstCache predicts what users will ask before they ask it

The paper introduces a predictive caching system that pre-populates likely user instructions, together with their answers, using an instruction-aligned LLM, achieving a roughly 2x serving speedup with a memory footprint of only a few gigabytes.

-----

https://arxiv.org/abs/2411.13820

🤖 Original Problem:

→ LLM serving systems face massive computational demands, leading to high latency and energy consumption

→ Existing solutions like Key-Value Cache and Semantic Cache have limited scalability due to high memory costs

-----

🔍 Key Insights:

→ Most user instructions to LLMs are short and repetitive

→ User instructions can be predicted well by an instruction-aligned LLM

→ A negative log-likelihood threshold effectively controls both cache size and expected hit rate (formalized just below)
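
As a rough formalization (the threshold symbol τ and the set notation here are mine, not necessarily the paper's): the cache is the set of instructions whose token-level negative log-likelihood under the instruction-aligned LLM stays below a chosen bound.

```latex
% Negative log-likelihood of an instruction x = (x_1, ..., x_T)
% under the instruction-aligned LLM p_\theta
\mathrm{NLL}(x) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Pre-populate every instruction whose NLL stays below a threshold \tau;
% a smaller \tau gives a smaller cache, a larger \tau a higher expected hit rate
\mathcal{C}(\tau) = \{\, x : \mathrm{NLL}(x) \le \tau \,\}
```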

-----

⚡ Solution in this Paper:

→ InstCache uses an instruction-aligned LLM to predict likely user instructions before they are ever issued.

→ During pre-population, candidates are organized in a tree whose root-to-leaf paths each represent an instruction together with its answer.

→ At deployment, the cache is converted into a hash table for O(1) lookup complexity.

→ Cache size and hit rate are controlled through a negative log-likelihood threshold (see the sketch after this list).
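
Below is a minimal sketch of how such a pre-populate-then-lookup pipeline could be wired up, with a toy next-token table standing in for the real LLM. Everything here (the toy vocabulary, the threshold value, the function names) is illustrative and not the paper's actual implementation.

```python
import math

# Toy stand-in for an instruction-aligned LLM: maps a token prefix to a
# next-token distribution. Prefixes missing from the table are treated as
# complete instructions. All names and numbers here are illustrative.
NEXT_TOKEN_PROBS = {
    (): {"What": 0.6, "How": 0.4},
    ("What",): {"is": 0.7, "are": 0.3},
    ("What", "is"): {"AI?": 0.8, "RAG?": 0.2},
    ("What", "are"): {"LLMs?": 1.0},
    ("How",): {"do": 1.0},
    ("How", "do"): {"transformers": 1.0},
    ("How", "do", "transformers"): {"work?": 1.0},
}

def expand(prefix, nll, threshold, found):
    """Depth-first expansion of the instruction tree, pruning any path whose
    cumulative negative log-likelihood exceeds the threshold."""
    dist = NEXT_TOKEN_PROBS.get(prefix)
    if dist is None:                       # leaf: a complete instruction
        found.append((" ".join(prefix), nll))
        return
    for token, prob in dist.items():
        child_nll = nll - math.log(prob)   # NLL accumulates along the path
        if child_nll <= threshold:         # prune unlikely continuations
            expand(prefix + (token,), child_nll, threshold, found)

def build_instcache(threshold, answer_fn):
    """Pre-populate the cache offline: enumerate likely instructions, generate
    an answer for each, and store them in a hash table for O(1) lookup."""
    found = []
    expand((), 0.0, threshold, found)
    return {instruction: answer_fn(instruction) for instruction, _ in found}

# answer_fn would normally call the serving LLM offline; a placeholder here.
cache = build_instcache(threshold=1.5, answer_fn=lambda q: f"<answer to: {q}>")

# O(1) lookup at serving time: a hit skips LLM inference entirely,
# a miss falls back to normal inference.
print(cache.get("What is AI?", "MISS -> run normal LLM inference"))
```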

-----

📊 Results:

→ Achieves a 51.34% hit rate on the LMSys dataset

→ Delivers a 2x speedup in serving performance

→ Requires only 4.5 GB of memory at deployment
