A cache that knows your next question before you do
InstCache predicts what users will ask before they ask it
The paper introduces a predictive caching system that pre-populates likely user instructions with an instruction-aligned LLM, achieving a 2x serving speedup with only a 4.5 GB memory footprint.
-----
https://arxiv.org/abs/2411.13820
🤖 Original Problem:
→ LLM serving systems face massive computational demands, leading to high latency and energy consumption
→ Existing solutions like Key-Value Cache and Semantic Cache have limited scalability due to high memory costs
-----
🔍 Key Insights:
→ Most user instructions to LLMs are short and repetitive
→ User instructions are highly predictable using LLMs
→ Negative log likelihood gives a practical knob for trading off cache size against hit rate (a minimal sketch follows)
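Here is a minimal sketch of how such an NLL cutoff could work, assuming an off-the-shelf instruction-tuned model loaded through Hugging Face transformers; the model name, threshold value, and function names are placeholders, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B-Instruct"  # placeholder instruction-aligned LM
NLL_THRESHOLD = 12.0                     # placeholder cutoff: raise it for a bigger cache

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def instruction_nll(text: str) -> float:
    """Total negative log likelihood of an instruction; lower means more likely."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean NLL over the shifted tokens
    return loss.item() * (ids.shape[1] - 1)  # scale back to a total over the sequence

def should_precache(text: str) -> bool:
    # Only likely (low-NLL) instructions are worth pre-answering and storing.
    return instruction_nll(text) <= NLL_THRESHOLD

print(should_precache("What is the capital of France?"))
```

Lowering the threshold keeps only the most predictable instructions (small cache, lower hit rate); raising it admits more instructions at the cost of memory.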
-----
⚡ Solution in this Paper:
→ InstCache uses an instruction-aligned LLM to predict likely user instructions before they are ever requested.
→ During pre-population it builds a tree in which each root-to-leaf path represents a predicted instruction and its pre-generated answer.
→ At deployment the tree is flattened into a hash table, giving O(1) lookup complexity per request.
→ Cache size and hit rate are controlled through a negative log likelihood threshold on each path (see the sketch after this list).
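A rough sketch of this pre-population and deployment flow is below. `expand_next_tokens` and `generate_answer` are hypothetical stand-ins for calls to the instruction-aligned LLM (here hard-coded to a toy two-instruction distribution), and the threshold is an assumed value; none of these names come from the paper.

```python
import math
from typing import Dict, List, Optional, Tuple

NLL_THRESHOLD = 5.0   # placeholder cutoff on a path's cumulative NLL
EOS = "</s>"          # end-of-instruction marker

def expand_next_tokens(prefix: Tuple[str, ...]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for the LM: next-token candidates with probabilities."""
    toy_lm = {
        (): [("What", 0.6), ("Tell", 0.4)],
        ("What",): [("is", 0.9)],
        ("What", "is"): [("AI?", 0.8)],
        ("What", "is", "AI?"): [(EOS, 1.0)],
        ("Tell",): [("me", 0.9)],
        ("Tell", "me"): [("a", 0.9)],
        ("Tell", "me", "a"): [("joke.", 0.9)],
        ("Tell", "me", "a", "joke."): [(EOS, 1.0)],
    }
    return toy_lm.get(prefix, [(EOS, 1.0)])

def generate_answer(instruction: str) -> str:
    """Hypothetical stand-in: run the LLM once, offline, to pre-compute the answer."""
    return f"<pre-generated answer for {instruction!r}>"

def prepopulate(prefix: Tuple[str, ...], nll: float, cache: Dict[str, str]) -> None:
    """Depth-first expansion of the instruction tree, pruning unlikely branches."""
    for token, prob in expand_next_tokens(prefix):
        child_nll = nll - math.log(prob)          # NLL accumulates along the path
        if child_nll > NLL_THRESHOLD:
            continue                              # branch too unlikely: prune it
        if token == EOS:                          # a complete path is one instruction
            instruction = " ".join(prefix)
            cache[instruction] = generate_answer(instruction)
        else:
            prepopulate(prefix + (token,), child_nll, cache)

# Offline: build the tree and pre-answer every kept instruction.
cache: Dict[str, str] = {}
prepopulate((), 0.0, cache)

# Online: the flattened dict gives O(1) average-case lookup per request.
def serve(instruction: str) -> Optional[str]:
    return cache.get(instruction)                 # hit -> cached answer, miss -> None

print(serve("What is AI?"))       # hit
print(serve("Translate this."))   # miss -> None
```

On a cache miss the request would simply fall through to the normal LLM serving path, so the hash table only ever shortcuts work it has already done offline.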
-----
📊 Results:
→ Achieves a 51.34% hit rate on the LMSys dataset
→ Delivers a 2x speedup in serving performance
→ Needs only about 4.5 GB of memory at deployment