Smart token caching helps LLMs speak JSON and SQL without breaking a sweat.
XGrammar introduces a high-performance engine for structured LLM outputs. It divides vocabulary tokens into context-independent and context-dependent categories, enabling efficient constrained decoding through adaptive caching and a persistent execution stack, and achieves up to 100x lower per-token latency than existing solutions.
-----
https://arxiv.org/abs/2411.15100
🤖 Original Problem:
→ LLM applications need structured outputs such as JSON, SQL, and function calls, but existing context-free-grammar-based constrained decoding is slow: the engine must track pushdown stack state and check validity across the entire vocabulary at every decoding step.
-----
🛠️ Solution in this Paper:
→ XGrammar splits vocabulary tokens into context-independent ones (prechecked offline and cached as token masks) and context-dependent ones (checked at runtime against the grammar stack); a minimal sketch of this split follows this list.
→ It uses an adaptive token mask cache with smart storage formats to minimize memory usage.
→ A persistent execution stack enables fast state branching and rollback for efficient token validation.
→ The system overlaps grammar computations with GPU operations to minimize overhead.
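Here is a minimal Python sketch of the split described above, under stated assumptions: the class/function names, state labels, and placeholder masks are made up for illustration and are not the real XGrammar API. The point is only that most of the mask comes from a precomputed cache and just a small set of tokens is checked against the stack at runtime.

```python
# Hypothetical sketch of splitting token checks into a cached,
# context-independent part and a small context-dependent part.
# Names and masks are illustrative, not the actual XGrammar API.
import numpy as np

VOCAB_SIZE = 32

# Precomputed offline per grammar state: tokens whose validity does not
# depend on the runtime stack (context-independent). Placeholder values.
PRECOMPUTED_MASKS = {
    "in_string": np.ones(VOCAB_SIZE, dtype=bool),
    "expect_value": np.zeros(VOCAB_SIZE, dtype=bool),
}
# Tokens whose validity depends on the current stack contents (hypothetical IDs).
CONTEXT_DEPENDENT = {"in_string": [3, 7], "expect_value": [5]}


def accepts(token_id: int, stack: list[str]) -> bool:
    # Placeholder for a pushdown-automaton check of a single token.
    return bool(stack)  # e.g. only valid while some bracket is still open


def token_mask(state: str, stack: list[str]) -> np.ndarray:
    """Full mask = cached context-independent part + runtime checks."""
    mask = PRECOMPUTED_MASKS[state].copy()      # O(1) cache lookup
    for tok in CONTEXT_DEPENDENT[state]:        # small set checked at runtime
        mask[tok] = accepts(tok, stack)
    return mask


def constrained_step(logits: np.ndarray, state: str, stack: list[str]) -> int:
    """Mask out invalid tokens, then pick the best remaining one."""
    mask = token_mask(state, stack)
    masked = np.where(mask, logits, -np.inf)
    return int(np.argmax(masked))


# Usage: one constrained decoding step with dummy logits.
logits = np.random.randn(VOCAB_SIZE)
print(constrained_step(logits, "in_string", ["{", '"']))
```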
-----
💡 Key Insights:
→ Most tokens can be validated independently of stack state
→ Adaptive storage formats reduce the token mask cache to 0.2% of its original size (see the sketch after this list)
→ Context expansion reduces context-dependent tokens by 90%
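To make the adaptive-storage idea concrete, here is a toy Python sketch, assuming three candidate formats (accepted-index list, rejected-index list, dense bitmask) and rough byte costs; the function names and thresholds are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical illustration of adaptive mask storage: keep whichever
# representation is smallest for a given boolean token mask.
import numpy as np


def compress_mask(mask: np.ndarray) -> tuple[str, np.ndarray]:
    """Pick the cheapest storage format for a boolean token mask."""
    vocab = mask.size
    accepted = np.flatnonzero(mask)       # indices of allowed tokens
    rejected = np.flatnonzero(~mask)      # indices of forbidden tokens
    # Rough byte costs: 4 bytes per stored index vs. 1 bit per token.
    costs = {
        "accepted": 4 * accepted.size,
        "rejected": 4 * rejected.size,
        "bitmask": vocab // 8,
    }
    fmt = min(costs, key=costs.get)
    if fmt == "accepted":
        return fmt, accepted.astype(np.int32)
    if fmt == "rejected":
        return fmt, rejected.astype(np.int32)
    return fmt, np.packbits(mask)


def decompress_mask(fmt: str, data: np.ndarray, vocab: int) -> np.ndarray:
    """Expand a compressed mask back to a dense boolean array."""
    if fmt == "bitmask":
        return np.unpackbits(data, count=vocab).astype(bool)
    base = np.zeros(vocab, dtype=bool) if fmt == "accepted" else np.ones(vocab, dtype=bool)
    base[data] = fmt == "accepted"
    return base


# Usage: a mask allowing only a handful of tokens compresses to a short
# index list instead of a full per-vocabulary bitmask.
vocab = 128_000
mask = np.zeros(vocab, dtype=bool)
mask[[11, 42, 99]] = True
fmt, data = compress_mask(mask)
print(fmt, data.nbytes, "bytes vs", vocab // 8, "bytes for a bitmask")
assert np.array_equal(decompress_mask(fmt, data, vocab), mask)
```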
-----
📊 Results:
→ Up to 100x reduction in per-token latency compared to existing methods
→ 80x speedup in end-to-end LLM serving with structured output on an H100 GPU
→ Near-zero overhead for structured generation