Speculative decoding gets smarter by learning when to stop drafting tokens.
SVIP introduces a dynamic draft-length policy for speculative decoding that adapts to token-generation difficulty, replacing fixed-length approaches. It uses the draft model's entropy to decide when to stop drafting, achieving significant speedups without any additional training.
-----
https://arxiv.org/abs/2411.18462
Original Problem 🤔:
Current speculative decoding systems use fixed draft lengths, ignoring that some tokens (like stop words) are easier to predict than others (like reasoning-intensive tokens). This leads to inefficient processing and slower generation.
-----
Solution in this Paper 🛠️:
→ SVIP dynamically controls draft sequence lengths by analyzing draft model entropy after each token generation.
→ It derives a theoretical lower bound for acceptance probability using entropy information from the speculative decoding system.
→ The system approximates cross-entropy during inference using only the draft model's entropy, making it practical for real-time use.
→ When entropy exceeds a threshold, SVIP stops drafting and starts verification, optimizing the process for each token.
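The stopping rule above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the function names (`draft_step`, `draft_until_uncertain`) and the threshold value are hypothetical, and `draft_step` stands in for one forward pass of the draft model returning a sampled token and its next-token distribution.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def draft_until_uncertain(draft_step, max_draft_len=8, entropy_threshold=1.0):
    """Keep drafting tokens while the draft model is confident.

    Stop as soon as the draft distribution's entropy exceeds the
    threshold (a hard-to-predict token), then hand the drafted tokens
    to the target model for verification. Hypothetical sketch of an
    entropy-based dynamic draft-length policy like SVIP's.
    """
    tokens = []
    for _ in range(max_draft_len):
        token, probs = draft_step()
        tokens.append(token)
        if entropy(probs) > entropy_threshold:
            break  # uncertain draft step: stop drafting, verify now
    return tokens
```

With a confident (peaked) distribution the loop keeps drafting; a near-uniform distribution triggers an early stop, so easy stretches of text get long drafts and hard tokens get verified immediately.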
-----
Key Insights 💡:
→ Draft model entropy strongly correlates with token acceptance probability
→ Cross-entropy between target and draft models can be approximated by draft entropy alone
→ Dynamic length policies outperform fixed-length approaches significantly
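The second insight can be seen numerically: the cross-entropy H(p, q) between the target distribution p and draft distribution q depends on p, which is unavailable at draft time, but when the draft tracks the target it is close to the draft's own entropy H(q). The distributions below are made-up illustrative values, not from the paper.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def entropy(q):
    """H(q) = H(q, q): the draft model's own entropy."""
    return cross_entropy(q, q)

# Hypothetical target (p) and draft (q) next-token distributions that
# roughly agree, as is typical when the draft model tracks the target:
p = [0.70, 0.20, 0.05, 0.05]
q = [0.65, 0.25, 0.05, 0.05]

# The two values are close, so the draft entropy -- computable from the
# draft model alone -- is a usable inference-time proxy for H(p, q).
print(entropy(q), cross_entropy(p, q))
```

This is why the policy needs no target-model forward pass (and no training) to decide when to stop drafting.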
-----
Results 📊:
→ 20% speedup on SpecBench for Qwen2.5 14B and LLaMA-3 70B
→ 60% speedup on MTBench for long-form generation up to 8K tokens
→ 80-99% token acceptance rate across all domains