Smart compute allocation approach for LLM inference yields dramatic efficiency gains
https://arxiv.org/abs/2410.22480
🎯 Original Problem:
When sampling from LLMs at inference time, we must decide how to spend a limited compute budget across different sampling configurations (model, temperature, language). Current approaches search for a single best configuration, missing the fact that different problems are best solved by different configurations.
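Concretely (notation mine, not the paper's): if one sample from configuration c solves problem i with probability p_{i,c}, then an allocation n = (n_1, …, n_C) of the budget succeeds on problem i with probability 1 − ∏_c (1 − p_{i,c})^{n_c}, and the task is to choose n, subject to ∑_c n_c = B, to maximize the average of this quantity over problems.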
-----
🔧 Solution in this Paper:
The OSCA (Optimized Sample Compute Allocation) algorithm finds an optimal mix of sampling configurations (see the sketch after this list):
→ Estimates pass probabilities for each configuration-problem pair from a small pilot sample
→ Formulates compute allocation as a convex optimization problem
→ Uses hill-climbing to find the allocation that maximizes accuracy on training problems
→ Adjusts the allocation dynamically based on problem characteristics
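A minimal sketch of that loop (my own simplification in Python, not the authors' released code; `osca_allocate` and the greedy hill-climbing step are illustrative assumptions): given per-sample pass probabilities estimated from a small pilot run, repeatedly add one sample to whichever configuration most raises expected accuracy on the training problems.

```python
import numpy as np

def osca_allocate(pass_probs: np.ndarray, budget: int) -> np.ndarray:
    """Greedy hill-climbing over sample allocations (illustrative sketch).

    pass_probs : (n_problems, n_configs) per-sample pass probabilities,
                 estimated from a small pilot sample per configuration
    budget     : total number of samples to distribute across configs
    Returns an integer allocation, one count per configuration.
    """
    n_problems, n_configs = pass_probs.shape
    alloc = np.zeros(n_configs, dtype=int)
    # Work with log(1 - p) so failure probabilities accumulate additively.
    log_fail = np.log1p(-np.clip(pass_probs, 0.0, 1.0 - 1e-9))
    eye = np.eye(n_configs, dtype=int)
    for _ in range(budget):
        # Try adding one sample to each configuration; keep the best move.
        # Mean accuracy = mean over problems of 1 - prod_c (1 - p_ic)^n_c.
        scores = [np.mean(1.0 - np.exp(log_fail @ (alloc + eye[c])))
                  for c in range(n_configs)]
        alloc[int(np.argmax(scores))] += 1
    return alloc
```

Greedy coordinate ascent is only one simple way to hill-climb this objective; the paper's convex formulation admits other solvers as well.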
-----
💡 Key Insights:
→ A mixed allocation significantly outperforms any single-configuration approach (toy example below)
→ Different problems require different optimal configurations
→ Temperature and model diversity are crucial for optimal performance
→ Small training sets suffice: roughly 50 examples are enough to learn a good allocation
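A toy illustration of the first insight (numbers are hypothetical, chosen only to make the effect visible): if configuration A solves problem 1 with per-sample probability 0.9 but never solves problem 2, and B is the mirror image, then a 2-sample budget spent entirely on either configuration averages just under 0.50 accuracy, while a 1/1 mix averages 0.90.

```python
import numpy as np

# Per-sample pass probabilities: rows = problems, columns = configs A, B.
# Hypothetical numbers chosen to make the effect obvious.
p = np.array([[0.9, 0.0],
              [0.0, 0.9]])

def mean_accuracy(alloc):
    # Mean over problems of P(at least one allocated sample passes).
    return np.mean(1.0 - np.prod((1.0 - p) ** np.asarray(alloc), axis=1))

print(mean_accuracy([2, 0]))  # all A:  ~0.495
print(mean_accuracy([0, 2]))  # all B:  ~0.495
print(mean_accuracy([1, 1]))  # mixed:   0.90
```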
-----
📊 Results:
→ 128x less compute needed for the same accuracy on code generation tasks
→ 25x compute reduction on reasoning tasks while matching baseline performance
→ 3x compute efficiency improvement on agent workflows
→ Reaches 73.3% accuracy at pass@8, a level the baseline needs 512 samples to match