LLMs can reason quickly by using compressed thought patterns instead of full explanations
Chain-of-Thought decoding helps LLMs reason better but slows them down. This paper introduces Compressed Chain-of-Thought (CCoT), which compresses explicit reasoning chains into short sequences of dense contemplation tokens, recovering much of CoT's accuracy benefit at far lower latency.
-----
https://arxiv.org/abs/2412.13171
🤔 Original Problem:
→ Chain-of-Thought (CoT) improves reasoning but adds significant generation latency, taking up to 10x longer to generate answers
→ Current solutions using fixed-length contemplation tokens lack semantic meaning and interpretability
-----
🔧 Solution in this Paper:
→ CCoT generates variable-length, contentful contemplation tokens that compress explicit reasoning chains into dense hidden representations
→ Trains two LoRA modules (ranks 128 and 64): one generates the contemplation tokens, the other decodes answers from them
→ Selects the compressed subset of hidden states at layer 3 and generates contemplation tokens autoregressively from layer 15 (toy sketch after this list)
→ Implemented on LLAMA2-7B-CHAT as the base model
→ Because contemplation tokens are grounded in the hidden states of real reasoning chains, the reasoning can be inspected post hoc
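A minimal toy sketch of the generation loop, assuming a stand-in ToyLM for the base model; ccot_generate, the layer indices, and all sizes mirror the bullets above but are illustrative, not the authors' implementation:

```python
# Toy sketch of CCoT-style inference (illustrative, not the paper's code).
import math
import torch
import torch.nn as nn

D_MODEL, N_LAYERS = 64, 16          # toy sizes; LLaMA2-7B itself has 32 layers

class ToyLM(nn.Module):
    """Stand-in for a decoder-only LM that exposes per-layer hidden states."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(D_MODEL, D_MODEL) for _ in range(N_LAYERS))
    def forward(self, x):                        # x: [seq, d_model]
        states = [x]
        for layer in self.layers:
            x = torch.tanh(layer(x))
            states.append(x)
        return states                            # hidden states at every layer

def ccot_generate(ccot_lm, query_emb, n_cot_tokens, ratio=0.10, layer_gen=15):
    """Emit k = ceil(ratio * m) dense contemplation embeddings, each taken
    from an intermediate layer (layer_gen) instead of the vocabulary.
    (Training-time subset selection at layer 3 is omitted from this sketch.)"""
    k = math.ceil(ratio * n_cot_tokens)
    seq, contemplation = query_emb, []
    for _ in range(k):
        states = ccot_lm(seq)
        z = states[layer_gen][-1:]               # mid-layer state at last position
        contemplation.append(z)
        seq = torch.cat([seq, z], dim=0)         # feed it back in, like a token
    return torch.cat(contemplation, dim=0)       # [k, d_model]

# Compress a hypothetical 100-token reasoning chain into 10 dense tokens;
# a second LoRA module would then decode the answer from query + these.
lm = ToyLM()
query = torch.randn(8, D_MODEL)                  # 8 embedded query tokens
z = ccot_generate(lm, query, n_cot_tokens=100, ratio=0.10)
print(z.shape)                                   # torch.Size([10, 64])
```

The key departure from ordinary decoding is that the fed-back "token" is a hidden state rather than a vocabulary embedding, which is what lets each one carry a compressed chunk of the reasoning chain.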
-----
💡 Key Insights:
→ Contemplation tokens enhance computational width through parallel operations
→ Autoregressive decoding provides additional computational depth
→ In principle, a model with L layers can solve a task requiring serial depth D by decoding roughly D/L additional tokens (quick arithmetic below)
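A back-of-the-envelope check on that claim, with illustrative numbers (L = 32 matches LLaMA2-7B; the task depth D is hypothetical):

```python
import math

L = 32                    # transformer layers (LLaMA2-7B has 32)
D = 96                    # hypothetical serial depth the task requires
extra = math.ceil(D / L)  # D/L additional decoded tokens, per the claim above
print(extra)              # 3
```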
-----
📊 Results:
→ With compression ratio 0.10: 9-point accuracy gain with only 0.4s extra generation time
→ At ratio 0.05: 6-point improvement with just 0.15s additional time
→ More data-efficient than previous approaches, training on ~9,000 instances versus 400,000