Xmodel-2 shows that small models can reason like much larger ones through smart training and architecture choices
Xmodel-2 is a 1.2B-parameter model that achieves state-of-the-art performance on reasoning tasks among models of its size while keeping training efficient through careful architecture and data-optimization choices.
-----
https://arxiv.org/abs/2412.19638
🔧 Key techniques:
→ Xmodel-2 uses a deep-and-thin architecture: a 1536-dimensional hidden size, 24 attention heads, and 48 layers (see the config sketch after this list)
→ The model employs a custom Unigram tokenizer with a 65,280-token vocabulary instead of the more common BPE tokenizers
→ Training happens in two stages: a stable-training stage on 1.5 trillion tokens, followed by a decay stage that mixes pretraining data with supervised fine-tuning (SFT) data
→ Uses a Warmup-Stable-Decay (WSD) learning-rate scheduler, with the optimal SFT data ratio in the decay stage found to be 60-69% (a schedule sketch also follows this list)
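To make the shape concrete, here is a minimal config sketch of the deep-and-thin layout described above; only the numbers come from the post, and the class and field names are illustrative rather than taken from the released code.

```python
# Illustrative config for Xmodel-2's deep-and-thin shape.
# Only the numbers (vocab, hidden size, heads, layers) come from the post;
# the class and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class XModel2ConfigSketch:
    vocab_size: int = 65_280           # custom Unigram tokenizer vocabulary
    hidden_size: int = 1536            # narrow hidden dimension ("thin")
    num_layers: int = 48               # many transformer blocks ("deep")
    num_attention_heads: int = 24      # 1536 / 24 = 64-dim heads
    tie_word_embeddings: bool = True   # embedding sharing (see insights below)

cfg = XModel2ConfigSketch()
assert cfg.hidden_size % cfg.num_attention_heads == 0  # 64-dim heads
```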
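And a rough sketch of the Warmup-Stable-Decay schedule: the phase lengths and the linear decay shape here are assumptions for illustration; only the three-phase structure and the SFT mixing during decay come from the paper summary.

```python
# Warmup-Stable-Decay (WSD) learning-rate schedule, sketched with a linear
# warmup and linear decay; actual phase lengths and decay shape are assumed.
def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    if step < warmup_steps:                       # 1) warmup to peak LR
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:        # 2) long stable plateau
        return max_lr
    # 3) decay stage, where SFT data is mixed in at a ~60-69% ratio
    t = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return max_lr + (min_lr - max_lr) * t

# Example: LR held at its peak for most of training, then annealed at the end.
schedule = [wsd_lr(s, 1e-3, warmup_steps=1_000, stable_steps=8_000,
                   decay_steps=1_000) for s in range(10_000)]
```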
-----
💡 Key Insights:
→ Embedding sharing reduces the parameter count by roughly 0.1B while maintaining performance (see the arithmetic after this list)
→ Instruction-formatted datasets outperform pretraining-format data on complex reasoning tasks
→ Chain-of-Thought datasets enhance logical reasoning capabilities
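A quick back-of-the-envelope check of the embedding-sharing saving, using the sizes quoted above:

```python
# Tying the input embedding and output (LM head) matrices removes one
# vocab_size x hidden_size matrix; sizes are those quoted in the post.
vocab_size, hidden_size = 65_280, 1536
saved_params = vocab_size * hidden_size
print(f"{saved_params:,}")  # 100,270,080 ≈ 0.1B parameters saved
```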
-----
📊 Results:
→ Achieves SOTA performance among 1B-parameter models in reasoning tasks
→ 29.31% improvement on complex-reasoning tasks over the baseline
→ 40% accuracy on the FEVER fact-verification task
→ 13.70% exact-match score on HotpotQA