
"Xmodel-2 Technical Report"

A podcast on this paper was generated with Google's Illuminate.

Xmodel-2 proves small models can reason like the big ones through smart training and architecture choices

Xmodel-2 is a 1.2B-parameter model that achieves state-of-the-art performance on reasoning tasks while keeping training efficient through its architecture and data-optimization choices.

-----

https://arxiv.org/abs/2412.19638

🔧 Key Techniques:

→ Xmodel-2 uses a deep-and-thin architecture with a hidden size of 1536, 24 attention heads, and 48 layers (a config sketch follows this list)

→ The model employs a custom Unigram tokenizer with 65,280 tokens instead of common BPE tokenizers

→ Training happens in two stages: a stable stage on 1.5 trillion tokens, followed by a decay stage that mixes pretraining data with supervised fine-tuning (SFT) data

→ Uses a Warmup-Stable-Decay (WSD) learning rate scheduler; the optimal SFT data ratio in the decay stage falls between 60% and 69% (a schedule sketch follows this list)
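As a rough illustration, here is a minimal config sketch in the Hugging Face LlamaConfig style, assuming a LLaMA-like decoder (which the deep-and-thin description suggests). Only the vocabulary size, hidden size, layer count, head count, and tied embeddings come from this summary; the FFN width and KV-head count are placeholder assumptions, not values quoted in the report.

```python
# Hypothetical sketch of a deep-and-thin 1.2B decoder-only config.
# Values marked "assumed" are illustrative placeholders, not from the report.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=65_280,         # custom Unigram tokenizer vocabulary
    hidden_size=1536,          # "thin": small hidden dimension
    num_hidden_layers=48,      # "deep": many layers
    num_attention_heads=24,
    num_key_value_heads=8,     # assumed GQA setting, not quoted in this summary
    intermediate_size=3840,    # assumed FFN width, not quoted in this summary
    tie_word_embeddings=True,  # embedding sharing (saves ~0.1B parameters)
)

model = LlamaForCausalLM(config)
# Prints ~1.25B with the assumed widths; the paper's headline figure is 1.2B.
print(f"{sum(p.numel() for p in model.parameters())/1e9:.2f}B parameters")
```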

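For reference, a minimal sketch of a Warmup-Stable-Decay schedule as a plain Python function; the warmup fraction, decay length, and exponential decay shape are assumptions, since the report's exact hyperparameters are not quoted here.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr_ratio: float = 0.1) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, then a decay
    over the final phase (exponential here; the exact shape is assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:                       # warmup: 0 -> peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # stable plateau at peak
        return peak_lr
    # decay: peak -> min_lr_ratio * peak over the last decay_frac of training
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr * (min_lr_ratio ** progress)
```

The decay window is where Xmodel-2 blends SFT data into the pretraining stream, which is where the 60-69% ratio applies.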
-----

💡 Key Insights:

→ Embedding sharing reduces the parameter count by 0.1B while maintaining performance (see the arithmetic after this list)

→ Instruction-formatted datasets outperform pretraining-format data on complex reasoning (illustrative examples after this list)

→ Chain-of-Thought datasets enhance logical reasoning capabilities
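The 0.1B saving follows directly from the quoted dimensions: tying the input embedding matrix to the output projection stores one vocab_size × hidden_size matrix instead of two. A quick check:

```python
vocab_size, hidden_size = 65_280, 1536

embedding_matrix = vocab_size * hidden_size   # input embedding parameters
output_projection = vocab_size * hidden_size  # LM head has the same shape

saved = output_projection                     # tying stores the matrix once
print(f"Parameters saved by tying: {saved/1e9:.2f}B")  # ~0.10B
```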

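To make the format distinction concrete, here are illustrative examples of the three data styles being contrasted; the content and field names are hypothetical, not taken from Xmodel-2's corpora.

```python
# Plain pretraining-format text: raw prose with no task framing.
pretraining_format = "The Eiffel Tower, completed in 1889, is 330 metres tall."

# Instruction format: an explicit prompt/response pair.
instruction_format = {
    "instruction": "How tall is the Eiffel Tower?",
    "response": "The Eiffel Tower is 330 metres tall.",
}

# Chain-of-Thought format: the response shows intermediate reasoning steps.
chain_of_thought_format = {
    "instruction": "If the Eiffel Tower is 330 m and a 10 m antenna is removed, how tall is it?",
    "response": "The tower is 330 m. Removing the 10 m antenna gives 330 - 10 = 320 m. Answer: 320 m.",
}
```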
-----

📊 Results:

→ Achieves SOTA performance among 1B-parameter models in reasoning tasks

→ 29.31% improvement in complex reasoning compared to baseline

→ 40% accuracy on FEVER fact verification task

→ 13.70% exact match score on HotpotQA
