When an LLM is trained to compete against aligned models on its own training data, it learns to generate higher-quality responses.
PoFT (Preference-Oriented Supervised Fine-Tuning) strengthens supervised fine-tuning by making the target model compete against aligned LLMs on the same training data, improving performance and training stability even when dataset quality is limited.
-----
https://arxiv.org/abs/2412.12865v1
🤔 Original Problem:
→ Traditional supervised fine-tuning requires high-quality instruction-response pairs, but creating and maintaining such datasets is costly and labor-intensive
→ Current methods treat all training data equally, making them vulnerable to low-quality samples
-----
🛠️ Solution in this Paper:
→ PoFT introduces a preference-oriented approach where the target model competes against aligned LLMs on the same training data
→ Uses Bradley-Terry ranking objective to model preferences between models
→ The aligned LLMs' predicted likelihoods supply dynamic per-sample weights through a coefficient in the gradient updates (see the sketch after this list)
→ The objective encourages the target model to assign higher likelihood to each response than the aligned LLMs do, so samples the aligned LLMs rate as unlikely contribute less to the update
→ Remains compatible with existing data filtering methods and DPO
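A minimal PyTorch sketch of how such an objective could look, not the authors' implementation: it assumes the preference score of a model on a sample is the length-normalized log-likelihood of the response, that the aligned LLMs' scores are precomputed offline and averaged into `aligned_scores`, and that `beta` is a scaling hyperparameter; function names are illustrative.

```python
# Sketch of a PoFT-style Bradley-Terry objective (assumptions noted above).
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-likelihood of the response tokens."""
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = torch.gather(logprobs, -1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(-1) / mask.sum(-1)

def poft_loss(target_logits, labels, mask, aligned_scores, beta=1.0):
    """
    Bradley-Terry objective favoring the target model over aligned LLMs.
    aligned_scores: per-sample average of the aligned LLMs' length-normalized
    log-likelihoods on the same (instruction, response) pairs, computed offline.
    """
    target_scores = sequence_logprob(target_logits, labels, mask)
    # -log sigmoid(beta * (s_target - s_aligned)); its gradient equals the SFT
    # gradient scaled per sample by sigmoid(beta * (s_aligned - s_target)),
    # so samples the aligned LLMs rate as unlikely are down-weighted.
    return -F.logsigmoid(beta * (target_scores - aligned_scores)).mean()
```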
-----
💡 Key Insights:
→ Dynamic weighting of training samples based on the aligned LLMs' assessments improves training stability (toy illustration after this list)
→ More effective when the aligned LLMs' preference scores are spread out rather than tightly clustered
→ Compatible with existing SFT filtering methods for enhanced performance
→ Better than direct knowledge distillation from aligned LLMs
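To see why a spread of preference scores matters and how low-quality samples are damped, here is a toy illustration of the implied per-sample gradient weight sigmoid(beta * (s_aligned - s_target)), using made-up score values; if the aligned LLMs scored every sample the same relative to the target, every weight would be equal and the update would behave like ordinary SFT with a rescaled learning rate.

```python
# Illustrative only: hypothetical log-likelihood scores, not paper numbers.
import torch

beta = 1.0
target_scores = torch.tensor([-1.2, -1.2, -1.2])    # target model's scores
aligned_scores = torch.tensor([-0.5, -1.2, -3.0])   # aligned LLMs: high / similar / low quality
weights = torch.sigmoid(beta * (aligned_scores - target_scores))
print(weights)  # ~[0.67, 0.50, 0.14]: the low-quality sample gets a smaller update
```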
-----
📊 Results:
→ Outperformed SFT baselines across different datasets and base models
→ Achieved 1.58- and 2.17-point improvements on OpenHermes with Mistral-7B and Llama-3-8B, respectively
→ Showed particular strength on the GSM8K benchmark
→ Combined with DPO, achieved a 27.83% win rate on AlpacaEval