When an LLM is trained to compete against aligned models on its own training data, it learns to generate higher-quality responses.
PoFT (Preference-Oriented Supervised Fine-Tuning) strengthens supervised fine-tuning by making the target model compete against aligned LLMs on the same training data, improving performance and training stability even when dataset quality is limited.
-----
https://arxiv.org/abs/2412.12865v1
🤔 Original Problem:
→ Traditional supervised fine-tuning requires high-quality instruction-response pairs, but creating and maintaining such datasets is costly and labor-intensive
→ Current methods treat all training data equally, making them vulnerable to low-quality samples
-----
🛠️ Solution in this Paper:
→ PoFT introduces a preference-oriented approach where the target model competes against aligned LLMs on the same training data
→ Uses Bradley-Terry ranking objective to model preferences between models
→ The aligned LLMs' predicted likelihoods supply dynamic per-sample weights through a coefficient in the gradient updates (see the sketch after this list)
→ The objective encourages the target model to assign higher likelihood to each response than the aligned LLMs do, so samples the aligned LLMs rate as unlikely contribute less to the update
→ Remains compatible with existing data filtering methods and DPO
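A minimal PyTorch sketch of how such an objective could look, not the authors' implementation: it assumes the preference score of a model on a sample is the length-normalized log-likelihood of the response, that the aligned LLMs' scores are precomputed offline and averaged into `aligned_scores`, and that `beta` is a scaling hyperparameter; function names are illustrative.

```python
# Sketch of a PoFT-style Bradley-Terry objective (assumptions noted above).
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Length-normalized log-likelihood of the response tokens."""
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = torch.gather(logprobs, -1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_lp * mask).sum(-1) / mask.sum(-1)

def poft_loss(target_logits, labels, mask, aligned_scores, beta=1.0):
    """
    Bradley-Terry objective favoring the target model over aligned LLMs.
    aligned_scores: per-sample average of the aligned LLMs' length-normalized
    log-likelihoods on the same (instruction, response) pairs, computed offline.
    """
    target_scores = sequence_logprob(target_logits, labels, mask)
    # -log sigmoid(beta * (s_target - s_aligned)); its gradient equals the SFT
    # gradient scaled per sample by sigmoid(beta * (s_aligned - s_target)),
    # so samples the aligned LLMs rate as unlikely are down-weighted.
    return -F.logsigmoid(beta * (target_scores - aligned_scores)).mean()
```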
-----
💡 Key Insights:
→ Dynamic weighting of training samples based on the aligned LLMs' assessments improves training stability (toy illustration after this list)
→ More effective when the aligned LLMs' preference scores are spread out rather than tightly clustered
→ Compatible with existing SFT filtering methods for enhanced performance
→ Better than direct knowledge distillation from aligned LLMs
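To see why a spread of preference scores matters and how low-quality samples are damped, here is a toy illustration of the implied per-sample gradient weight sigmoid(beta * (s_aligned - s_target)), using made-up score values; if the aligned LLMs scored every sample the same relative to the target, every weight would be equal and the update would behave like ordinary SFT with a rescaled learning rate.

```python
# Illustrative only: hypothetical log-likelihood scores, not paper numbers.
import torch

beta = 1.0
target_scores = torch.tensor([-1.2, -1.2, -1.2])    # target model's scores
aligned_scores = torch.tensor([-0.5, -1.2, -3.0])   # aligned LLMs: high / similar / low quality
weights = torch.sigmoid(beta * (aligned_scores - target_scores))
print(weights)  # ~[0.67, 0.50, 0.14]: the low-quality sample gets a smaller update
```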
-----
📊 Results:
→ Outperformed SFT baselines across different datasets and base models
→ Achieved 1.58- and 2.17-point improvements on OpenHermes with Mistral-7B and Llama-3-8B, respectively
→ Showed particular strength on the GSM8K benchmark
→ Combined with DPO, achieved a 27.83% win rate on AlpacaEval