"BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment"

The podcast below on this paper was generated with Google's Illuminate.

Why train on everything when you can train on the right things? BPO (Balanced Preference Optimization) shows the way.

Balancing knowledge breadth and depth in LLM training improves alignment while using only 10% of the original training data.

-----

https://arxiv.org/abs/2411.10914

🤔 Original Problem:

→ Current LLM alignment datasets are heavily imbalanced: thousands of prompts but typically only 2 responses per prompt, so training covers broad knowledge (breadth) while providing little per-prompt preference signal (depth).

-----

🔧 Solution in this Paper:

→ Introduces Balanced Preference Optimization (BPO), a two-stage framework.

→ The first stage compresses knowledge breadth, using embedding-based clustering to select a representative subset of prompts (see the sketch after this list).

→ The second stage dynamically augments knowledge depth by generating multiple response pairs per retained prompt.

→ Uses gradient-based clustering to determine optimal knowledge depth per sample.

→ Allocates more response pairs to informative samples near cluster centers.
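
A minimal sketch of the breadth-compression stage, under my own assumptions: prompt embeddings come from any sentence encoder, k-means groups them, and the prompt nearest each centroid is kept as the cluster representative. The function name and the nearest-to-centroid selection rule are illustrative, not the authors' implementation.

```python
# Sketch of stage 1 (knowledge-breadth compression): cluster prompt embeddings
# and keep one representative prompt per cluster. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def compress_breadth(prompt_embeddings, prompts, k):
    """Keep k representative prompts: the one closest to each cluster centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(prompt_embeddings)
    kept = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(prompt_embeddings[members] - km.cluster_centers_[c], axis=1)
        kept.append(int(members[np.argmin(dists)]))
    return [prompts[i] for i in kept]

# Toy usage: compress 1,000 prompts (384-dim embeddings) down to 100 representatives.
# reps = compress_breadth(np.random.rand(1000, 384), [f"prompt {i}" for i in range(1000)], k=100)
```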

-----

💡 Key Insights:

→ Simple uniform balancing between breadth and depth improves performance

→ Dynamic depth allocation based on sample informativeness is crucial

→ Gradient-based depth estimation outperforms length-based approaches (a rough sketch of the allocation idea follows)
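
To make the dynamic depth allocation concrete, here is a hedged sketch of one plausible reading: per-sample gradient features are clustered, closeness to the cluster center is treated as informativeness, and the response-pair budget is split in proportion to that score. The feature choice, the 1/(1+distance) scoring, and the rounding rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of stage 2 (dynamic knowledge-depth allocation): cluster
# gradient features, score samples by closeness to their cluster center, and
# distribute the response-pair budget proportionally. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def allocate_depth(grad_features, total_pairs, n_clusters=8, min_pairs=1):
    """Return how many response pairs to generate for each retained prompt."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(grad_features)
    centers = km.cluster_centers_[km.labels_]              # each sample's own cluster center
    dist = np.linalg.norm(grad_features - centers, axis=1)
    informativeness = 1.0 / (1.0 + dist)                   # closer to center => more informative
    weights = informativeness / informativeness.sum()
    pairs = np.maximum(min_pairs, np.round(weights * total_pairs)).astype(int)
    return pairs

# Toy usage: split a budget of 2,000 pairs across 500 retained prompts (64-dim gradient features).
# depth = allocate_depth(np.random.rand(500, 64), total_pairs=2000)
```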

-----

📊 Results:

→ BPO achieves better MT-Bench and AlpacaEval scores while using only 10% of the original data

→ Shows robust performance across different model sizes and datasets

→ Outperforms baseline methods while maintaining training efficiency
