Smart feature selection helps train better-behaved AI models with less computation
Feature-level constrained Preference Optimization (FPO) uses sparse features to make LLM alignment both faster and more controllable
https://arxiv.org/abs/2411.07618
🤔 Original Problem:
→ Current LLM alignment methods face a trade-off between efficiency and controllability: methods like SimPO are computationally efficient but lack stability, while methods like TDPO offer better control but are expensive because they compute sequential, token-level KL divergences against a reference model.
-----
🛠️ Solution in this Paper:
→ The paper introduces Feature-level constrained Preference Optimization (FPO), which uses Sparse Autoencoders (SAEs) to generate sparse feature-level constraints.
→ FPO pre-computes and caches reference model outputs offline, reducing computational overhead while maintaining control quality.
→ Instead of token-level KL divergence, FPO applies feature-level MSE constraints on sparse SAE activations, where only a few dozen of the roughly 16,000 features are active at any time.
→ The method combines length normalization from SimPO with offline reference control and sparse feature-level constraints (see the sketch below).
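To make the mechanics concrete, here is a minimal PyTorch sketch of how these pieces fit together. The SAE encoder weights, helper names, hyperparameters (beta, gamma, alpha), layer index, and HuggingFace-style model call are assumptions for illustration, not the paper's released implementation.

```python
# Illustrative sketch of an FPO-style objective (assumptions noted above).
import torch
import torch.nn.functional as F


def sae_encode(hidden_states, W_enc, b_enc):
    # Sparse autoencoder encoder: ReLU yields sparse feature activations,
    # so only a few dozen of the ~16k features fire per token.
    return F.relu(hidden_states @ W_enc + b_enc)


@torch.no_grad()
def cache_reference_features(ref_model, batch, W_enc, b_enc, layer):
    # Offline step: run the frozen reference model once, encode the chosen
    # layer's hidden states with the SAE, and store the sparse activations.
    hidden = ref_model(**batch, output_hidden_states=True).hidden_states[layer]
    return sae_encode(hidden, W_enc, b_enc)


def fpo_loss(policy_logp_w, policy_logp_l, len_w, len_l,
             policy_feats, cached_ref_feats,
             beta=2.0, gamma=0.5, alpha=0.1):
    # SimPO-style length-normalized preference margin: no online reference
    # model is needed in the preference term.
    margin = beta * (policy_logp_w / len_w - policy_logp_l / len_l) - gamma
    pref_loss = -F.logsigmoid(margin).mean()

    # Feature-level constraint: MSE between the policy's sparse SAE features
    # and the cached reference features, replacing token-level KL divergence.
    feat_constraint = F.mse_loss(policy_feats, cached_ref_feats)

    return pref_loss + alpha * feat_constraint
```

Because the reference features are computed once and cached, training only needs forward passes through the policy model, which is where the efficiency gain over TDPO-style online, token-level KL control comes from.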
-----
💡 Key Insights:
→ Feature-level preferences enable fine-grained adjustment while minimizing side effects
→ Sparse representations enhance efficiency as only a few features are active at once
→ Pre-computing reference model outputs offline balances efficiency and stability
→ Feature-level control provides stronger generalization than token-level control
-----
📊 Results:
→ Achieves 5% absolute improvement in win rate compared to baselines
→ Reduces computational cost by 17.6% compared to TDPO2
→ Maintains competitive output diversity
→ Shows stable performance across different temperatures and SAE layers