
"Accelerating Direct Preference Optimization with Prefix Sharing"

The podcast on this paper was generated with Google's Illuminate.

Smart prompt reuse in preference training cuts computation time without compromising quality.

Prefix sharing eliminates redundant prompt processing in preference tuning, boosting training speed by up to 1.5×.

📚 https://arxiv.org/abs/2410.20305

🤖 Original Problem:

Training LLMs with preference data currently wastes computation by processing the shared prompt twice: once for the chosen response and once for the rejected response. This redundancy particularly impacts tasks with long prompts, such as summarization and mathematics.
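
A minimal sketch of the duplication, using made-up token-ID lists (the lengths and IDs below are purely illustrative, not from the paper):

```python
# Made-up token-ID lists; only the structure matters here.
prompt, chosen, rejected = [1, 2, 3, 4, 5], [10, 11], [20, 21, 22]

# Standard paired preference (e.g., DPO) batching: the prompt is encoded twice.
chosen_seq   = prompt + chosen      # forward pass over prompt + chosen
rejected_seq = prompt + rejected    # forward pass over prompt + rejected
tokens_standard = len(chosen_seq) + len(rejected_seq)   # 7 + 8 = 15

# Prefix sharing: one packed sequence, so the prompt is encoded only once
# (a custom attention mask keeps the two responses from attending to each other).
shared_seq = prompt + chosen + rejected
tokens_shared = len(shared_seq)                          # 10

print(tokens_standard, tokens_shared)
```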

-----

🔧 Solution in this Paper:

→ Introduces "prefix sharing", which processes the chosen and rejected responses as a single sequence with a shared prefix

→ Uses a custom block-sparse attention mask to prevent cross-response contamination (see the sketch after this list)

→ Leverages FlexAttention's block sparsity to skip computation of masked regions

→ Implements sequence packing for prefix-shared inputs to further boost efficiency
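
As a rough sketch of how such a block-sparse mask can be expressed, here is my own illustration using PyTorch's FlexAttention API (assuming a CUDA device, a recent PyTorch with `torch.nn.attention.flex_attention`, and a [prefix | chosen | rejected] packing; not the authors' exact code, and lengths are hypothetical):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Hypothetical lengths; the packed layout is [prefix | chosen | rejected].
prefix_len, chosen_len, rejected_len = 512, 256, 256
seq_len = prefix_len + chosen_len + rejected_len

def prefix_shared_mask(b, h, q_idx, kv_idx):
    # Causal attention within the packed sequence.
    causal = kv_idx <= q_idx
    # Block rejected-response queries from attending to chosen-response keys,
    # keeping the two completions independent given the shared prefix.
    # (Chosen queries can never see rejected keys anyway, since those come later.)
    q_in_rejected = q_idx >= prefix_len + chosen_len
    kv_in_chosen = (kv_idx >= prefix_len) & (kv_idx < prefix_len + chosen_len)
    return causal & ~(q_in_rejected & kv_in_chosen)

# FlexAttention tracks which blocks of the mask are entirely empty, so the
# masked-out chosen/rejected cross blocks can be skipped rather than computed.
block_mask = create_block_mask(prefix_shared_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")

# Dummy (batch, heads, seq_len, head_dim) tensors just to exercise the mask.
q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```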

-----

💡 Key Insights:

→ Most gains come from reducing the total number of training tokens rather than from attention-level optimizations

→ The performance boost increases with longer sequence lengths and higher prefix-to-completion ratios (see the back-of-the-envelope calculation after this list)

→ Sequence packing provides additional efficiency gains, especially for shorter sequences

→ The method is compatible with any paired preference-tuning approach
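
A back-of-the-envelope calculation of why the prefix-to-completion ratio matters (my own arithmetic on idealized token counts, not figures from the paper):

```python
# Idealized token-count reduction from prefix sharing, as a function of prefix
# length P and completion lengths Cc (chosen) and Cr (rejected).
def token_reduction(P, Cc, Cr):
    standard = 2 * P + Cc + Cr   # prompt processed twice
    shared = P + Cc + Cr         # prompt processed once
    return standard / shared

# Long prompt, short completions (e.g., summarization-style data):
print(round(token_reduction(P=2000, Cc=200, Cr=200), 2))   # 1.83
# Short prompt, long completions:
print(round(token_reduction(P=200, Cc=2000, Cr=2000), 2))  # 1.05
```

The measured speedups reported below (1.1-1.5×) sit below the long-prefix end of these idealized ratios, presumably because not all training costs scale with token count.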

-----

📊 Results:

→ Achieves 1.1-1.5× speedup on DPO datasets without affecting convergence

→ Combined with sequence packing, delivers consistent 1.3-1.6× speedups

→ Shows 1.5× speedup for datasets with long prefixes and high prefix-to-completion ratios

→ Even datasets with low prefix-to-completion ratios see modest 1.1× improvements
