Smart prompt reuse in preference training cuts computation time without compromising quality.
Prefix sharing eliminates redundant prompt processing in preference tuning, boosting training speed by up to 1.5×.
📚 https://arxiv.org/abs/2410.20305
🤖 Original Problem:
Training LLMs on preference data currently wastes computation by processing each shared prompt twice: once for the chosen response and once for the rejected response. This redundancy is especially costly for tasks with long prompts, such as summarization and mathematics.
-----
🔧 Solution in this Paper:
→ Introduces "prefix sharing" that processes chosen and rejected responses as one sequence with a shared prefix
→ Uses custom block-sparse attention mask to prevent cross-response contamination
→ Leverages FlexAttention's block sparsity to skip computation of masked regions
→ Implements sequence packing for prefix-shared inputs to further boost efficiency
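A minimal sketch of how such a mask could be expressed with PyTorch's FlexAttention. The packed layout ([shared prefix | chosen | rejected]), the fixed example lengths, and the `prefix_sharing_mask` function are illustrative assumptions, not the paper's exact implementation; only the overall idea follows the paper: both completions attend to one shared prefix, while attention between the two completions is blocked.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Hypothetical packed layout for one example: [shared prefix | chosen | rejected].
# Real data would carry per-example boundaries; fixed lengths keep the sketch simple.
prefix_len, chosen_len, rejected_len = 512, 256, 256
seq_len = prefix_len + chosen_len + rejected_len
chosen_end = prefix_len + chosen_len

def prefix_sharing_mask(b, h, q_idx, kv_idx):
    # Standard causal constraint within the packed sequence.
    causal = q_idx >= kv_idx
    # Every token may look back at the shared prefix.
    kv_in_prefix = kv_idx < prefix_len
    # Chosen tokens stay within prefix + chosen.
    both_chosen = (q_idx < chosen_end) & (kv_idx < chosen_end)
    # Rejected tokens stay within prefix + rejected, so the two completions
    # never attend to each other (no cross-response contamination).
    both_rejected = (q_idx >= chosen_end) & (kv_idx >= chosen_end)
    return causal & (kv_in_prefix | both_chosen | both_rejected)

# FlexAttention materializes this as a block-sparse BlockMask and skips
# fully masked blocks, so the chosen/rejected cross-region is never computed.
block_mask = create_block_mask(
    prefix_sharing_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device="cuda"
)

# q, k, v: (batch, num_heads, seq_len, head_dim); compile to get the fused kernel.
q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda", dtype=torch.bfloat16)
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)
```

With this layout the prefix is encoded once per preference pair instead of twice, and the paired loss (e.g., DPO) would then be computed from the log-probabilities of the chosen and rejected segments of the same forward pass.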
-----
💡 Key Insights:
→ Most gains come from reducing total training tokens rather than from attention optimizations
→ The speedup grows with longer sequence lengths and higher prefix-to-completion ratios
→ Sequence packing provides additional efficiency gains, especially for shorter sequences
→ The method is compatible with any paired preference-tuning approach
-----
📊 Results:
→ Achieves 1.1-1.5× speedup on DPO datasets without affecting convergence
→ Combined with sequence packing, delivers consistent 1.3-1.6× speedups
→ Shows 1.5× speedup for datasets with long prefixes and high prefix-to-completion ratios
→ Even datasets with low prefix-to-completion ratios see modest 1.1× improvements