Alignment Vectors enable real-time LLM behavior control without retraining
📚 https://arxiv.org/abs/2410.19206
🎯 Original Problem:
Current LLM alignment methods require full retraining whenever preferences change and depend on reward models at inference time, making them resource-intensive and inflexible.
-----
🔧 Solution in this Paper:
• Introduces Alignment Vectors (AV) - the difference between an aligned model's parameters and its base model's parameters, encoding a preference dimension as a direction in weight space
• Enables dynamic behavior adjustment at inference time through simple linear operations on the weights (θ_new = θ_base + λ·AV); see the sketch after this list
• Tests three proficiency levels: Expert opinion (Exp), Generic response (Gen), Avoidance (Avd)
• Focuses on three domains: Medical, Legal, Financial
• Created 38k domain-specific queries using the PersonaHub dataset and the CreatePersona method
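A minimal sketch of extracting an AV and steering with it, assuming two checkpoints of the same architecture saved as PyTorch state dicts; the file paths, helper names, and λ value here are illustrative, not the paper's code:

```python
import torch

def alignment_vector(base_state, aligned_state):
    """AV = aligned parameters minus base parameters, tensor by tensor."""
    return {name: aligned_state[name] - base_state[name]
            for name in base_state}

def steer(base_state, av, lam):
    """theta_new = theta_base + lambda * AV, applied before inference."""
    return {name: base_state[name] + lam * av[name]
            for name in base_state}

# Hypothetical checkpoint paths; both must share the same architecture.
base = torch.load("base_model.pt")
aligned = torch.load("aligned_model.pt")

av = alignment_vector(base, aligned)
steered = steer(base, av, lam=0.5)  # lambda scales alignment strength
# model.load_state_dict(steered); then run inference as usual.
```

Because steering is just weight arithmetic, λ can be changed between requests with no retraining and no reward model in the loop.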
-----
💡 Key Insights:
• AVs are transferable across different fine-tuning stages of the same model
• Reduces inference cost by 50% compared to prompt engineering
• Enables multidomain diverse preference alignment 12x faster than retraining
• The basic approach works only across LLMs that share the same architecture
• Multidomain alignment requires a grid search over the per-domain λ coefficients (see the sketch below)
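A hedged sketch of that grid search, combining one AV per domain as θ_base + Σ_d λ_d·AV_d; the `evaluate` scoring callback, λ grid, and per-domain AV dict are placeholders, not the paper's implementation:

```python
import itertools

def combine(base_state, domain_avs, coeffs):
    """theta_base + sum over domains of lambda_d * AV_d."""
    merged = {name: tensor.clone() for name, tensor in base_state.items()}
    for domain, lam in coeffs.items():
        for name in merged:
            merged[name] += lam * domain_avs[domain][name]
    return merged

def grid_search(base_state, domain_avs, evaluate, grid=(0.0, 0.3, 0.5, 1.0)):
    """Score every lambda combination and keep the best merged model."""
    best_score, best_coeffs = float("-inf"), None
    for combo in itertools.product(grid, repeat=len(domain_avs)):
        coeffs = dict(zip(domain_avs, combo))  # e.g. {"medical": 0.5, ...}
        score = evaluate(combine(base_state, domain_avs, coeffs))
        if score > best_score:
            best_score, best_coeffs = score, coeffs
    return best_coeffs, best_score
```

Even an exhaustive search like this only re-merges weights, which is why the paper reports multidomain alignment far faster than retraining.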
-----
📊 Results:
• Achieves 93% safety preference accuracy at λ=1
• Medical domain: 95% expert accuracy at λ=0.5
• Financial domain: 85% expert accuracy at λ=0.3
• Legal domain: 100% expert accuracy at λ=0.3
• Human evaluation achieved Cohen's kappa score of 0.84