DIESEL guides LLMs away from harmful content by comparing semantic similarities during token generation
DIESEL is a lightweight inference guidance technique that filters undesired concepts from LLM responses without additional training. It reranks tokens based on their semantic similarity to predefined negative concepts, providing efficient safety control while maintaining response quality.
-----
https://arxiv.org/abs/2411.19038
🚨 Original Problem:
LLMs often generate responses misaligned with human values, leading to unsafe outputs. Existing solutions require expensive training or significantly increase inference time.
-----
🔧 Solution in this Paper:
→ DIESEL operates in three steps: candidate selection, latent space semantic similarity, and token reranking.
→ It uses Nucleus sampling to select potential token candidates efficiently.
→ A lightweight sentence embedding model compares generated responses with predefined negative concepts in latent space.
→ The final token selection balances the original token probabilities against the safety scores via a tunable weighting parameter.
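The three steps above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the embedding vectors, the max-over-concepts aggregation, and the linear blend with weight `lam` are assumptions standing in for the paper's exact formulation, and a real system would embed candidate continuations with a lightweight sentence-embedding model.

```python
import numpy as np

def nucleus_candidates(probs, p=0.9):
    """Step 1: keep the smallest set of top tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]                # tokens sorted by descending probability
    cum = np.cumsum(probs[order])
    k = np.searchsorted(cum, p) + 1                # first index where cumulative mass reaches p
    return order[:k]

def safety_scores(candidate_embs, negative_embs):
    """Step 2: score each candidate continuation against the negative-concept
    embeddings; higher similarity to any negative concept means less safe.
    (Max-over-concepts is an assumption for this sketch.)"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([[cos(c, n) for n in negative_embs] for c in candidate_embs])
    return 1.0 - sims.max(axis=1)                  # safety = 1 - max cosine similarity

def rerank(probs, candidates, safety, lam=0.5):
    """Step 3: blend original probability with safety via a tunable weight lam
    (a hypothetical linear blend; 0 = vanilla sampling, 1 = safety only)."""
    score = (1 - lam) * probs[candidates] + lam * safety
    return candidates[int(np.argmax(score))]

# Toy example: token 0 is most likely but semantically close to a negative concept.
probs = np.array([0.5, 0.3, 0.15, 0.05])
cands = nucleus_candidates(probs, p=0.9)           # candidate set {0, 1, 2}
neg = [np.array([1.0, 0.0])]                       # stand-in negative-concept embedding
cembs = [np.array([1.0, 0.0]),                     # unsafe: identical to the concept
         np.array([0.0, 1.0]),                     # safe: orthogonal to the concept
         np.array([0.7, 0.7])]
safety = safety_scores(cembs, neg)
chosen = rerank(probs, cands, safety, lam=0.8)     # with a strong safety weight,
                                                   # the safer token 1 wins over token 0
```

With `lam=0` the scheme reduces to ordinary nucleus sampling over the candidate set, which is how the tunable parameter trades off response quality against safety.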
-----
💡 Key Insights:
→ Simple text descriptions can effectively filter undesired concepts without specialized expertise
→ Token-level semantic filtering provides finer control than full response filtering
→ Lightweight models can effectively guide inference without compromising quality
-----
📊 Results:
→ Inference time is only 1.46x that of vanilla decoding
→ 82% reduction in unsafe responses compared to baseline
→ Truthfulness on benign prompts dips to 51% vs. the 60% baseline, a modest cost for the safety gains