Social AI gets smarter by learning from its conversational mistakes, segment by segment
SDPO (Segment-Level Direct Preference Optimization) optimizes LLM-based social agents by focusing on key conversation segments, improving goal-oriented dialogue by applying targeted preference optimization to those segments rather than to entire conversations.
-----
https://arxiv.org/abs/2501.01821
🤖 Original Problem:
Existing methods fall short at both extremes: turn-level DPO optimizes single responses in isolation and misses multi-turn dynamics, while session-level approaches treat every turn as equally important and introduce training noise. Both make it hard for LLMs to learn effective social interaction patterns.
-----
🔧 Solution in this Paper:
→ SDPO identifies specific erroneous turns in negative conversations using GPT-4 evaluation
→ It generates positive alternative conversations starting from those error points
→ The system selects key segments from both positive and negative conversations that directly impact goal achievement
→ SDPO then applies preference optimization only on these matched-length segments, eliminating training noise
→ The method uses a simplified loss function that directly compares segment probabilities without complex normalization (see the sketch below)
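The paper's exact objective isn't reproduced in this post, so the snippet below is only a minimal sketch of how a segment-level DPO-style loss could be computed, assuming per-token log-probabilities for the selected segments have already been gathered from the policy and a frozen reference model. The function name, tensor layout, and `beta` value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(
    pos_logps_policy: torch.Tensor,  # (B, T) token log-probs of positive segment under the policy
    neg_logps_policy: torch.Tensor,  # (B, T) token log-probs of negative segment under the policy
    pos_logps_ref: torch.Tensor,     # (B, T) same positive tokens under the frozen reference model
    neg_logps_ref: torch.Tensor,     # (B, T) same negative tokens under the frozen reference model
    pos_mask: torch.Tensor,          # (B, T) 1.0 for agent tokens inside the selected positive segment
    neg_mask: torch.Tensor,          # (B, T) 1.0 for agent tokens inside the selected negative segment
    beta: float = 0.1,               # illustrative preference-strength hyperparameter
) -> torch.Tensor:
    """DPO-style preference loss restricted to the selected conversation segments.

    Only tokens from the agent's turns inside the matched-length positive/negative
    segments contribute; all other turns are masked out, which is what removes the
    noise of optimizing over whole sessions.
    """
    # Sum log-probs over segment tokens only. No per-length normalization is applied,
    # since the compared segments are selected to cover matching spans of the dialogue.
    pos_policy = (pos_logps_policy * pos_mask).sum(-1)
    neg_policy = (neg_logps_policy * neg_mask).sum(-1)
    pos_ref = (pos_logps_ref * pos_mask).sum(-1)
    neg_ref = (neg_logps_ref * neg_mask).sum(-1)

    # Bradley-Terry style margin between positive and negative segments,
    # measured relative to the reference model as in standard DPO.
    margin = beta * ((pos_policy - pos_ref) - (neg_policy - neg_ref))
    return -F.logsigmoid(margin).mean()
```

In this sketch, the masks start at the identified error turn, so the positive and negative segments share the same preceding dialogue context and only the diverging turns drive the gradient.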
-----
💡 Key Insights:
→ Multi-turn optimization is crucial for social dialogue improvement
→ Focusing on specific conversation segments reduces training noise
→ Equal-length segment comparison enables cleaner optimization
→ Using in-distribution samples (preference data generated by the agent model itself) leads to better performance
-----
📊 Results:
→ In self-chat evaluation, SDPO outperformed GPT-4 with a goal score of 8.56 vs 7.90
→ Achieved a relationship score of 3.69 vs GPT-4's 2.67
→ Delivered consistently superior performance across different base models (Llama, Mistral)