"Understanding Impact of Human Feedback via Influence Functions"

A podcast on this paper was generated with Google's Illuminate.

Influence functions reveal hidden biases in human feedback, making RLHF more reliable and trustworthy.

This paper applies influence functions to measure and improve human feedback quality in RLHF systems, enabling detection of biased feedback and guiding labelers toward better labeling strategies.

-----

https://arxiv.org/abs/2501.05790

🤔 Original Problem:

Human feedback in RLHF can be noisy, inconsistent, or biased, especially for complex tasks. This leads to misaligned reward signals and unintended model behaviors like sycophancy.

-----

🔧 Solution in this Paper:

→ The paper applies influence functions to quantify how individual feedback samples impact reward model performance (a minimal sketch follows this list)

→ They introduce a compute-efficient method using vector compression to make influence functions practical for LLM-scale models

→ The approach reduces the gradient vector size from 160 MB to 256 KB while preserving essential information

→ Their system can detect biased feedback and help non-expert labelers align better with expert strategies

→ The method uses validation sets to evaluate feedback quality and guide labeling improvements
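
To make this concrete, below is a minimal, illustrative sketch (not the paper's implementation) of scoring one feedback pair: the Bradley-Terry loss gradient for that pair is compressed with a random projection and compared against the compressed gradient of a small trusted validation set. The `RewardModel`, the projection size `k`, and all data here are toy assumptions, and the full Hessian-based influence function is simplified to a first-order gradient-alignment score.

```python
# Illustrative sketch (PyTorch): influence of one feedback pair on validation loss,
# approximated as a dot product of compressed gradients. All names/sizes are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardModel(nn.Module):
    """Toy stand-in for an LLM reward model: maps a response embedding to a scalar reward."""
    def __init__(self, dim=128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x):
        return self.scorer(x).squeeze(-1)

def bt_loss(model, chosen, rejected):
    """Bradley-Terry preference loss on (chosen, rejected) response embeddings."""
    return -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()

def compressed_grad(loss, model, proj):
    """Flatten the parameter gradient of `loss` and compress it with a random projection."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads]) @ proj

model = RewardModel()
n_params = sum(p.numel() for p in model.parameters())

# Random projection: shrink the gradient vector to a fixed small size, analogous in
# spirit to the paper's 160 MB -> 256 KB compression (dimensions here are toy).
k = 512
proj = torch.randn(n_params, k) / k ** 0.5

# Small "trusted" validation set of preference pairs (random toy embeddings).
val_chosen, val_rejected = torch.randn(50, 128), torch.randn(50, 128)
g_val = compressed_grad(bt_loss(model, val_chosen, val_rejected), model, proj)

# One human feedback sample to score.
chosen, rejected = torch.randn(1, 128), torch.randn(1, 128)
g_train = compressed_grad(bt_loss(model, chosen, rejected), model, proj)

# Positive alignment: a gradient step on this sample would also lower validation loss
# (helpful feedback); strongly negative values flag potentially biased feedback.
influence = torch.dot(g_train, g_val).item()
print(f"approximate influence score: {influence:.4f}")
```

A strongly negative score suggests a gradient step on that sample would push validation loss up, which is the kind of feedback such a scheme would flag as biased or low quality.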

-----

💡 Key Insights:

→ Influence functions can effectively detect both length bias and sycophancy bias in human feedback (a toy scoring sketch follows this list)

→ Small validation sets (50 samples) are sufficient for accurate bias detection

→ The method outperforms GPT-4 and other baselines at identifying problematic feedback

→ Expert feedback can guide non-expert labelers to improve their strategies
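
As a toy illustration of how such detection could be scored (synthetic data, not the paper's evaluation code): given one influence score per feedback sample and ground-truth flags for, say, length-biased labels, detection quality can be summarized with ROC AUC, and the most suspect samples routed back to labelers for review.

```python
# Toy scoring sketch: rank feedback samples by influence score and measure how well
# low scores identify biased samples, using ROC AUC. Synthetic data, illustrative only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000

# Ground-truth flags for which feedback samples carry, e.g., length bias (synthetic).
is_biased = rng.random(n) < 0.2
# Assume biased samples tend to receive lower (more negative) influence scores.
influence = rng.normal(loc=np.where(is_biased, -1.0, 0.5), scale=1.0)

# Lower influence means more suspect, so use the negated score for detection.
auc = roc_auc_score(is_biased, -influence)
print(f"length-bias detection AUC: {auc:.2f}")

# Simple policy: route the most suspect samples back to labelers for review.
suspect_idx = np.argsort(influence)[: n // 10]
print(f"flagged {len(suspect_idx)} samples for review")
```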

-----

📊 Results:

→ 2.5x faster computation compared to previous methods

→ Achieves 0.8 AUC for length bias detection

→ Outperforms GPT-4 by 5.3% and Gemini-1.5-Pro by 25.6% in length bias detection

→ Identifies 14.4% more biased samples than LLM-based detectors
