Small models can outperform giants by learning to ignore irrelevant noise
The paper introduces ICR2, a new benchmark that tests LLMs' ability to retrieve and reason with long contexts while handling confounding information, making evaluation more realistic.
-----
https://arxiv.org/abs/2501.08248
Original Problem 🔍:
→ Current benchmarks like LOFT overestimate LLMs' performance by using oversimplified contexts that lack challenging confounding information
→ LLMs struggle to retrieve and reason accurately in realistic settings where confounding passages appear relevant but are actually misleading
-----
Solution in this Paper 🛠️:
→ ICR2 benchmark uses strong retrievers to select challenging confounding passages, creating more realistic test conditions
→ Introduces retrieve-then-generate fine-tuning, where the model first identifies the relevant passages and then generates the answer (see the first sketch after this list)
→ Employs retrieval-attention-probing, which uses attention heads during decoding to filter noisy context (see the second sketch after this list)
→ Implements joint training of dedicated retrieval and generation heads
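
A minimal sketch of what a retrieve-then-generate training pair could look like: the target first cites the relevant passage IDs, then gives the answer, so the model learns an explicit retrieval step before generation. Function names, the prompt template, and the citation format here are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical retrieve-then-generate training pair construction.

def build_prompt(question, passages):
    """Lay out a confounder-heavy context with passage IDs the model can cite."""
    ctx = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    return f"{ctx}\n\nQuestion: {question}\n"

def build_target(gold_ids, answer):
    """Target text: cite the relevant passages first, then answer."""
    cited = ", ".join(f"[{i}]" for i in gold_ids)
    return f"Relevant passages: {cited}\nAnswer: {answer}"

# Example: one gold passage buried among confounders.
passages = [
    "Paris hosted the 1900 Summer Olympics.",        # confounder
    "The 2024 Summer Olympics were held in Paris.",  # gold
    "Los Angeles will host the 2028 Games.",         # confounder
]
prompt = build_prompt("Where were the 2024 Summer Olympics held?", passages)
target = build_target(gold_ids=[1], answer="Paris")
print(prompt + target)
```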
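
And a minimal sketch of the retrieval-attention-probing idea: score each passage by the attention mass the question tokens place on its span under a probed head, then keep only the top-scoring passages as the filtered context. The attention matrix here is a toy, and names like `passage_spans` and `top_k` are assumptions for illustration.

```python
import numpy as np

def probe_passages(attn, passage_spans, top_k=2):
    """Rank passages by attention mass from question tokens to each span,
    keeping the top-k as the filtered context."""
    scores = {pid: attn[:, start:end].sum() for pid, (start, end) in passage_spans.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Toy example: 4 question tokens attending over 12 context tokens,
# split into three 4-token passages.
rng = np.random.default_rng(0)
attn = rng.random((4, 12))
attn[:, 4:8] += 1.0  # pretend the probed head focuses on passage 1
passage_spans = {0: (0, 4), 1: (4, 8), 2: (8, 12)}

kept, scores = probe_passages(attn, passage_spans, top_k=1)
print("kept passages:", kept)  # -> [1]
```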
-----
Key Insights 💡:
→ LLMs are highly sensitive to confounding information in context
→ Explicit retrieval steps improve performance compared to end-to-end approaches
→ Attention heads can effectively identify relevant passages
-----
Results 📊:
→ Best approach (Mistral-7B): +17 points on LOFT, +13 points on ICR2 vs vanilla RAG
→ Outperforms GPT-4-Turbo despite being much smaller
→ Achieves 51% improvement in exact match rates compared to baseline