"Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models"

An accompanying podcast for this paper was generated with Google's Illuminate.

Small models can outperform giants by learning to ignore irrelevant noise

The paper introduces ICR2 (In-context Retrieval and Reasoning), a new benchmark that tests LLMs' ability to retrieve and reason over long contexts in the presence of confounding passages, making evaluation more realistic.

-----

https://arxiv.org/abs/2501.08248

Original Problem 🔍:

→ Current benchmarks like LOFT overestimate LLMs' long-context performance because their contexts are oversimplified and lack challenging confounding passages

→ LLMs struggle to retrieve and reason accurately in realistic scenarios where the context contains passages that look relevant but are misleading

-----

Solution in this Paper 🛠️:

→ ICR2 benchmark uses strong retrievers to select challenging confounding passages, creating more realistic test conditions
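
A minimal sketch of how such confounders can be mined: embed the corpus with an off-the-shelf dense retriever and keep the passages most similar to the query, excluding the gold ones. The retriever model and the `mine_confounders` helper are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: mine confounding passages for a query with a dense retriever.
# Assumption: sentence-transformers is installed; the model name and helper
# below are illustrative, not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def mine_confounders(query, corpus, gold_ids, k=8):
    """Return the top-k passages most similar to the query,
    excluding the gold (answer-bearing) passages."""
    q_emb = model.encode(query, convert_to_tensor=True)
    p_emb = model.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]            # shape: (len(corpus),)
    ranked = scores.argsort(descending=True).tolist()
    return [corpus[i] for i in ranked if i not in gold_ids][:k]

corpus = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",   # gold
    "Pierre Curie shared the 1903 Nobel Prize in Physics.",  # confounder
    "The Nobel Prize in Literature was awarded in 1903.",    # confounder
    "Photosynthesis converts light into chemical energy.",   # unrelated
]
print(mine_confounders("Who won the 1903 Nobel Prize in Physics?", corpus, gold_ids={0}))
```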

→ Introduces retrieve-then-generate fine-tuning, where the model first identifies the relevant passages and then generates the answer
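
A minimal sketch of what a retrieve-then-generate training example could look like: the target supervises the model to cite supporting passage IDs before producing the answer. The tags and field names are illustrative assumptions, not the paper's exact format.

```python
# Sketch: build a retrieve-then-generate fine-tuning example.
# The model learns to emit the IDs of supporting passages before the answer.
# The "Retrieved:" / "Answer:" tags are illustrative assumptions.

def build_example(question, passages, gold_ids, answer):
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "First list the passage IDs that support the answer, then answer.\n"
    )
    target = (
        "Retrieved: " + ", ".join(f"[{i}]" for i in gold_ids) + "\n"
        f"Answer: {answer}"
    )
    return {"prompt": prompt, "target": target}

example = build_example(
    question="Who won the 1903 Nobel Prize in Physics?",
    passages=[
        "Marie Curie won the Nobel Prize in Physics in 1903.",
        "The Nobel Prize in Literature was awarded in 1903.",
    ],
    gold_ids=[0],
    answer="Marie Curie",
)
print(example["prompt"] + example["target"])
```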

→ Employs retrieval-attention-probing to filter noisy contexts using attention heads during decoding
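
A rough sketch of the probing idea: score each context passage by the attention mass its tokens receive from the answer tokens in selected heads, then keep only the top-scoring passages. The attention values here are faked with random numbers; head selection and the aggregation are simplified assumptions.

```python
# Sketch: retrieval-attention-probing with a toy attention matrix.
# In practice the attentions would come from selected heads of the LLM
# during decoding; here they are faked with random values.
import numpy as np

rng = np.random.default_rng(0)

n_passages = 6
tokens_per_passage = 20
answer_tokens = 8
# Fake attention weights from decoded answer tokens to every context token,
# already averaged over the probing heads: (answer_tokens, context_tokens).
attn = rng.random((answer_tokens, n_passages * tokens_per_passage))

# Passage score = total attention mass its tokens receive from the answer tokens.
passage_scores = attn.reshape(answer_tokens, n_passages, tokens_per_passage).sum(axis=(0, 2))

# Keep only the top-k passages; the rest is treated as noise and dropped.
k = 2
keep = np.argsort(passage_scores)[::-1][:k]
print("Passage scores:", np.round(passage_scores, 2))
print("Passages kept after probing:", sorted(keep.tolist()))
```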

→ Implements joint training of dedicated retrieval and generation heads
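
A minimal PyTorch sketch of the joint-training idea: a retrieval head classifies the gold passage while the LM head predicts answer tokens, and the two cross-entropy losses are mixed. Dimensions, module names, and the mixing weight are assumptions, not the paper's exact configuration.

```python
# Sketch: joint loss for a retrieval head + generation (LM) head on top of a
# shared backbone. Shapes and the mixing weight `alpha` are illustrative.
import torch
import torch.nn as nn

hidden, vocab, n_passages, seq_len, batch = 64, 1000, 8, 16, 2

retrieval_head = nn.Linear(hidden, 1)   # scores each passage representation
lm_head = nn.Linear(hidden, vocab)      # predicts next tokens

# Stand-ins for backbone outputs: per-passage pooled states, per-token states.
passage_states = torch.randn(batch, n_passages, hidden)
token_states = torch.randn(batch, seq_len, hidden)

# Labels: index of the gold passage, and token ids of the answer.
gold_passage = torch.randint(0, n_passages, (batch,))
target_tokens = torch.randint(0, vocab, (batch, seq_len))

retrieval_logits = retrieval_head(passage_states).squeeze(-1)  # (batch, n_passages)
lm_logits = lm_head(token_states)                              # (batch, seq_len, vocab)

ce = nn.CrossEntropyLoss()
retrieval_loss = ce(retrieval_logits, gold_passage)
generation_loss = ce(lm_logits.reshape(-1, vocab), target_tokens.reshape(-1))

alpha = 0.5  # assumed mixing weight
loss = alpha * retrieval_loss + (1 - alpha) * generation_loss
loss.backward()
print(float(loss))
```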

-----

Key Insights 💡:

→ LLMs are highly sensitive to confounding information in context

→ Explicit retrieval steps improve performance compared to end-to-end approaches

→ Attention heads can effectively identify relevant passages

-----

Results 📊:

→ Best approach (Mistral-7B): +17 points on LOFT, +13 points on ICR2 vs vanilla RAG

→ Outperforms GPT-4-Turbo despite being much smaller

→ Achieves a 51% improvement in exact-match rate compared to the baseline
