Self-improving LLMs crack the long-context reasoning puzzle.
This paper introduces SEALONG, a method for LLMs to self-improve in long-context reasoning without relying on human experts or advanced models. It leverages multiple sampled outputs and consensus-based evaluation to create self-supervision for fine-tuning.
-----
https://arxiv.org/abs/2411.08147
🤔 Original Problem:
LLMs struggle with long-context reasoning despite strong retrieval capabilities over long inputs. Existing approaches rely on human experts or more advanced models to synthesize training data, which limits further progress.
-----
💡 Solution in this Paper:
→ SEALONG enables LLMs to self-improve in long-context reasoning through a two-stage process.
→ First, it samples multiple reasoning trajectories for each question and long context.
→ These outputs are then scored using Minimum Bayes Risk (MBR), prioritizing outputs that are semantically consistent with the others (see the sketch after this list).
→ The scoring method uses sentence embedding similarity to measure consistency between outputs.
→ Finally, SEALONG applies either supervised fine-tuning on high-scoring outputs or preference optimization using both high and low-scoring outputs.
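A minimal sketch of the consensus-scoring step under stated assumptions: the embedding model (all-MiniLM-L6-v2 via sentence-transformers) and the example outputs are illustrative stand-ins, not necessarily the paper's exact setup.

```python
# Sketch of SEALONG-style MBR consensus scoring (illustrative embedder choice;
# the paper's exact embedding model and hyperparameters may differ).
import numpy as np
from sentence_transformers import SentenceTransformer


def mbr_scores(outputs: list[str], embedder: SentenceTransformer) -> np.ndarray:
    """Score each sampled output by its mean cosine similarity to the other
    outputs -- a Minimum Bayes Risk style measure of semantic consensus."""
    emb = embedder.encode(outputs, normalize_embeddings=True)  # (n, d), unit-norm rows
    sim = emb @ emb.T                                          # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                                 # exclude self-similarity
    return sim.sum(axis=1) / (len(outputs) - 1)                # mean similarity to the rest


# `outputs` stands in for several reasoning trajectories sampled from the LLM
# for one (long context, question) pair, e.g. via temperature sampling.
outputs = [
    "Step 1: ... therefore the answer is 42.",
    "Reasoning along a different path ... the answer is 42.",
    "A divergent chain of thought ... the answer is 17.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
scores = mbr_scores(outputs, embedder)

chosen = outputs[int(scores.argmax())]    # highest consensus: SFT target / preferred response
rejected = outputs[int(scores.argmin())]  # lowest consensus: rejected side for preference optimization
```

The chosen/rejected split mirrors the two training options above: fine-tune directly on the highest-scoring outputs, or pair high- and low-scoring outputs for preference optimization.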
-----
🔑 Key Insights from this Paper:
→ LLMs have untapped potential in long-context reasoning, which can be revealed through refined prompting and sampling multiple outputs.
→ Consensus-based evaluation effectively identifies high-quality outputs without external supervision.
→ Self-improvement in long-context reasoning is possible without relying on human experts or advanced models.
-----
📊 Results:
→ SEALONG improved Llama-3.1-8B-Instruct's performance from 50.8 to 55.0 on long-context tasks.
→ Outperformed GPT-4o on some tasks (55.0 vs 54.4).
→ Demonstrated strong data efficiency, achieving competitive performance with only 1K examples.