
"NExtLong: Toward Effective Long-Context Training without Long Documents"

The accompanying podcast is generated with Google's Illuminate.

Hard negatives are the secret weapon NExtLong uses to make LLMs excel at ultra-long context tasks.

NExtLong introduces a novel training framework that synthesizes long-context data by interleaving hard negative distractors between dependent text chunks, significantly improving long-range dependency modeling in LLMs.

---

Paper - https://arxiv.org/abs/2501.12766

Original Problem πŸ˜•:

β†’ LLMs struggle with long-range dependency modeling due to the scarcity of naturally occurring long documents.

β†’ Existing long-context synthesis methods concatenate short texts but lack a mechanism to reinforce dependency learning.

β†’ Models get easily distracted by irrelevant content as context length increases.

Solution in this Paper πŸ‘¨β€πŸ”§:

β†’ NExtLong uses Negative Document Extension to enhance long-context training without relying on long documents.

β†’ A document is broken into meta-chunks, maintaining local semantic integrity.

β†’ Hard negative distractors are retrieved from a pretraining corpus and interleaved between meta-chunks (see the sketch after this list).

β†’ The model is forced to differentiate between meaningful long-range dependencies and misleading distractors.

β†’ This approach improves the model’s ability to capture long-distance contextual relationships.

β†’ Training is performed using a next-token prediction loss, which optimizes for better long-context understanding.
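A minimal Python sketch of this chunk-and-interleave synthesis, assuming a hypothetical `retrieve_hard_negatives` helper (e.g. backed by an embedding index over the pretraining corpus). It illustrates the idea described above, not the authors' released implementation.

```python
# Hypothetical sketch of NExtLong-style negative document extension.
# A short document is split into meta-chunks, hard-negative distractor
# chunks are retrieved from a pretraining corpus, and both are interleaved
# into one long synthetic training document.

from typing import Callable, List


def split_into_meta_chunks(document: str, chunk_size: int = 512) -> List[str]:
    """Split a document into fixed-size meta-chunks that keep local semantics."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def synthesize_long_document(
    document: str,
    retrieve_hard_negatives: Callable[[str, int], List[str]],
    negatives_per_chunk: int = 2,
    chunk_size: int = 512,
) -> str:
    """Interleave retrieved hard-negative chunks between consecutive meta-chunks.

    `retrieve_hard_negatives(query, k)` is assumed to return the k corpus
    chunks most similar to the query (excluding the source document itself).
    """
    meta_chunks = split_into_meta_chunks(document, chunk_size)
    extended: List[str] = []
    for chunk in meta_chunks:
        extended.append(chunk)
        # Distractors look topically relevant but break the true dependency
        # chain, forcing the model to attend back to earlier meta-chunks.
        extended.extend(retrieve_hard_negatives(chunk, negatives_per_chunk))
    return "\n\n".join(extended)


if __name__ == "__main__":
    def fake_retriever(query: str, k: int) -> List[str]:
        # Stand-in for a real similarity search over a pretraining corpus.
        return [f"[distractor {i} similar to: {query[:40]}...]" for i in range(k)]

    doc = "word " * 2000
    long_doc = synthesize_long_document(doc, fake_retriever)
    print(f"synthetic document length: {len(long_doc.split())} words")
```

The resulting synthetic document is then tokenized and trained on with the standard next-token prediction loss, just like ordinary pretraining data.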

---

Key Insights from this Paper πŸ’‘:

β†’ Hard negative distractors force the model to improve long-range attention.

β†’ Outperforms all previous synthetic long-context methods, including Quest.

β†’ Maintains short-text performance while improving long-text comprehension.

β†’ Does not require naturally long documents, reducing training dependency on limited data sources.

-----

Results πŸ“Š:

β†’ 7.33% improvement over Quest, the previous best synthetic long-context method.

β†’ 62.58% average accuracy on HELMET benchmark, outperforming all baselines.

β†’ Achieves comparable or superior performance to ProLong models without using naturally long documents.
