
"NExtLong: Toward Effective Long-Context Training without Long Documents"

The podcast below was generated with Google's Illuminate.

Hard negatives are the secret weapon NExtLong uses to make LLMs excel at ultra-long context tasks.

NExtLong introduces a novel training framework that synthesizes long-context data by interleaving hard negative distractors between dependent text chunks, significantly improving long-range dependency modeling in LLMs.

---

Paper - https://arxiv.org/abs/2501.12766

Original Problem 😕:

→ LLMs struggle with long-range dependency modeling due to the scarcity of naturally occurring long documents.

→ Existing long-context synthesis methods concatenate short texts but lack a mechanism to reinforce dependency learning.

→ Models get easily distracted by irrelevant content as context length increases.

Solution in this Paper 👨‍🔧:

→ NExtLong uses Negative Document Extension to enhance long-context training without relying on long documents.

→ A document is broken into meta-chunks, maintaining local semantic integrity.

→ Hard negative distractors are retrieved from a pretraining corpus and interleaved between meta-chunks.

→ The model is forced to differentiate between meaningful long-range dependencies and misleading distractors.

→ This approach improves the model's ability to capture long-distance contextual relationships.

→ Training is performed using a next-token prediction loss, which optimizes for better long-context understanding.
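To make the synthesis recipe above concrete, here is a minimal Python sketch of the idea. The helper names (split_into_meta_chunks, retrieve_hard_negatives, synthesize_long_document), the toy hashed bag-of-words embedder, and the chunk/negative counts are illustrative assumptions, not the paper's released implementation.

```python
from typing import List
import numpy as np

def split_into_meta_chunks(document: str, chunk_size: int = 128) -> List[str]:
    """Split a short document into meta-chunks that keep local semantics intact
    (here: fixed-size word windows; a real pipeline would respect sentence boundaries)."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(texts: List[str], dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedder standing in for a real passage encoder."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

def retrieve_hard_negatives(chunk: str, corpus: List[str], k: int = 2) -> List[str]:
    """Retrieve the corpus passages most similar to the chunk: plausible-looking
    distractors that are nevertheless irrelevant to the document's own dependencies."""
    sims = embed(corpus) @ embed([chunk])[0]
    return [corpus[i] for i in np.argsort(-sims)[:k]]

def synthesize_long_document(document: str, corpus: List[str],
                             chunk_size: int = 128, negatives_per_chunk: int = 2) -> str:
    """Negative document extension: interleave hard negatives between consecutive
    meta-chunks so that dependent chunks end up far apart, separated by distractors."""
    chunks = split_into_meta_chunks(document, chunk_size)
    pieces: List[str] = []
    for i, chunk in enumerate(chunks):
        pieces.append(chunk)
        if i < len(chunks) - 1:  # no distractors needed after the final chunk
            pieces.extend(retrieve_hard_negatives(chunk, corpus, negatives_per_chunk))
    return "\n\n".join(pieces)
```

The synthesized document is then used as ordinary pretraining text under the standard next-token prediction loss, so recovering the original document's long-range dependencies requires the model to attend across the interleaved distractors.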

---

Key Insights from this Paper 💡:

→ Hard negative distractors force the model to improve long-range attention.

→ Outperforms all previous synthetic long-context methods, including Quest.

→ Maintains short-text performance while improving long-text comprehension.

→ Does not require naturally long documents, reducing training dependency on limited data sources.

---

Results 📊:

→ 7.33% improvement over the previous best synthetic long-context method (Quest).

→ 62.58% average accuracy on the HELMET benchmark, outperforming all baselines.

→ Achieves comparable or superior performance to ProLong models without using naturally long documents.
