Hard negatives are the secret weapon NExtLong uses to make LLMs excel at ultra-long context tasks.
NExtLong introduces a novel training framework that synthesizes long-context data by interleaving hard negative distractors between dependent text chunks, significantly improving long-range dependency modeling in LLMs.
---
Paper - https://arxiv.org/abs/2501.12766
Original Problem 😕:
→ LLMs struggle with long-range dependency modeling due to the scarcity of naturally occurring long documents.
→ Existing long-context synthesis methods concatenate short texts but lack a mechanism to reinforce dependency learning.
→ Models get easily distracted by irrelevant content as context length increases.
Solution in this Paper 👨‍🔧:
→ NExtLong uses Negative Document Extension to enhance long-context training without relying on long documents.
→ A document is broken into meta-chunks, maintaining local semantic integrity.
→ Hard negative distractors are retrieved from a pretraining corpus and interleaved between meta-chunks.
→ The model is forced to differentiate between meaningful long-range dependencies and misleading distractors.
→ This approach improves the model’s ability to capture long-distance contextual relationships.
→ Training uses a standard next-token prediction loss over the synthesized documents, rewarding the model for attending across the distractors to the genuinely dependent chunks (a minimal sketch of the pipeline follows this list).
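For a concrete picture of the data synthesis, here is a minimal Python sketch of the negative document extension step, under stated assumptions: the function names, chunk size, and the `search_index` retrieval callable are illustrative placeholders rather than the paper's released code, and any dense retriever over the pretraining corpus could stand in for the retrieval step.

```python
# Minimal sketch of NExtLong-style negative document extension (illustrative, not official code).
from typing import Callable, List

def chunk_document(doc: str, chunk_size: int = 1024) -> List[str]:
    """Split a source document into meta-chunks that keep local semantics intact."""
    tokens = doc.split()  # placeholder tokenizer; a real pipeline would use the model's tokenizer
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def retrieve_hard_negatives(
    chunk: str,
    search_index: Callable[[str, int], List[str]],
    k: int = 4,
) -> List[str]:
    """Fetch k distractor passages from a pretraining corpus that are
    semantically similar to the chunk (hard negatives), e.g. via a dense retriever."""
    return search_index(chunk, k)

def synthesize_long_document(
    doc: str,
    search_index: Callable[[str, int], List[str]],
    negatives_per_chunk: int = 4,
) -> str:
    """Interleave hard negative distractors between consecutive meta-chunks,
    so dependent chunks end up far apart, separated by misleading content."""
    chunks = chunk_document(doc)
    pieces: List[str] = []
    for chunk in chunks:
        pieces.append(chunk)
        pieces.extend(retrieve_hard_negatives(chunk, search_index, negatives_per_chunk))
    # The resulting long document is then fed to standard next-token prediction
    # training, which forces the model to look past the distractors.
    return "\n\n".join(pieces)
```

The key design choice is that the distractors are *hard* negatives: they are retrieved for their similarity to the meta-chunks, so the model cannot ignore them by topic alone and must learn genuine long-range dependencies.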
---
Key Insights from this Paper 💡:
→ Hard negative distractors force the model to improve long-range attention.
→ Outperforms all previous synthetic long-context methods, including Quest.
→ Maintains short-text performance while improving long-text comprehension.
→ Does not require naturally long documents, reducing training dependency on limited data sources.
---
Results 📊:
→ 7.33% improvement over Quest, the previous best long-context synthesis method.
→ 62.58% average accuracy on the HELMET benchmark, outperforming all baselines.
→ Achieves comparable or superior performance to ProLong models without using naturally long documents.