Hard negatives are the secret weapon NExtLong uses to make LLMs excel at ultra-long-context tasks.
NExtLong introduces a novel training framework that synthesizes long-context data by interleaving hard negative distractors between dependent text chunks, significantly improving long-range dependency modeling in LLMs.
---
Paper - https://arxiv.org/abs/2501.12766
Original Problem:
→ LLMs struggle with long-range dependency modeling due to the scarcity of naturally occurring long documents.
→ Existing long-context synthesis methods concatenate short texts but lack a mechanism to reinforce dependency learning.
→ Models get easily distracted by irrelevant content as context length increases.
Solution in this Paper:
→ NExtLong uses Negative Document Extension to enhance long-context training without relying on long documents.
→ A document is broken into meta-chunks, each preserving local semantic integrity.
→ Hard negative distractors are retrieved from a pretraining corpus and interleaved between the meta-chunks (see the sketch after this list).
→ The model is forced to differentiate meaningful long-range dependencies from misleading distractors.
→ This strengthens the model's ability to capture long-distance contextual relationships.
→ Training uses a standard next-token prediction loss over the synthesized documents, optimizing for long-context understanding.
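To make the data-synthesis step concrete, here is a minimal sketch of the Negative Document Extension idea in Python. It assumes a sentence-transformers encoder and a FAISS index built over a separate pretraining corpus; the function names, the embedding model, and the chunk/neighbor sizes are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of NExtLong-style Negative Document Extension.
# Assumptions: sentence-transformers encoder + FAISS index over a pretraining
# corpus; names and hyperparameters are illustrative, not the paper's code.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works here

def split_into_meta_chunks(document: str, chunk_size: int = 512) -> list[str]:
    """Split a document into contiguous meta-chunks that keep local semantics intact."""
    tokens = document.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def build_corpus_index(corpus: list[str]) -> faiss.IndexFlatIP:
    """Embed the pretraining corpus and index it for similarity search."""
    vecs = embedder.encode(corpus, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def synthesize_long_document(document: str, corpus: list[str],
                             index: faiss.IndexFlatIP, k: int = 4) -> str:
    """Interleave hard negative distractors (similar but unrelated passages)
    between consecutive meta-chunks to form one long training document."""
    pieces = []
    for chunk in split_into_meta_chunks(document):
        pieces.append(chunk)
        query = embedder.encode([chunk], normalize_embeddings=True)
        _, ids = index.search(np.asarray(query, dtype="float32"), k)
        # Top-similarity passages from the unrelated corpus act as hard negatives.
        pieces.extend(corpus[i] for i in ids[0] if i >= 0)
    return "\n\n".join(pieces)
```

The extended document is then fed to ordinary causal language-model training: to predict the next token correctly, the model has to attend past the interleaved distractors and recover the dependencies between the original meta-chunks.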
---
Key Insights from this Paper:
→ Hard negative distractors force the model to improve long-range attention.
→ Outperforms all previous synthetic long-context methods, including Quest.
→ Maintains short-text performance while improving long-text comprehension.
→ Does not require naturally long documents, reducing training dependency on limited data sources.
---
Results:
→ 7.33% improvement over the previous best synthetic long-context method (Quest).
→ 62.58% average accuracy on the HELMET benchmark, outperforming all baselines.
→ Achieves comparable or superior performance to ProLong models without using naturally long documents.