Ensemble method leveraging N-gram predictions refines next-word prediction in language models.
📚 https://arxiv.org/pdf/2409.03295
Original Problem 🔍:
Causal language modeling (CLM) can lead models to focus too heavily on local dependencies within sentences. N-gram prediction has been used in masked language modeling and machine translation, but it has not been extensively explored for CLM.
-----
Solution in this Paper 💡:
• Introduces a simple N-gram prediction framework for CLM, where the model predicts the next N words at each position rather than only the next one (see the training sketch after this list)
• Proposes word difference representation (WDR) as a contextualized target
• Develops an ensemble method incorporating the future N word predictions into next-word prediction
• Applies WDR to encoder-decoder models for neural machine translation
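
A minimal training sketch of the idea in PyTorch. The head layout, the reading of WDR as differences of consecutive word embeddings, and the auxiliary MSE term are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramCLM(nn.Module):
    """Sketch: N prediction heads on a causal backbone; head n predicts
    the n-th future token. `backbone` is any causal encoder mapping
    (B, T, d) -> (B, T, d)."""

    def __init__(self, backbone, vocab_size, d_model, n_future=3):
        super().__init__()
        self.backbone = backbone
        self.embed = nn.Embedding(vocab_size, d_model)
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )
        self.wdr_proj = nn.Linear(d_model, d_model)  # predicts WDR vectors (assumed)
        self.n_future = n_future

    def wdr_targets(self, tokens, n):
        # Assumed WDR: the n-th future word's target is its embedding
        # minus the embedding of the word right before it, giving a
        # context-dependent (hence more diverse) target per position.
        e = self.embed(tokens)                        # (B, T, d)
        return (e[:, n:] - e[:, n - 1:-1]).detach()   # (B, T - n, d); detached for simplicity

    def forward(self, tokens):                        # tokens: (B, T)
        h = self.backbone(self.embed(tokens))         # (B, T, d)
        loss = 0.0
        for n, head in enumerate(self.heads, start=1):
            ctx = h[:, :-n]                           # positions with an n-th future token
            logits = head(ctx)                        # (B, T - n, V)
            tgt = tokens[:, n:]                       # the n-th future token ids
            ce = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), tgt.reshape(-1)
            )
            # Illustrative auxiliary term pulling hidden states toward WDR targets.
            mse = F.mse_loss(self.wdr_proj(ctx), self.wdr_targets(tokens, n))
            loss = loss + ce + mse
        return loss / self.n_future
```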
-----
Key Insights from this Paper 💡:
• Word difference representation (WDR) provides more diverse target representations than standard embeddings
• Higher gradient diversity from WDR may improve generalization
• N-gram prediction and WDR consistently improve over baseline CLM models
• Ensemble method further boosts performance, especially on smaller datasets (see the inference sketch below)
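
A sketch of how the ensemble could work at inference, reusing `NGramCLM` from the training sketch above: head n, read off at position T-n, gives an independent view of the same next token. Uniform averaging of the heads' probabilities is an assumption; the paper may weight or combine them differently:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_next_token_logprobs(model, tokens):
    """Ensemble next-token prediction from all N heads (sketch).

    Head n applied at position T - n predicts the token at position T,
    i.e. the same next token the ordinary CLM head predicts at T - 1.
    """
    h = model.backbone(model.embed(tokens))        # (B, T, d)
    T = tokens.size(1)
    views = []
    for n, head in enumerate(model.heads, start=1):
        if T - n < 0:
            break                                  # not enough context for this head
        logits = head(h[:, T - n])                 # (B, V): head n's view of token T
        views.append(F.softmax(logits, dim=-1))
    # Uniform average of the probability views (assumed weighting).
    return torch.log(torch.stack(views).mean(dim=0))
```

The extra heads add supervision at training time; at inference they can either be dropped, keeping standard decoding cost, or ensembled as above.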
-----
Results 📊:
• WDR N-gram CLM outperforms baselines on multiple benchmarks:
- 19.2% perplexity (PPL) reduction on PTB vs. Tensorized Transformer
- 7.5% PPL reduction on WikiText-103 vs. Reformer
• 0.7-1.5 BLEU improvement on the IWSLT14 En-De NMT task