
"N-gram Prediction and Word Difference Representations for Language Modeling"

This podcast was generated with Google's Illuminate.

An ensemble method leveraging N-gram predictions refines next-word prediction in language models.

📚 https://arxiv.org/pdf/2409.03295

Original Problem 🔍:

Causal language modeling (CLM) can lead models to focus too heavily on local dependencies within sentences. N-gram prediction has been used in masked language modeling and machine translation, but it has not been extensively explored for CLM.

-----

Solution in this Paper 💡:

• Introduces simple N-gram prediction framework for CLM

• Proposes word difference representation (WDR) as contextualized target

• Develops ensemble method incorporating predictions of the next N words to refine next-word prediction (see the sketch after this list)

• Applies WDR to encoder-decoder models for neural machine translation
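
To make the WDR targets, the N-gram heads, and the ensemble rule concrete, here is a minimal PyTorch sketch. It is illustrative only: the module and variable names, the MSE regression loss for the difference targets, and the simple probability-averaging ensemble are assumptions of this sketch, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NgramWDRHeads(nn.Module):
    """N future-word heads on top of a causal LM's hidden states (illustrative).

    The head with offset k = i+1 predicts the word k steps ahead. For offsets > 1
    the target is a word-difference representation (WDR):
        emb(x_{s+k}) - emb(x_{s+k-1}),
    i.e. the difference between consecutive future word embeddings.
    """

    def __init__(self, d_model: int, vocab_size: int, n: int = 3):
        super().__init__()
        self.n = n
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n)])
        self.out = nn.Linear(d_model, vocab_size)  # standard next-word head

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d) causal hidden states for positions 0..T-1
        # tokens: (B, T + n) token ids, so every offset has a target
        B, T, _ = hidden.shape
        losses = []
        for i in range(self.n):
            h = self.proj[i](hidden)                      # (B, T, d)
            tgt = tokens[:, i + 1 : i + 1 + T]            # word at offset i+1
            if i == 0:
                # standard next-word cross-entropy
                logits = self.out(h)
                losses.append(F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)), tgt.reshape(-1)))
            else:
                # WDR target: difference of consecutive future word embeddings
                prev = tokens[:, i : i + T]
                wdr = self.embed(tgt) - self.embed(prev)
                losses.append(F.mse_loss(h, wdr))         # assumed regression loss
        return torch.stack(losses).mean()

    @torch.no_grad()
    def ensemble_next_word(self, hidden: torch.Tensor,
                           last_token: torch.Tensor) -> torch.Tensor:
        # Refine P(x_T | context) by averaging the distributions that each head
        # produces for the SAME future word: the head with offset i+1, evaluated
        # at position T-1-i, also targets x_T.
        T = hidden.size(1)
        base = self.embed(last_token)                     # emb(x_{T-1}), (B, d)
        probs = []
        for i in range(min(self.n, T)):
            h = self.proj[i](hidden[:, T - 1 - i])        # (B, d)
            if i == 0:
                logits = self.out(h)
            else:
                # undo the difference: predicted emb(x_T) ~ WDR_pred + emb(x_{T-1}),
                # then score against the embedding matrix
                logits = (h + base) @ self.embed.weight.T
            probs.append(F.softmax(logits, dim=-1))
        return torch.stack(probs).mean(dim=0)             # (B, vocab)
```

At inference, ensemble_next_word averages the next-word distributions obtained from positions t, t-1, ..., which is one way to realize the idea of refining next-word prediction with the future N word predictions; the paper may combine them differently.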

-----

Key Insights from this Paper 💡:

• Word difference representation (WDR) provides more diverse target representations than standard embeddings

• Higher gradient diversity from WDR may improve generalization

• N-gram prediction and WDR consistently improve over baseline CLM models

• Ensemble method further boosts performance, especially on smaller datasets

-----

Results 📊:

• WDR N-gram CLM outperforms baselines on multiple benchmarks:

- 19.2% perplexity (PPL) reduction on Penn Treebank (PTB) vs Tensorized Transformer

- 7.5% PPL reduction on WikiText-103 vs Reformer

• 0.7-1.5 BLEU score improvements on IWSLT14 En-De NMT task
