Decoder-only models getting all the hype? Encoder-decoders are the real MVP for Small Language Models.
This paper tackles the neglect of efficient architectures for small language models. It shows that encoder-decoder models outperform decoder-only models at smaller scales, especially for on-device deployment, and proposes a knowledge distillation method that lets them learn from large decoder-only teachers.
-----
https://arxiv.org/abs/2501.16273
Original Problem 🤔:
→ Decoder-only LLMs dominate current trends.
→ Encoder-decoder architectures have been largely overlooked as the field chases ever-larger decoder-only models.
→ Small language models need efficient architectures for resource-limited environments.
→ At SLM scale, decoder-only models suffer from high first-token latency and low throughput on edge devices.
-----
Solution in this Paper 💡:
→ This paper systematically analyzes encoder-decoder architectures for small language models.
→ It highlights the efficiency advantages of encoder-decoder models over decoder-only models for SLMs.
→ Encoder-decoder models process the input once, giving a fixed memory footprint for the prompt.
→ Decoder-only models must attend over the full input at every decoding step, so their KV cache grows with both prompt and generated tokens.
→ The paper introduces a knowledge distillation framework.
→ This framework transfers knowledge from large decoder-only teacher models to small encoder-decoder student models.
→ The framework uses a novel sequence alignment strategy, mapping the teacher's single concatenated prompt-plus-response sequence onto the student's separate encoder and decoder inputs.
→ It combines reverse KL-divergence and cross-entropy loss for effective knowledge transfer (a minimal loss sketch follows this list).
→ The architecture incorporates Rotary Positional Embeddings and supports vision encoders for multimodal input (a RoPE sketch also follows).
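
To make the distillation loss above concrete, here is a minimal PyTorch-style sketch of a combined reverse-KL + cross-entropy objective. It is an illustration, not the paper's implementation: `alpha`, `tau`, the masking scheme, and the tensor interface are assumptions, and the sequence-alignment step is assumed to have been done by the caller.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Combined cross-entropy + reverse-KL distillation objective (sketch).

    student_logits, teacher_logits: (batch, seq, vocab), already aligned so
    position i in both tensors predicts the same target token (the paper's
    sequence-alignment step, assumed handled upstream).
    labels: (batch, seq) gold token ids, -100 for padding.
    alpha, tau: illustrative hyperparameters, not taken from the paper.
    """
    vocab = student_logits.size(-1)
    # Hard-target cross-entropy against the gold labels.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         labels.reshape(-1), ignore_index=-100)
    # Reverse KL, D_KL(student || teacher): mode-seeking, so the small
    # student concentrates mass on the teacher's dominant modes instead
    # of smearing probability over everything it cannot fit.
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    t_logp = F.log_softmax(teacher_logits.detach() / tau, dim=-1)
    rkl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    # Average only over real (non-padding) positions.
    mask = (labels != -100).float()
    rkl = (rkl * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * ce + (1.0 - alpha) * rkl
```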
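
The solution bullets also mention Rotary Positional Embeddings; here is a compact sketch of the standard RoPE formulation (again an illustration, not this paper's code):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotary Positional Embeddings: rotate each consecutive channel pair
    of a query/key tensor by an angle proportional to its position, so
    attention scores become a function of relative position.
    x: (batch, seq, heads, head_dim), head_dim even."""
    b, t, h, d = x.shape
    pos = torch.arange(t, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    ang = pos[:, None] * inv_freq[None, :]             # (seq, head_dim/2)
    cos = ang.cos()[None, :, None, :]                  # broadcast over batch, heads
    sin = ang.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]                # split channel pairs
    out = torch.stack((x1 * cos - x2 * sin,
                       x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)                             # back to (b, t, h, d)

q = torch.randn(1, 8, 4, 64)
q_rot = apply_rope(q)   # same shape, now position-aware
```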
-----
Key Insights from this Paper 🧐:
→ Encoder-decoder architectures are inherently more efficient for small language models due to one-time input processing.
→ This efficiency translates into lower first-token latency and higher throughput than decoder-only models at SLM scale (see the back-of-envelope sketch after this list).
→ The information bottleneck in encoder-decoder models becomes an advantage at smaller scales, acting as a valuable inductive bias.
→ Knowledge distillation effectively bridges the gap, allowing efficient encoder-decoder models to learn from larger decoder-only models.
→ Architectural choices are crucial for parameter efficiency in resource-constrained environments, especially for SLMs.
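
To make that latency and memory intuition concrete, here is a rough back-of-envelope sketch in Python. It is not the paper's measurement code: the model dimensions, the even encoder/decoder layer split, and the fp16 assumption are all illustrative.

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, dtype_bytes=2):
    """Keys + values for every layer, per sequence, in fp16."""
    return 2 * layers * heads * head_dim * tokens * dtype_bytes

# Toy configuration (assumed, not the paper's models).
layers, heads, head_dim = 24, 16, 64
prompt, generated = 512, 256

# Decoder-only: every layer's cache covers prompt + generated tokens
# and keeps growing with each decoding step.
dec_only = kv_cache_bytes(layers, heads, head_dim, prompt + generated)

# Encoder-decoder (layers split evenly): the prompt is encoded once into
# fixed states reused by cross-attention, while only the decoder's
# self-attention cache grows, over the generated tokens alone.
enc_states = kv_cache_bytes(layers // 2, heads, head_dim, prompt)
dec_cache = kv_cache_bytes(layers // 2, heads, head_dim, generated)

print(f"decoder-only : {dec_only / 2**20:.1f} MiB")        # ~72 MiB
print(f"enc-dec      : {(enc_states + dec_cache) / 2**20:.1f} MiB")  # ~36 MiB
```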
-----
Results 📊:
→ Encoder-decoder models achieve 47% lower first-token latency on edge devices.
→ Encoder-decoder models achieve 4.7x higher throughput on edge devices.
→ Encoder-decoder models with knowledge distillation improve average performance by up to 6 points across diverse tasks.
→ Encoder-decoder models use 11-16% less memory and require only 78% of the FLOPs of decoder-only models for 256-token generation.