Decoder-only Transformers hit mathematical walls that deeper networks can exponentially overcome.
This paper proves the first unconditional lower bounds for multi-layer decoder-only Transformers, showing they need a model dimension polynomial in the input length to perform sequential function composition, which yields provable depth-width trade-offs.
-----
https://arxiv.org/abs/2412.02975
🤔 Original Problem:
→ While Transformers power modern LLMs, their computational limitations remain poorly understood beyond single-layer models, with previous work relying on unproven complexity assumptions.
-----
🔍 Solution in this Paper:
→ The researchers introduce a novel multi-party autoregressive communication model that captures decoder-only Transformer computation.
→ They develop an indistinguishable-decomposition technique that iteratively finds structured sets of inputs the model cannot tell apart, which forces the lower bound.
→ The proof exploits the information bottleneck of decoder-only architectures, where each token can attend only to previous tokens (see the sketch below).
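A minimal NumPy sketch of that causal masking (my illustration, not the paper's construction): position i aggregates information only from positions 0..i, which is exactly the bottleneck the communication argument exploits.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with a causal mask; Q, K, V have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                     # (n, n) pairwise scores
    future = np.triu(np.ones((n, n), dtype=bool), 1)  # positions j > i
    scores[future] = -np.inf                          # token i cannot attend to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # row i mixes only rows 0..i of V

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # (6, 4)
```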
-----
💡 Key Insights:
→ L-layer decoder-only Transformers require a model dimension polynomial in the input length for L-step sequential composition (the task is sketched after this list)
→ Deeper models are exponentially more efficient than shallower ones for composition tasks
→ Encoder architectures can solve tasks with exponentially smaller and shallower networks than decoders
→ Chain-of-thought reasoning provides provable computational advantages
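The hard task behind the first two insights is L-step sequential function composition. A toy rendering (my simplified encoding; the paper's exact input format may differ): the prompt lists L functions on {0, …, n−1} as lookup tables plus a start point x, and the answer is f_L(…f_2(f_1(x))…).

```python
import random

def compose_task_answer(tables, x):
    """Reference solution: apply the L lookup tables in sequence."""
    for f in tables:   # one lookup per composition step
        x = f[x]
    return x

n, L = 8, 3
rng = random.Random(0)
tables = [[rng.randrange(n) for _ in range(n)] for _ in range(L)]  # L random functions on {0..n-1}
x0 = rng.randrange(n)
print(compose_task_answer(tables, x0))
```

Intuitively, each composition step appears to consume roughly one round of attention, which is where the depth-width trade-off comes from.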
-----
📊 Results:
→ For L-layer decoders, model dimension must be n^Ω(1) for L-function composition, where n is the input length
→ (L+1)-layer models need only a polylogarithmic model dimension, vs. polynomial for L-layer models (both bounds are restated compactly below)
→ O(log L)-layer encoders solve tasks requiring L layers in decoders
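A compact restatement of the first two bullets (notation mine: d = model dimension, n = input length, L = depth):

```latex
\[
  \text{$L$-layer decoder needs}\quad d = n^{\Omega(1)}
  \qquad\text{vs.}\qquad
  \text{$(L+1)$-layer decoder suffices with}\quad d = \mathrm{polylog}(n)
\]
```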