"Theoretical limitations of multi-layer Transformer"

The podcast on this paper is generated with Google's Illuminate.

Decoder-only Transformers hit hard mathematical walls that deeper networks can get past exponentially more cheaply.

This paper proves the first unconditional computational limitations of multi-layer decoder-only Transformers, showing that L-layer models need polynomially large model dimension (in the input length) to perform L-step sequential function composition, which yields provable depth-width trade-offs.
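
To make the task concrete, here is a minimal Python sketch of an L-step sequential function composition instance: the prompt spells out L lookup tables f_1, ..., f_L and a starting element x, and the answer is f_L(...f_2(f_1(x))...). The encoding below (explicit tables, the `make_instance` helper) is an illustrative assumption, not the paper's exact tokenization.

```python
import random

def make_instance(n, L, seed=0):
    """Build a toy L-step composition instance over the domain {0, ..., n-1}.

    Each f_i is a random function written out as an explicit lookup table,
    mimicking a prompt that spells the functions out token by token.
    (Illustrative encoding only; the paper's exact task format differs.)
    """
    rng = random.Random(seed)
    tables = [[rng.randrange(n) for _ in range(n)] for _ in range(L)]
    x = rng.randrange(n)
    return tables, x

def compose(tables, x):
    """Ground-truth answer: apply f_1, then f_2, ..., then f_L."""
    for table in tables:
        x = table[x]
    return x

tables, x = make_instance(n=8, L=3)
print("start:", x, "-> answer:", compose(tables, x))
```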

-----

https://arxiv.org/abs/2412.02975

🤔 Original Problem:

→ While Transformers power modern LLMs, their computational limitations remain poorly understood beyond single-layer models, and prior multi-layer results relied on unproven complexity-theoretic assumptions.

-----

🔍 Solution in this Paper:

→ The researchers introduce a novel multi-party autoregressive communication model that captures decoder-only Transformer computation.

→ They develop an "indistinguishable decomposition" technique that iteratively finds large sets of inputs the model cannot distinguish even though they require different outputs, which forces errors and proves the lower bound.

→ The proof exploits the information bottleneck of decoder-only architectures, where each token can attend only to previous tokens (a toy sketch of this one-way information flow appears below).
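
As a rough illustration of that bottleneck, here is a toy Python sketch of an autoregressive communication protocol in the spirit of the paper's model (my own simplification, with hypothetical party names and message sizes): the input is split among parties that speak in a fixed order, and each party sees only the messages of earlier parties, just as a decoder position attends only to earlier positions.

```python
from typing import Callable, List

def autoregressive_protocol(chunks: List[str],
                            speak: Callable[[int, str, List[str]], str],
                            msg_limit: int) -> List[str]:
    """Toy autoregressive communication: party i sees its own input chunk and
    the messages of parties 0..i-1 (never later ones), then broadcasts a short
    message. The bounded message length plays the role of the model width,
    and the one-way flow mirrors causal attention in a decoder.
    """
    transcript: List[str] = []
    for i, chunk in enumerate(chunks):
        msg = speak(i, chunk, list(transcript))[:msg_limit]  # bounded "width"
        transcript.append(msg)
    return transcript

# Hypothetical run: each party forwards only a tiny summary of what it holds,
# so the last party (which must answer) sees just a few short messages.
chunks = ["f1-table", "f2-table", "f3-table", "x=5"]
print(autoregressive_protocol(
    chunks,
    speak=lambda i, chunk, prev: f"p{i}:{chunk[:4]}",
    msg_limit=8,
))
```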

-----

💡 Key Insights:

→ L-layer decoder-only Transformers require polynomial model dimension for L-step sequential composition

→ Deeper models are exponentially more efficient than shallower ones for composition tasks

→ Encoder architectures can solve tasks with exponentially smaller and shallower networks than decoders

→ Chain-of-thought reasoning provides provable computational advantages (see the sketch after this list)
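
One informal way to see why chain-of-thought helps on this task (a sketch of the intuition, not the paper's construction): if the model may write each intermediate value into the context, every step becomes a single lookup into an already-visible token, so no single forward pass has to compose many functions at once.

```python
def compose_with_cot(tables, x):
    """Chain-of-thought style evaluation: emit every intermediate value, so
    each generation step is one lookup into visible context rather than a
    deep composition. (Informal illustration, not the paper's construction.)
    """
    trace = [x]
    for table in tables:
        trace.append(table[trace[-1]])  # one easy lookup per emitted token
    return trace  # trace[-1] is the answer; trace[1:-1] is the scratchpad

# Tiny hand-made example over the domain {0, 1, 2}:
f1, f2, f3 = [1, 2, 0], [2, 0, 1], [0, 2, 1]
print(compose_with_cot([f1, f2, f3], x=0))  # [0, 1, 0, 0]
```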

-----

📊 Results:

→ For L-layer decoders, the model dimension must be n^Ω(1) to compose L functions, where n is the input length

→ (L+1)-layer models need only polylogarithmic model dimension, versus polynomial for L-layer models

→ O(log L)-layer encoders can solve tasks that require L layers in decoders (a symbolic summary of these bounds follows below)
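
Read informally, with n the input length and d the model dimension, the bounds above amount to the following trade-off (my own symbolic paraphrase; constants and the exact task parameters are in the paper):

```latex
% Symbolic paraphrase of the stated bounds (not verbatim from the paper).
% n = input length, d = model dimension, task = L-step function composition.
\begin{aligned}
\text{decoder, } L \text{ layers:}   \quad & d \ge n^{\Omega(1)} \\
\text{decoder, } L+1 \text{ layers:} \quad & d = \operatorname{poly}(\log n) \text{ suffices} \\
\text{encoder:}                      \quad & O(\log L) \text{ layers with exponentially smaller width suffice}
\end{aligned}
```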
