Math cracks how AI understands that 'bark' means both tree-skin and dog-sound.
This paper mathematically proves how Transformers use multiple word meanings for in-context learning.
https://arxiv.org/abs/2411.02199
🤔 Original Problem:
LLMs show remarkable in-context learning abilities without fine-tuning, but we lack a theoretical understanding of how they leverage word semantics across different concepts to achieve this capability.
-----
🔧 Solution in this Paper:
→ Introduces a concept-based sparse coding prompt model in which words and labels carry multiple feature embeddings, one per concept they belong to (sketched in the first code block below)
→ Analyzes a two-layer Transformer: a softmax attention layer followed by a ReLU-activated feed-forward network (see the second code block below)
→ Assumes feature embeddings have positive inner products within a concept and are orthogonal across concepts
→ Trains with cross-entropy loss and stochastic gradient descent
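Here is a minimal sketch (my own illustration, not code from the paper) of that multi-concept embedding geometry: each concept owns its own coordinate block, so features from different concepts are orthogonal, while features within a concept keep a positive inner product. The sizes and names (n_concepts, dim_per_concept, concept_feature) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_concepts, dim_per_concept = 4, 16          # illustrative sizes, not from the paper
d = n_concepts * dim_per_concept             # total embedding dimension

# One unit anchor direction per concept.
anchors = [rng.standard_normal(dim_per_concept) for _ in range(n_concepts)]
anchors = [a / np.linalg.norm(a) for a in anchors]

def concept_feature(concept_id, anchor, noise=0.1):
    """Draw a feature for `concept_id` inside that concept's coordinate block.
    Small noise keeps within-concept inner products positive, while disjoint
    blocks make cross-concept features exactly orthogonal."""
    v = np.zeros(d)
    block = slice(concept_id * dim_per_concept, (concept_id + 1) * dim_per_concept)
    v[block] = anchor + noise * rng.standard_normal(dim_per_concept)
    return v / np.linalg.norm(v)

# A polysemous word like "bark" gets one feature per concept it belongs to.
bark_as_tree_skin = concept_feature(0, anchors[0])   # concept 0: plants
bark_as_dog_sound = concept_feature(1, anchors[1])   # concept 1: animal sounds
another_plant_word = concept_feature(0, anchors[0])

print(bark_as_tree_skin @ another_plant_word)   # positive: same concept
print(bark_as_tree_skin @ bark_as_dog_sound)    # 0.0: different concepts are orthogonal
```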
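And a rough PyTorch sketch, under my own assumptions, of the analyzed setup: one softmax self-attention layer feeding a ReLU MLP, trained on prompts with cross-entropy loss and plain SGD. The class name, hyperparameters, and random placeholder data are mine, not the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerTransformer(nn.Module):
    """Minimal stand-in for the architecture studied in the paper:
    a single softmax self-attention layer followed by a ReLU feed-forward
    network, read out at the query position for binary classification."""
    def __init__(self, d_embed, d_hidden):
        super().__init__()
        self.q = nn.Linear(d_embed, d_embed, bias=False)
        self.k = nn.Linear(d_embed, d_embed, bias=False)
        self.v = nn.Linear(d_embed, d_embed, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_embed, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 2),          # two labels -> cross-entropy readout
        )

    def forward(self, prompt):
        # prompt: (batch, seq_len, d_embed); the last token is the query.
        scores = self.q(prompt) @ self.k(prompt).transpose(-1, -2) / prompt.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)
        mixed = attn @ self.v(prompt)
        return self.mlp(mixed[:, -1, :])     # logits at the query position

# Toy training loop with cross-entropy loss and SGD, mirroring the training
# setup the paper analyzes (all numbers here are illustrative only).
model = TwoLayerTransformer(d_embed=64, d_hidden=128)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    prompts = torch.randn(32, 9, 64)         # 8 demonstrations + 1 query (placeholder data)
    labels = torch.randint(0, 2, (32,))
    loss = F.cross_entropy(model(prompts), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```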
-----
💡 Key Insights:
→ Words can have different semantic meanings across multiple concepts
→ The model achieves exponential convergence despite the non-convexity of the training objective
→ The linearly encoded multi-concept semantic geometry lets the trained Transformer handle out-of-distribution tasks efficiently
→ The analysis places no requirement on demonstration length or batch size during training
-----
📊 Results:
→ Achieves Bayes-optimal test error after only a logarithmic number of training iterations
→ Proves exponential convergence of the 0-1 loss under cross-entropy training
→ First theoretical framework to handle softmax attention, a ReLU MLP, and cross-entropy loss simultaneously