Automatically Interpreting Millions of Features in Large Language Models
LLMs analyze other LLMs' neurons, creating human-readable explanations at massive scale.
Sparse autoencoder (SAE) features expose LLMs' internal mechanics through an automated interpretation pipeline.
Original Problem 🔍:
Interpreting millions of features in LLMs is challenging due to their vast scale and complexity.
Solution in this Paper 🛠️:
• Open-source automated pipeline that uses LLMs to generate and evaluate natural language explanations for sparse autoencoder (SAE) features (a minimal sketch follows this list)
• Five new scoring techniques: detection, fuzzing, surprisal, embedding, and intervention scoring
• Intervention scoring evaluates explanations by the effect of intervening on a feature
• Hungarian algorithm aligns SAE features across layers
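To make the explanation-generation step concrete, here is a minimal sketch of the kind of loop such a pipeline runs: an explainer LLM is shown the top-activating contexts for one SAE latent, with strongly activating tokens highlighted, and asked for a one-sentence description. All names (`ActivationRecord`, `query_llm`, the prompt wording) are illustrative assumptions, not the paper's actual API.

```python
# Sketch of automated explanation generation for one SAE latent.
# ActivationRecord and query_llm are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ActivationRecord:
    tokens: list[str]         # tokens of one context window
    activations: list[float]  # SAE latent activation per token


def format_examples(records: list[ActivationRecord], top_k: int = 10) -> str:
    """Render the top-activating contexts, wrapping strongly activating
    tokens in <<delimiters>> so the explainer LLM can see them."""
    top = sorted(records, key=lambda r: max(r.activations), reverse=True)[:top_k]
    lines = []
    for r in top:
        threshold = 0.7 * max(r.activations)
        marked = [f"<<{t}>>" if a >= threshold else t
                  for t, a in zip(r.tokens, r.activations)]
        lines.append(" ".join(marked))
    return "\n".join(lines)


def explain_latent(records: list[ActivationRecord], query_llm) -> str:
    """Ask an explainer LLM what the highlighted tokens have in common."""
    prompt = (
        "The following text snippets all strongly activate the same latent "
        "of a sparse autoencoder. Activating tokens are marked with << >>.\n\n"
        f"{format_examples(records)}\n\n"
        "In one sentence, describe what this latent responds to."
    )
    return query_llm(prompt)
```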
Key Insights from this Paper 💡:
• Sparse autoencoder (SAE) latents are more interpretable than individual neurons
• SAEs with more latents have higher interpretability scores
• SAEs trained on nearby residual stream layers are highly similar
• Residual stream SAEs have higher semantic overlap than MLP SAEs
• Efficient scoring techniques enable feedback loops to improve explanation quality
Results 📊:
• SAE latents significantly outperform neurons in interpretability scores
• Larger SAEs (131k latents) achieve higher scores than smaller ones (16k latents)
• Residual stream SAEs score slightly higher than MLP SAEs
• Intervention scoring distinguishes trained SAE features from random features
• Semantic similarity between adjacent layers is higher in residual stream SAEs than in MLP SAEs (a cross-layer alignment sketch follows below)
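Measuring that kind of cross-layer similarity requires first putting latents from adjacent layers into correspondence. Below is a hedged sketch of how the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) can match latents by the cosine similarity of their decoder directions; the shapes and variable names are assumptions for illustration, and a full 131k-by-131k assignment would in practice be restricted to a subset or solved blockwise.

```python
# Sketch: align SAE latents across adjacent layers with the Hungarian
# algorithm by maximizing total cosine similarity of decoder directions.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_latents(decoder_a: np.ndarray, decoder_b: np.ndarray):
    """decoder_a, decoder_b: (n_latents, d_model) decoder weight matrices of
    two SAEs. Returns matched index pairs and their cosine similarities."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    similarity = a @ b.T                           # pairwise cosine similarities
    row, col = linear_sum_assignment(-similarity)  # negate to maximize similarity
    return list(zip(row, col)), similarity[row, col]
```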
🔬 How does this work compare sparse autoencoders (SAEs) to individual neurons?
The large-scale analysis confirms that SAE latents are much more interpretable than individual neurons, even when neurons are sparsified using top-k postprocessing. SAEs with more latents tend to have higher interpretability scores.
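As a rough illustration of the top-k neuron baseline mentioned above, here is a minimal sketch (assumed tensor shapes, not the paper's exact procedure) of sparsifying raw neuron activations by keeping only the k largest per token:

```python
# Sketch of a top-k neuron baseline: keep only the k largest neuron
# activations per token position and zero the rest, mimicking SAE sparsity.
import torch


def topk_sparsify(activations: torch.Tensor, k: int = 32) -> torch.Tensor:
    """activations: (batch, seq, n_neurons). Zero all but the top-k per position."""
    values, indices = activations.topk(k, dim=-1)
    sparse = torch.zeros_like(activations)
    return sparse.scatter(-1, indices, values)
```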
🧠 The paper introduces five new techniques to score the quality of explanations:
Detection scoring
Fuzzing scoring
Surprisal scoring
Embedding scoring
Intervention scoring
Intervention scoring is highlighted as particularly valuable because it evaluates an explanation by how well it accounts for the effect of intervening on the corresponding feature.
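To make the scoring idea concrete, here is a hedged sketch of detection scoring under simple assumptions: a scorer LLM is given the explanation and a shuffled mix of activating and non-activating snippets, asked for a yes/no judgment on each, and the score is the balanced accuracy against the ground-truth activations. The `query_llm` helper and the prompt wording are assumptions, not the paper's exact setup.

```python
# Hedged sketch of detection scoring: a scorer LLM judges, for each snippet,
# whether the latent fires on it; the score is balanced accuracy.
import random


def detection_score(explanation: str,
                    activating: list[str],
                    non_activating: list[str],
                    query_llm) -> float:
    examples = [(t, True) for t in activating] + [(t, False) for t in non_activating]
    random.shuffle(examples)

    tp = tn = fp = fn = 0
    for text, truly_active in examples:
        prompt = (
            f"Latent explanation: {explanation}\n"
            f"Text: {text}\n"
            "Does this latent activate on the text? Answer YES or NO."
        )
        predicted_active = query_llm(prompt).strip().upper().startswith("YES")
        if predicted_active and truly_active:
            tp += 1
        elif not predicted_active and not truly_active:
            tn += 1
        elif predicted_active and not truly_active:
            fp += 1
        else:
            fn += 1

    # Balanced accuracy: mean of recall on activating and non-activating sets.
    tpr = tp / max(tp + fn, 1)
    tnr = tn / max(tn + fp, 1)
    return 0.5 * (tpr + tnr)
```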