Automatically Interpreting Millions of Features in Large Language Models
LLMs analyze other LLMs' neurons, creating human-readable explanations at massive scale.
Sparse autoencoder (SAE) features expose LLMs' internal mechanics through an automated interpretation pipeline.
Original Problem 🔍:
Interpreting millions of features in LLMs is challenging due to their vast scale and complexity.
Solution in this Paper 🛠️:
• Open-source automated pipeline that uses LLMs to generate and evaluate natural language explanations for sparse autoencoder (SAE) features (a minimal sketch follows this list)
• Five new scoring techniques: detection, fuzzing, surprisal, embedding, and intervention scoring
• Intervention scoring evaluates explanations by the effect of intervening on a feature
• Hungarian algorithm aligns SAE features across layers
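To make the explanation-generation step concrete, here is a minimal sketch of the kind of loop such a pipeline runs: an explainer LLM is shown the top-activating contexts for one SAE latent, with strongly activating tokens highlighted, and asked for a one-sentence description. All names (`ActivationRecord`, `query_llm`, the prompt wording) are illustrative assumptions, not the paper's actual API.

```python
# Sketch of automated explanation generation for one SAE latent.
# ActivationRecord and query_llm are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ActivationRecord:
    tokens: list[str]         # tokens of one context window
    activations: list[float]  # SAE latent activation per token


def format_examples(records: list[ActivationRecord], top_k: int = 10) -> str:
    """Render the top-activating contexts, wrapping strongly activating
    tokens in <<delimiters>> so the explainer LLM can see them."""
    top = sorted(records, key=lambda r: max(r.activations), reverse=True)[:top_k]
    lines = []
    for r in top:
        threshold = 0.7 * max(r.activations)
        marked = [f"<<{t}>>" if a >= threshold else t
                  for t, a in zip(r.tokens, r.activations)]
        lines.append(" ".join(marked))
    return "\n".join(lines)


def explain_latent(records: list[ActivationRecord], query_llm) -> str:
    """Ask an explainer LLM what the highlighted tokens have in common."""
    prompt = (
        "The following text snippets all strongly activate the same latent "
        "of a sparse autoencoder. Activating tokens are marked with << >>.\n\n"
        f"{format_examples(records)}\n\n"
        "In one sentence, describe what this latent responds to."
    )
    return query_llm(prompt)
```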
Key Insights from this Paper 💡:
• Sparse autoencoder (SAE) latents are more interpretable than individual neurons
• SAEs with more latents have higher interpretability scores
• SAEs trained on nearby residual stream layers are highly similar
• Residual stream SAEs have higher semantic overlap than MLP SAEs
• Efficient scoring techniques enable feedback loops to improve explanation quality
Results 📊:
• SAE latents significantly outperform neurons in interpretability scores
• Larger SAEs (131k latents) achieve higher scores than smaller ones (16k latents)
• Residual stream SAEs score slightly higher than MLP SAEs
• Intervention scoring distinguishes trained SAE features from random features
• Semantic similarity between adjacent layers is higher in residual stream SAEs than in MLP SAEs (a cross-layer alignment sketch follows below)
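Measuring that kind of cross-layer similarity requires first putting latents from adjacent layers into correspondence. Below is a hedged sketch of how the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) can match latents by the cosine similarity of their decoder directions; the shapes and variable names are assumptions for illustration, and a full 131k-by-131k assignment would in practice be restricted to a subset or solved blockwise.

```python
# Sketch: align SAE latents across adjacent layers with the Hungarian
# algorithm by maximizing total cosine similarity of decoder directions.
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_latents(decoder_a: np.ndarray, decoder_b: np.ndarray):
    """decoder_a, decoder_b: (n_latents, d_model) decoder weight matrices of
    two SAEs. Returns matched index pairs and their cosine similarities."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    similarity = a @ b.T                           # pairwise cosine similarities
    row, col = linear_sum_assignment(-similarity)  # negate to maximize similarity
    return list(zip(row, col)), similarity[row, col]
```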
🔬 How does this work compare sparse autoencoders (SAEs) to individual neurons?
The large-scale analysis confirms that SAE latents are much more interpretable than individual neurons, even when neurons are sparsified using top-k postprocessing. SAEs with more latents tend to have higher interpretability scores.
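As a rough illustration of the top-k neuron baseline mentioned above, here is a minimal sketch (assumed tensor shapes, not the paper's exact procedure) of sparsifying raw neuron activations by keeping only the k largest per token:

```python
# Sketch of a top-k neuron baseline: keep only the k largest neuron
# activations per token position and zero the rest, mimicking SAE sparsity.
import torch


def topk_sparsify(activations: torch.Tensor, k: int = 32) -> torch.Tensor:
    """activations: (batch, seq, n_neurons). Zero all but the top-k per position."""
    values, indices = activations.topk(k, dim=-1)
    sparse = torch.zeros_like(activations)
    return sparse.scatter(-1, indices, values)
```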
🧠 The paper introduces five new techniques to score the quality of explanations:
Detection scoring
Fuzzing scoring
Surprisal scoring
Embedding scoring
Intervention scoring
Intervention scoring is highlighted as particularly valuable because it evaluates an explanation by how well it accounts for the effect of intervening on the corresponding feature.
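To make the scoring idea concrete, here is a hedged sketch of detection scoring under simple assumptions: a scorer LLM is given the explanation and a shuffled mix of activating and non-activating snippets, asked for a yes/no judgment on each, and the score is the balanced accuracy against the ground-truth activations. The `query_llm` helper and the prompt wording are assumptions, not the paper's exact setup.

```python
# Hedged sketch of detection scoring: a scorer LLM judges, for each snippet,
# whether the latent fires on it; the score is balanced accuracy.
import random


def detection_score(explanation: str,
                    activating: list[str],
                    non_activating: list[str],
                    query_llm) -> float:
    examples = [(t, True) for t in activating] + [(t, False) for t in non_activating]
    random.shuffle(examples)

    tp = tn = fp = fn = 0
    for text, truly_active in examples:
        prompt = (
            f"Latent explanation: {explanation}\n"
            f"Text: {text}\n"
            "Does this latent activate on the text? Answer YES or NO."
        )
        predicted_active = query_llm(prompt).strip().upper().startswith("YES")
        if predicted_active and truly_active:
            tp += 1
        elif not predicted_active and not truly_active:
            tn += 1
        elif predicted_active and not truly_active:
            fp += 1
        else:
            fn += 1

    # Balanced accuracy: mean of recall on activating and non-activating sets.
    tpr = tp / max(tp + fn, 1)
    tnr = tn / max(tn + fp, 1)
    return 0.5 * (tpr + tnr)
```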