"Partially Rewriting a Transformer in Natural Language"
https://arxiv.org/abs/2501.18838
Understanding the internal workings of LLMs remains a core challenge, and current interpretability methods only go so far. This paper explores whether parts of an LLM can be rewritten in natural language to make them easier to understand.
Concretely, it proposes a method to partially rewrite a Transformer layer with natural language: sparse representations and an LLM are used to explain and simulate neuron activations, and the simulated activations are then integrated back into the original model.
-----
📌 Sparse transcoders offer a path to interpretability by approximating neural networks with sparse, explainable features. Natural language descriptions, though imperfect, bridge the gap to human understanding.
📌 Quantile normalization is essential for calibrating LLM-predicted activations. Raw LLM outputs for neuron activity are poorly distributed, hindering direct integration without statistical correction.
📌 Current natural language explanations are not yet sufficiently precise to fully replace neural network components. The performance remains close to zero ablation, indicating a need for richer, more specific explanations.
----------
Methods Explored in this Paper 🔧:
→ The paper trains a sparse transcoder that approximates a feedforward (MLP) block inside the LLM, using a TopK activation function to enforce sparsity.
→ A skip connection is added to the transcoder architecture, improving the approximation without hurting interpretability; a sparse autoencoder (SAE) is also trained on the residual stream (a minimal sketch of the transcoder appears after this list).
→ An automated interpretability pipeline generates natural language explanations for the transcoder and SAE latents, and is also used to score the quality of those explanations.
→ An LLM then predicts each latent's activation for a single token, given the generated explanation and the surrounding text context.
→ Quantile normalization calibrates the LLM's activation predictions so that the predicted distribution matches the true activation distribution; this step is crucial for performance.
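To make the transcoder architecture concrete, here is a minimal sketch of a TopK transcoder with a skip connection; the class name, dimensions, and training objective are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TopKTranscoder(nn.Module):
    """Sketch of a sparse transcoder: encode the MLP input, keep only the
    top-k latents, decode back to the MLP's output space, plus a linear skip path."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.skip = nn.Linear(d_model, d_model, bias=False)  # skip connection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        latents = self.encoder(x)
        # TopK activation: zero out everything except the k largest latents per token.
        topk = torch.topk(latents, self.k, dim=-1)
        sparse = torch.zeros_like(latents).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse) + self.skip(x)

# Illustrative training step: regress onto activations collected from the original MLP.
x = torch.randn(8, 128, 768)        # (batch, tokens, d_model); sizes are assumptions
mlp_out = torch.randn(8, 128, 768)  # stand-in for the original MLP's outputs
transcoder = TopKTranscoder(d_model=768, d_latent=16384, k=32)
loss = nn.functional.mse_loss(transcoder(x), mlp_out)
```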
-----
Key Insights 💡:
→ Current automatically generated natural language explanations are not sufficiently specific. This lack of specificity hinders performance when rewriting model components.
→ Simply predicting whether a feature is active is insufficient. Explanations must also accurately identify contexts where a feature is *not* active. Specificity is as important as sensitivity.
→ Quantile normalization significantly improves performance: it compensates for the low specificity of explanations by correcting falsely activated latents (see the sketch after this list).
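As a rough illustration of that calibration step, the sketch below quantile-normalizes LLM-predicted activations so their empirical distribution matches the true latent activations; the function and variable names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def quantile_normalize(predicted: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map each predicted activation to the value at the same quantile
    of the reference (true) activation distribution."""
    # Rank of each prediction within the predicted distribution, as a quantile in [0, 1].
    ranks = predicted.argsort().argsort()
    quantiles = ranks / max(len(predicted) - 1, 1)
    # Look up the corresponding quantile of the true activation distribution.
    return np.quantile(reference, quantiles)

# Illustrative usage: true TopK latents are mostly zero, while raw simulator scores
# are dense and poorly scaled; normalization pulls spuriously activated latents toward zero.
true_acts = np.concatenate([np.zeros(900), np.random.exponential(2.0, 100)])
llm_scores = np.random.uniform(0, 10, size=1000)  # hypothetical raw LLM predictions
calibrated = quantile_normalize(llm_scores, true_acts)
```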
-----
Results 📊:
→ Rewriting the entire transcoder with natural language explanations yields a cross-entropy loss similar to that of a Pythia model trained on only 10-15% of the data, roughly on par with replacing the transcoder output with zeros (one way to run such comparisons is sketched after this list).
→ Randomly selecting latents for rewriting leads to worse performance than zeroing out the MLP.
→ Using quantile normalization significantly improves performance compared to not using it. Normalization makes rewritten models perform better than zeroing out components.
→ Detection scores correlate with explanation quality. Higher detection scores indicate more specific and sensitive explanations.
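For a sense of how such comparisons can be run, the snippet below swaps one MLP's output via a forward hook and records the resulting cross-entropy loss; the hook-based setup, HuggingFace-style model interface, and helper names are assumptions for illustration, not the authors' evaluation code.

```python
import torch

def loss_with_replacement(model, batch, mlp_module, replacement_fn):
    """Run the model while replacing one MLP's output, and return the
    cross-entropy loss on next-token prediction."""
    handle = mlp_module.register_forward_hook(
        lambda module, inputs, output: replacement_fn(inputs[0], output)
    )
    try:
        logits = model(batch["input_ids"]).logits
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )
    finally:
        handle.remove()
    return loss.item()

# Hypothetical interventions to compare:
#   zero ablation:             lambda x, out: torch.zeros_like(out)
#   transcoder reconstruction: lambda x, out: transcoder(x)
#   natural-language rewrite:  lambda x, out: decode_simulated_latents(x)  # assumed helper
```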