"Analyze Feature Flow to Enhance Interpretation and Steering in Language Models"
https://arxiv.org/abs/2502.03032
The paper addresses the challenge of understanding how LLMs process information across layers. Current methods often examine single layers, missing the multi-layer dynamics of feature transformation.
This paper introduces a data-free method using cosine similarity to trace Sparse Autoencoder features across layers. This creates feature flow graphs, revealing feature evolution and enabling multi-layer model steering.
-----
📌 Feature flow graphs offer a novel, data-free method to visualize and analyze concept evolution within LLMs across layers. This provides mechanistic interpretability by tracing feature transformations through network modules.
📌 Cosine similarity of Sparse Autoencoder decoder weights acts as an effective proxy for feature lineage. This enables efficient cross-layer feature mapping and circuit identification without activation data.
📌 Multi-layer steering, guided by flow graphs, demonstrates enhanced control over LLM generation. Targeted feature manipulation across layers improves thematic consistency and reduces hyperparameter sensitivity in steering.
----------
Methods Explored in this Paper 🔧:
→ The paper uses Sparse Autoencoders to extract interpretable features from different layers of an LLM.
→ It then matches features across consecutive layers by computing cosine similarity between the decoder weights of these Sparse Autoencoders (see the matching sketch after this list).
→ This matching generates "flow graphs" that visualize how features evolve as they pass through the model's layers and through its MLP and attention modules.
→ The method identifies feature origins, persistence, and transformations layer by layer, without needing additional data.
→ Finally, the paper demonstrates multi-layer model steering: manipulating sets of Sparse Autoencoder features identified through the flow graphs to control text generation (see the steering sketch below).
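A minimal sketch of the cross-layer matching step, assuming each SAE exposes its decoder as a (d_model, n_features) weight matrix; the function and tensor names here are illustrative, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def match_features(dec_prev: torch.Tensor, dec_next: torch.Tensor):
    """For every feature (column) of the layer-l SAE decoder `dec_next`,
    find its most similar feature in the layer-(l-1) decoder `dec_prev`.

    Returns (best_idx, best_sim): the top-1 predecessor index and its
    cosine similarity, one entry per layer-l feature.
    """
    a = F.normalize(dec_prev, dim=0)        # (d_model, n_prev), unit columns
    b = F.normalize(dec_next, dim=0)        # (d_model, n_next)
    sim = a.T @ b                           # (n_prev, n_next) cosine matrix
    best_sim, best_idx = sim.max(dim=0)     # top-1 predecessor per feature
    return best_idx, best_sim

# Toy usage: 16k-feature SAEs over a 768-dim residual stream.
dec_prev = torch.randn(768, 16384)
dec_next = torch.randn(768, 16384)
pred_idx, pred_sim = match_features(dec_prev, dec_next)
# High-similarity pairs become edges in the flow graph; a low max
# similarity suggests the feature is "born" at this layer.
```

Note this needs no activation data at all, only the trained SAE weights, which is what makes the method data-free.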
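And a hedged sketch of the multi-layer steering idea: adding a scaled SAE decoder direction to the residual stream at every layer along one flow-graph path. The GPT-2-style module layout and the scale value are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * direction` to the
    residual-stream output of one transformer block."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def steer_along_flow_path(model, flow_path: dict, scale: float = 4.0):
    """`flow_path` maps layer index -> a (d_model,) SAE decoder direction,
    i.e. the linked features along one path in the flow graph."""
    handles = []
    for layer_idx, direction in flow_path.items():
        block = model.transformer.h[layer_idx]  # assumed GPT-2-style layout
        handles.append(block.register_forward_hook(
            make_steering_hook(direction, scale)))
    return handles  # call .remove() on each handle to undo the steering
```

Steering the same concept at several layers at once is what the paper credits with more consistent topic control than a single-layer intervention.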
-----
Key Insights 💡:
→ Feature flow graphs reveal how features are born, refined, and transformed across different layers of an LLM.
→ These graphs expose internal computational pathways. They show how MLP and attention modules contribute to feature evolution.
→ Multi-layer steering, guided by flow graphs, offers improved control over model behavior. This allows for targeted thematic manipulation in generated text.
-----
Results 📊:
→ Cosine similarity of decoder weights is a good proxy for feature activation correlation, validating the feature matching approach.
→ In deactivation experiments, identifying a feature's predecessor by top-1 decoder similarity is substantially more effective than random selection (see the sketch after this list).
→ Multi-layer intervention for model steering outperforms single-layer methods, achieving better topic control and text quality.
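A sketch of what such a deactivation check could look like; all helper names here (run_to_layer, the SAE encode/decode methods, patch_prev) are hypothetical stand-ins for the actual experimental harness:

```python
import torch

def downstream_activation_after_ablation(run_to_layer, sae_prev, sae_next,
                                         ablate_idx: int, feat_next: int,
                                         prompt: str) -> float:
    """Zero one layer-(l-1) SAE feature, re-run the model from there, and
    report how strongly the candidate successor feature at layer l still
    fires. Comparing `ablate_idx` = top-1 match vs. a random index is the
    gist of the validation."""
    h_prev = run_to_layer(prompt, layer="l-1")   # residual stream at l-1
    z = sae_prev.encode(h_prev)                  # sparse feature activations
    z[..., ablate_idx] = 0.0                     # deactivate one feature
    h_patched = sae_prev.decode(z)               # reconstruct the stream
    h_next = run_to_layer(prompt, layer="l", patch_prev=h_patched)
    return sae_next.encode(h_next)[..., feat_next].mean().item()
```

If ablating the top-1-matched predecessor silences the downstream feature while ablating a random feature does not, that supports the claimed lineage between the two.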