"Analyze Feature Flow to Enhance Interpretation and Steering in Language Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.03032
The paper addresses the challenge of understanding how LLMs process information across layers. Current methods often examine single layers, missing the multi-layer dynamics of feature transformation.
This paper introduces a data-free method using cosine similarity to trace Sparse Autoencoder features across layers. This creates feature flow graphs, revealing feature evolution and enabling multi-layer model steering.
-----
📌 Feature flow graphs offer a novel, data-free method to visualize and analyze concept evolution within LLMs across layers. This provides mechanistic interpretability by tracing feature transformations through network modules.
📌 Cosine similarity of Sparse Autoencoder decoder weights acts as an effective proxy for feature lineage. This enables efficient cross-layer feature mapping and circuit identification without activation data.
📌 Multi-layer steering, guided by flow graphs, demonstrates enhanced control over LLM generation. Targeted feature manipulation across layers improves thematic consistency and reduces hyperparameter sensitivity in steering.
----------
Methods Explored in this Paper 🧠:
→ The paper uses Sparse Autoencoders (SAEs) to extract interpretable features from different layers of an LLM.
→ It then computes cosine similarity between the decoder weights of these SAEs to track features across consecutive layers.
→ This tracking generates "flow graphs" that visualize how features evolve as they pass through the model's layers and modules (MLP and attention).
→ The method identifies feature origins, persistence, and transformations layer by layer, without needing additional data.
→ Finally, the paper demonstrates multi-layer model steering: manipulating sets of SAE features identified through the flow graphs to control text generation.
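The cross-layer matching step can be sketched as follows. This is a minimal illustration with synthetic decoder matrices, not the paper's code: all names, shapes, and the feature counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SAE decoder weight matrices at two consecutive layers:
# each row is a feature's decoder direction in the residual-stream space.
W_prev = rng.standard_normal((512, 64))  # 512 features at layer L-1
W_next = rng.standard_normal((512, 64))  # 512 features at layer L

def top1_predecessors(W_prev, W_next):
    """For each feature at layer L, find its most similar feature at
    layer L-1 by cosine similarity of decoder directions."""
    A = W_prev / np.linalg.norm(W_prev, axis=1, keepdims=True)
    B = W_next / np.linalg.norm(W_next, axis=1, keepdims=True)
    sims = B @ A.T                  # (features at L, features at L-1)
    idx = sims.argmax(axis=1)       # top-1 predecessor index
    score = sims.max(axis=1)        # its cosine similarity
    return idx, score

idx, score = top1_predecessors(W_prev, W_next)
```

Chaining these top-1 matches across all consecutive layer pairs yields the edges of a flow graph, with no activation data required.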
-----
Key Insights 💡:
→ Feature flow graphs reveal how features are born, refined, and transformed across different layers of an LLM.
→ These graphs expose internal computational pathways, showing how MLP and attention modules contribute to feature evolution.
→ Multi-layer steering, guided by flow graphs, offers improved control over model behavior, allowing targeted thematic manipulation in generated text.
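The multi-layer steering idea amounts to adding a scaled, unit-norm SAE decoder direction to the residual stream at each layer where the flow graph tracks the target feature. A toy sketch with random stand-ins; the layer indices, `alpha`, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
layers = (4, 8, 12)

# Hypothetical residual-stream states at the layers of interest.
hidden = {l: rng.standard_normal(d_model) for l in layers}

# One decoder direction per layer, e.g. the same concept's feature as
# traced through the flow graph (random stand-ins here).
feature_dirs = {l: rng.standard_normal(d_model) for l in layers}

def steer(hidden, feature_dirs, alpha=4.0):
    """Add a scaled, unit-norm feature direction at each chosen layer."""
    out = {}
    for layer, h in hidden.items():
        d = feature_dirs[layer]
        out[layer] = h + alpha * d / np.linalg.norm(d)
    return out

steered = steer(hidden, feature_dirs)
```

Spreading the intervention over several layers, rather than one, is what the paper reports as reducing sensitivity to the steering strength `alpha`.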
-----
Results 📊:
→ Cosine similarity of decoder weights correlates well with feature activation similarity, validating the feature-matching approach.
→ Top-1 similarity for predecessor identification proves effective in deactivation experiments, clearly outperforming random selection.
→ Multi-layer intervention for model steering outperforms single-layer methods, achieving better topic control and text quality.