The paper's technique reveals a neural network's internal logic by tracking how features change between layers.
The network becomes a glass box: you can watch neural features shape-shift as they pass through the layers, greatly enhancing interpretability.
📚 https://arxiv.org/abs/2410.07656
Original Problem 🔍:
Understanding feature evolution across layers in deep neural networks, particularly in LLMs, is challenging due to polysemanticity and feature superposition.
-----
Solution in this Paper 🧠:
• Introduces SAE Match, a data-free method for aligning Sparse Autoencoder (SAE) features across layers
• Utilizes parameter folding to incorporate activation thresholds into encoder and decoder weights
• Matches features by minimizing mean squared error between folded SAE parameters (see the sketch after this list)
• Enables analysis of feature evolution throughout model depth
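To make the matching step concrete, here is a minimal NumPy sketch, not the authors' code: the function names `fold_parameters` and `match_features`, the per-feature thresholds `theta`, and the greedy nearest-neighbor assignment are illustrative assumptions; the paper's exact folding formula and assignment strategy may differ.

```python
import numpy as np

def fold_parameters(W_dec: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Fold per-feature activation thresholds into the decoder weights so that
    feature vectors from different layers live on comparable scales.
    W_dec: (num_features, d_model) decoder matrix; theta: (num_features,) thresholds."""
    return W_dec * theta[:, None]

def match_features(W_dec_a, theta_a, W_dec_b, theta_b):
    """For each folded feature of layer A, return the index of the layer-B feature
    with the smallest mean squared error between folded decoder rows."""
    A = fold_parameters(W_dec_a, theta_a)
    B = fold_parameters(W_dec_b, theta_b)
    # Pairwise squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return d2.argmin(axis=1)
```

Note that this is what makes the method "data-free": only the SAE weights of the two layers are compared, no model activations are required.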
-----
Key Insights from this Paper 💡:
• Features persist over several layers in the network
• Initial layers (up to 10th) show increased polysemanticity
• Optimal sparsity level (mean l0-norm around 70) identified for effective feature matching
• Parameter folding improves matching by accounting for differences in feature scales
-----
Results 📊:
• Folded matching improves feature matching quality compared to unfolded variant
• Matching Score and LLM evaluation show similar patterns of feature similarity across layers
• Change in Cross-Entropy Loss (∆L) decreases with higher mean l0-norm values
• Explained Variance peaks at mean l0-norm ≈ 70, suggesting optimal sparsity for feature matching
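For readers unfamiliar with these metrics, here is a minimal sketch of how the two evaluation quantities can be read; the paper's exact definitions may differ in detail, and the names below are illustrative.

```python
import numpy as np

def explained_variance(x: np.ndarray, x_hat: np.ndarray) -> float:
    # Fraction of activation variance captured by the reconstruction built
    # from matched features; 1.0 means a perfect reconstruction.
    return 1.0 - (x - x_hat).var() / x.var()

# ∆L is the increase in the model's cross-entropy loss when a layer's SAE features
# are replaced by their matched counterparts from another layer:
#   delta_L = loss_with_matched_features - loss_original
```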