
Mechanistic Permutability: Match Features Across Layers

Generated this podcast with Google's Illuminate.

The paper's technique reveals a neural network's internal logic by tracking how features change between layers.

The network becomes like a glass box: you can spy on neural features as they shape-shift through the layers, greatly enhancing interpretability.

📚 https://arxiv.org/abs/2410.07656

Original Problem 🔍:

Understanding feature evolution across layers in deep neural networks, particularly in LLMs, is challenging due to polysemanticity and feature superposition.

-----

Solution in this Paper 🧠:

• Introduces SAE Match, a data-free method for aligning Sparse Autoencoder (SAE) features across layers

• Utilizes parameter folding to incorporate activation thresholds into encoder and decoder weights

• Matches features by minimizing mean squared error between folded SAE parameters (see the sketch after this list)

• Enables analysis of feature evolution throughout model depth
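To make the mechanics concrete, here is a minimal sketch of the folding-and-matching idea, assuming JumpReLU-style SAEs. The function names, tensor shapes, and use of the Hungarian algorithm are illustrative assumptions, not the paper's code:

```python
# Sketch of SAE Match-style feature alignment (illustrative assumptions,
# not the paper's implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def fold_parameters(W_enc, b_enc, W_dec, theta):
    """Fold per-feature JumpReLU thresholds theta into the weights.

    W_enc: (d_model, d_sae) encoder weights
    b_enc: (d_sae,)         encoder bias
    W_dec: (d_sae, d_model) decoder weights
    theta: (d_sae,)         activation thresholds
    Scaling encoder columns by 1/theta and decoder rows by theta leaves
    reconstructions unchanged while putting features on a common scale.
    """
    W_enc_f = W_enc / theta            # rescale encoder columns
    b_enc_f = b_enc / theta
    W_dec_f = W_dec * theta[:, None]   # rescale decoder rows
    return W_enc_f, b_enc_f, W_dec_f

def match_features(W_dec_a, W_dec_b):
    """Permute layer-B features to best match layer-A features.

    The cost of pairing feature i with feature j is the MSE between
    their folded decoder vectors; the optimal one-to-one assignment
    is solved with the Hungarian algorithm.
    """
    diff = W_dec_a[:, None, :] - W_dec_b[None, :, :]  # (d_sae, d_sae, d_model)
    cost = (diff ** 2).mean(axis=-1)
    _, cols = linear_sum_assignment(cost)
    return cols  # cols[i] = layer-B feature matched to layer-A feature i
```
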

-----

Key Insights from this Paper 💡:

• Features persist over several layers in the network

• Initial layers (up to the 10th) show increased polysemanticity

• An optimal sparsity level (mean l0-norm ≈ 70) is identified for effective feature matching (the metric is sketched after this list)

• Parameter folding improves matching by accounting for differences in feature scales
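For context, the mean l0-norm is just the average number of features active per token. A one-line sketch (the `acts` matrix is hypothetical):

```python
import numpy as np

def mean_l0(acts: np.ndarray) -> float:
    """Average count of nonzero SAE features per token.

    acts: hypothetical (n_tokens, d_sae) activation matrix.
    """
    return float((acts != 0).sum(axis=1).mean())
```
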

-----

Results 📊:

• Folded matching improves feature matching quality compared to the unfolded variant

• Matching Score and LLM evaluation show similar patterns of feature similarity across layers

• Change in Cross-Entropy Loss (∆L) decreases with higher mean l0-norm values

• Explained Variance peaks at mean l0-norm ≈ 70, suggesting an optimal sparsity for feature matching (both metrics are sketched below)
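
A sketch of the two evaluation metrics as commonly defined; whether these match the paper's exact formulas is an assumption:

```python
import numpy as np

def explained_variance(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Fraction of activation variance captured by the SAE reconstruction."""
    residual = x - x_hat
    return 1.0 - residual.var() / x.var()

def delta_loss(loss_with_substitution: float, loss_original: float) -> float:
    """Increase in cross-entropy loss (∆L) when layer-(l+1) activations
    are replaced by reconstructions built from matched layer-l features."""
    return loss_with_substitution - loss_original
```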
