The paper's technique reveals a neural network's internal logic by tracking how features change between layers.
The network becomes a glass box: you can watch neural features shape-shift as they pass through the layers, greatly enhancing interpretability.
📚 https://arxiv.org/abs/2410.07656
Original Problem 🔍:
Understanding feature evolution across layers in deep neural networks, particularly in LLMs, is challenging due to polysemanticity and feature superposition.
-----
Solution in this Paper 🧠:
• Introduces SAE Match, a data-free method for aligning Sparse Autoencoder (SAE) features across layers
• Utilizes parameter folding to incorporate activation thresholds into encoder and decoder weights
• Matches features by minimizing mean squared error between folded SAE parameters (see the sketch after this list)
• Enables analysis of feature evolution throughout model depth
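To make the matching step concrete, here is a minimal NumPy sketch, not the authors' code: the function names `fold_parameters` and `match_features`, the per-feature thresholds `theta`, and the greedy nearest-neighbor assignment are illustrative assumptions; the paper's exact folding formula and assignment strategy may differ.

```python
import numpy as np

def fold_parameters(W_dec: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Fold per-feature activation thresholds into the decoder weights so that
    feature vectors from different layers live on comparable scales.
    W_dec: (num_features, d_model) decoder matrix; theta: (num_features,) thresholds."""
    return W_dec * theta[:, None]

def match_features(W_dec_a, theta_a, W_dec_b, theta_b):
    """For each folded feature of layer A, return the index of the layer-B feature
    with the smallest mean squared error between folded decoder rows."""
    A = fold_parameters(W_dec_a, theta_a)
    B = fold_parameters(W_dec_b, theta_b)
    # Pairwise squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return d2.argmin(axis=1)
```

Note that this is what makes the method "data-free": only the SAE weights of the two layers are compared, no model activations are required.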
-----
Key Insights from this Paper 💡:
• Features persist over several layers in the network
• Initial layers (up to 10th) show increased polysemanticity
• Optimal sparsity level (mean l0-norm around 70) identified for effective feature matching
• Parameter folding improves matching by accounting for differences in feature scales
-----
Results 📊:
• Folded matching improves feature matching quality compared to unfolded variant
• Matching Score and LLM evaluation show similar patterns of feature similarity across layers
• Change in Cross-Entropy Loss (∆L) decreases with higher mean l0-norm values
• Explained Variance peaks at mean l0-norm ≈ 70, suggesting optimal sparsity for feature matching
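For readers unfamiliar with these metrics, here is a minimal sketch of how the two evaluation quantities can be read; the paper's exact definitions may differ in detail, and the names below are illustrative.

```python
import numpy as np

def explained_variance(x: np.ndarray, x_hat: np.ndarray) -> float:
    # Fraction of activation variance captured by the reconstruction built
    # from matched features; 1.0 means a perfect reconstruction.
    return 1.0 - (x - x_hat).var() / x.var()

# ∆L is the increase in the model's cross-entropy loss when a layer's SAE features
# are replaced by their matched counterparts from another layer:
#   delta_L = loss_with_matched_features - loss_original
```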