
"The Geometry of Concepts: Sparse Autoencoder Feature Structure"

The podcast on this paper was generated with Google's Illuminate.

AI doesn't just learn words - it builds entire geometric worlds.

Turns out, AI thinks in shapes - from tiny crystals to whole galaxies.

LLMs' internal concepts form geometric patterns similar to the human brain's organization.

And LLMs organize knowledge in a three-tiered architecture spanning atomic, brain, and galaxy scales.

📚 https://arxiv.org/abs/2410.19750

🎯 Original Problem:

Sparse autoencoders (SAEs) have discovered interpretable features in LLMs, but we don't understand how these features are organized in the high-dimensional space and what patterns they form.

-----

🔧 Solution in this Paper:

→ Analyzed SAE feature space at three distinct scales:

- Atomic scale: Studied parallelogram/trapezoid structures (like man:woman::king:queen)

- Brain scale: Identified functional "lobes" where similar features cluster

- Galaxy scale: Examined large-scale point cloud structure

→ Used Linear Discriminant Analysis (LDA) to remove "distractor features"

→ Applied multiple statistical tests and metrics:

- Phi coefficient for functional clustering

- k-NN entropy estimation

- Mutual information metrics
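To make the co-occurrence idea concrete, here is a minimal sketch of the phi coefficient between two binary "feature fired in this document" indicators. The toy firing patterns are illustrative assumptions, not real SAE activations.

```python
from math import sqrt

def phi_coefficient(a, b):
    """Phi coefficient between two binary sequences (1 = feature fired)."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # both fired
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)  # neither fired
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Two hypothetical features that mostly fire on the same documents
f1 = [1, 1, 0, 0, 1, 0, 1, 0]
f2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(phi_coefficient(f1, f2), 3))  # → 0.775
```

A high phi for a feature pair signals functional co-occurrence; the paper's finding is that such pairs also tend to be geometric neighbors.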

-----

💡 Key Insights:

→ SAE features form crystal-like structures at small scales, but these are hidden by irrelevant dimensions

→ Features that fire together in documents are geometrically co-located, forming brain-like functional lobes

→ Middle layers show steeper power-law slopes in the point cloud's eigenvalue spectrum (-0.47) compared to early/late layers (-0.24/-0.25)

→ Found distinct lobes for code/math, short messages/dialogue, and scientific papers
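The power-law slopes above come from fitting log(eigenvalue) against log(rank). A minimal sketch of that fit, using a synthetic spectrum (the exponent -0.47 is just the paper's reported middle-layer value, reused here as test data):

```python
from math import log

def power_law_slope(eigenvalues):
    """Least-squares slope of log(eigenvalue) vs. log(rank)."""
    xs = [log(i + 1) for i in range(len(eigenvalues))]
    ys = [log(v) for v in eigenvalues]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A synthetic spectrum lambda_k ~ k^(-0.47) recovers its exponent
spectrum = [(k + 1) ** -0.47 for k in range(100)]
print(round(power_law_slope(spectrum), 2))  # → -0.47
```

A more negative slope means variance concentrates in fewer directions, matching the insight that middle layers compress concepts into a lower-dimensional structure.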

-----

📊 Results:

→ LDA significantly improved parallelogram/trapezoid structure quality

→ Phi coefficient showed best correspondence between functional and geometric clustering

→ Middle layers demonstrated reduced clustering entropy, suggesting more concentrated feature representations

→ Statistical significance: 954σ for mutual information, 74σ for logistic regression tests
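The parallelogram test behind the atomic-scale results can be sketched with toy vectors: after projecting out distractor dimensions, king - man + woman should land near queen. The 2-D "gender"/"royalty" axes below are an illustrative assumption, not real SAE feature directions.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy embeddings on a gender axis (first dim) and royalty axis (second dim)
man   = [1.0, 0.0]
woman = [-1.0, 0.0]
king  = [1.0, 1.0]
queen = [-1.0, 1.0]

# Parallelogram test: king - man + woman should be close to queen
candidate = [k - m + w for k, m, w in zip(king, man, woman)]
print(round(cosine(candidate, queen), 6))  # → 1.0
```

In the paper, this structure only emerges after LDA removes irrelevant dimensions; with raw features, the distractor components drown out the parallelogram.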
