
"Steering Large Language Models with Feature Guided Activation Additions"

Below podcast is generated with Google's Illuminate.

Steer LLMs effectively with FGAA (Feature Guided Activation Additions) and Sparse Autoencoders.

FGAA enhances activation steering in LLMs by optimizing steering vectors in a Sparse Autoencoder's latent space. This offers more precise control over model behavior while maintaining coherence.

Paper - https://arxiv.org/abs/2501.09929v1

Original Problem 🤔:

→ Existing LLM steering methods lack precision and interpretability.

→ This leads to unintended model behavior and poor output quality.

Solution in this Paper 💡:

→ FGAA leverages Sparse Autoencoders (SAEs) to extract interpretable features.

→ It computes contrastive differences in the SAE activation space between desired and undesired examples.

→ It filters these differences, retaining the most relevant features.

→ Finally, it uses linear effect approximators to optimize steering vectors for targeted behavior modification.
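The pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the SAE weights and activations below are random placeholders (a real FGAA setup uses an SAE trained on the LLM's residual stream), and the linear-effect-approximator refinement step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained SAE; dimensions and weights are illustrative only.
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(x):
    # ReLU encoder: maps residual-stream activations to sparse SAE features.
    return np.maximum(x @ W_enc, 0.0)

# Contrastive examples: residual activations on desired vs. undesired prompts
# (random placeholders here, real activations in practice).
pos_acts = rng.normal(size=(32, d_model))  # desired behavior
neg_acts = rng.normal(size=(32, d_model))  # undesired behavior

# Step 1: contrastive difference of mean feature activations in SAE space.
diff = sae_encode(pos_acts).mean(axis=0) - sae_encode(neg_acts).mean(axis=0)

# Step 2: filter, keeping only the k most relevant features by magnitude.
k = 8
mask = np.zeros_like(diff)
top = np.argsort(np.abs(diff))[-k:]
mask[top] = diff[top]

# Step 3: decode the sparse feature vector back to the residual stream.
# The paper additionally optimizes this vector with a linear effect
# approximator, which this sketch skips.
steering_vector = mask @ W_dec
```

At inference time, the resulting `steering_vector` would be added to the residual stream at the chosen layer to push generations toward the desired behavior.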

Key Insights from this Paper 🔎:

→ Operating in the SAE latent space improves precision and coherence of steered outputs.

→ Automatic feature selection simplifies steering vector construction and improves interpretability.

→ A trade-off exists between steering strength and general model capabilities.

Results 💯:

→ FGAA outperforms existing steering methods (CAA, SAE decoder steering, SAE-TS) on Gemma-2-2B in 8 out of 9 tasks.

→ Average Behavioral-Coherence Score (BCS) on Gemma-2-2B increased from 0.1404 (SAE) to 0.4702 (FGAA).

→ FGAA also proved effective on Gemma-2-9B, though the improvement there was smaller (BCS increased from 0.2267 to 0.3979).
