Steer LLMs effectively with FGAA (Feature Guided Activation Additions) and Sparse Autoencoders.
FGAA enhances activation steering in LLMs by optimizing steering vectors in a Sparse Autoencoder's latent space. This offers more precise control over model behavior while maintaining coherence.
Paper - https://arxiv.org/abs/2501.09929v1
Original Problem 🤔:
→ Existing LLM steering methods lack precision and interpretability.
→ This leads to unintended model behavior and poor output quality.
Solution in this Paper 💡:
→ FGAA leverages Sparse Autoencoders (SAEs) to extract interpretable features.
→ It computes contrastive differences in the SAE activation space between desired and undesired examples.
→ It filters these differences, retaining the most relevant features.
→ Finally, it uses linear effect approximators to optimize steering vectors for targeted behavior modification (sketched below).
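A rough sketch of that pipeline in PyTorch. This is illustrative only: `sae.encode`, `effect_matrix`, and `fgaa_steering_vector` are assumed names rather than the paper's code, and the linear effect approximator step is simplified here to a single precomputed matrix solved by least squares.

```python
import torch

def fgaa_steering_vector(pos_acts, neg_acts, sae, effect_matrix, top_k=32):
    """Build a steering vector from desired (pos) vs. undesired (neg) activations.

    pos_acts, neg_acts: [n_examples, d_model] residual-stream activations
    sae: a sparse autoencoder with an encode() method for that layer
    effect_matrix: [d_sae, d_model] linear approximation of how a residual-stream
        steering vector changes each SAE feature (assumed precomputed)
    """
    # 1. Move both example sets into the SAE's interpretable feature space.
    pos_features = sae.encode(pos_acts).mean(dim=0)   # [d_sae]
    neg_features = sae.encode(neg_acts).mean(dim=0)   # [d_sae]

    # 2. Contrastive difference: features that separate desired from undesired text.
    diff = pos_features - neg_features                # [d_sae]

    # 3. Filter: keep only the top-k strongest features, zero out the rest.
    target = torch.zeros_like(diff)
    top_idx = diff.abs().topk(top_k).indices
    target[top_idx] = diff[top_idx]

    # 4. Find a residual-stream vector whose predicted feature-level effect
    #    matches the filtered target (least squares against the linear approximator).
    steering_vec = torch.linalg.lstsq(effect_matrix, target.unsqueeze(-1)).solution.squeeze(-1)
    return steering_vec / steering_vec.norm()
```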
Key Insights from this Paper 🔎:
→ Operating in the SAE latent space improves precision and coherence of steered outputs.
→ Automatic feature selection simplifies steering vector construction and improves interpretability.
→ A trade-off exists between steering strength and general model capabilities, as the scale factor in the sketch below illustrates.
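A minimal sketch of applying a steering vector at inference time, assuming a Hugging Face causal LM whose decoder layers live at `model.model.layers` (as in Gemma-2 checkpoints); `steering_vec` would come from the previous sketch, and `scale` is the steering strength that drives the trade-off above.

```python
def add_steering_hook(model, steering_vec, layer_idx=12, scale=4.0):
    """Register a forward hook that adds scale * steering_vec to the residual
    stream at layer_idx. Larger scale steers harder but degrades coherence."""
    def hook(module, inputs, output):
        hidden = output[0]  # decoder layers return (hidden_states, ...) tuples
        hidden = hidden + scale * steering_vec.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:])
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: handle = add_steering_hook(model, steering_vec)
#        ... model.generate(...) ...
#        handle.remove()
```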
Results 💯:
→ FGAA outperforms existing steering methods (CAA, SAE decoder steering, SAE-TS) on Gemma-2-2B in 8 out of 9 tasks.
→ Average Behavioral-Coherence Score (BCS) on Gemma-2-2B increased from 0.1404 (SAE) to 0.4702 (FGAA).
→ FGAA was also effective on Gemma-2-9B, though the improvement was smaller (BCS increased from 0.2267 to 0.3979).