Discussion about this post

Rainbow Roxy

Hey, great read as always. This SGTM research, building on your prior deep dives into LLM control, brilliantly frames the critical "not knowing" safety debate.

Neural Foundry

Great roundup on SGTM, which feels like a meaningfully different approach compared to just filtering training data or doing post-hoc unlearning. The 7x resistance to adversarial fine-tuning is pretty noteworthy, since it suggests the "forgetting" is more deeply baked into the weight structure. What's interesting is whether this could scale to larger models while the compute overhead stays manageable, especially when you're dealing with far more nuanced categories than just "biology knowledge."
