Your LLM text has alternate universes - here's how to peek into them
This paper introduces a new way to understand how Language Models think by examining "what-if" scenarios. Think of it as asking, "If I changed X in the model, how would this exact output have changed?"
The researchers created a framework that treats Language Models as special mathematical structures called Generalized Structural-equation Models. They use something called the Gumbel-max trick to model both the original text and its "what-if" versions simultaneously.
https://arxiv.org/abs/2411.07180
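To make the core trick concrete, here is a minimal illustrative sketch (not the paper's code; the logits below are made up): sampling a token from softmax(logits) is distributionally equivalent to adding independent Gumbel(0,1) noise to the logits and taking the argmax, which cleanly separates the model's deterministic logit computation from the randomness of sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])  # hypothetical next-token logits

# Ordinary sampling: draw a token from softmax(logits).
probs = np.exp(logits - logits.max())
probs /= probs.sum()
token_softmax = rng.choice(len(logits), p=probs)

# Gumbel-max trick: same distribution over tokens, but the logit
# computation stays deterministic and all randomness lives in the noise.
gumbel_noise = rng.gumbel(size=logits.shape)
token_gumbel = int(np.argmax(logits + gumbel_noise))
```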
🤔 Original Problem:
We can intervene on LLMs to modify their behavior, but we can't precisely determine how a specific text would have looked after the intervention. Current methods lack true counterfactual reasoning capabilities.
-----
🛠️ Solution in this Paper:
→ Reformulates LLMs as Generalized Structural-equation Models, using the Gumbel-max trick to separate the deterministic logit computation from the sampling noise
→ Develops an algorithm based on hindsight Gumbel sampling to infer the latent noise variables behind an observed string and generate its counterfactuals (see the sketch after this list)
→ Implements a framework that models the joint distribution over original strings and their counterfactuals by reusing the same sampling noise under the intervened model
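Below is a simplified sketch of how these pieces fit together. It is an assumption-laden toy, not the authors' implementation: the `orig_logits`/`interv_logits` lambdas and the 5-token vocabulary are placeholders for real LM forward passes. Hindsight Gumbel sampling infers noise consistent with the observed token; the counterfactual is generated by replaying that same noise through the intervened model.

```python
import numpy as np
from scipy.special import logsumexp

def hindsight_gumbels(logits, observed_token, rng):
    """Infer Gumbel noise consistent with observed_token being the argmax of
    (logits + noise), via top-down sampling with truncated Gumbels."""
    # The max of the perturbed logits is Gumbel-distributed with location logsumexp(logits).
    max_val = logsumexp(logits) + rng.gumbel()
    perturbed = np.empty_like(logits)
    perturbed[observed_token] = max_val
    for i in range(len(logits)):
        if i == observed_token:
            continue
        # Remaining perturbed logits are Gumbel(logits[i]) truncated to lie below max_val.
        g = logits[i] + rng.gumbel()
        perturbed[i] = -np.log(np.exp(-max_val) + np.exp(-g))
    return perturbed - logits  # latent noise implied by the observation

def counterfactual_generation(orig_logits, interv_logits, observed_tokens, rng):
    """For each position, infer the noise that produced the observed token under
    the original model, then reuse that exact noise with the intervened model."""
    prefix_orig, prefix_cf, cf_tokens = [], [], []
    for tok in observed_tokens:
        noise = hindsight_gumbels(orig_logits(prefix_orig), tok, rng)
        cf_tok = int(np.argmax(interv_logits(prefix_cf) + noise))
        cf_tokens.append(cf_tok)
        prefix_orig.append(tok)
        prefix_cf.append(cf_tok)
    return cf_tokens

# Toy usage: constant logits over a 5-token vocabulary stand in for real models.
rng = np.random.default_rng(0)
orig_logits = lambda prefix: np.array([2.0, 1.0, 0.0, -1.0, -2.0])
interv_logits = lambda prefix: np.array([0.0, 2.5, 1.0, -1.0, -2.0])
print(counterfactual_generation(orig_logits, interv_logits, [0, 0, 1], rng))
```

Replaying the inferred noise through the intervened model is what makes the output a counterfactual of the observed string, rather than just an independent sample from the intervened model.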
-----
💡 Key Insights:
→ Common intervention techniques often have unintended side effects beyond their targeted changes
→ Even simple interventions (like steering gender-related outputs) can unexpectedly alter completions that have nothing to do with gender, showing how interconnected different aspects of language model behavior really are
→ Even interventions that modify only a small subset of parameters can fail to achieve their targeted effect
-----
📊 Results:
→ Tested on GPT2-XL and Llama-3-8B models
→ Evaluated using MEMIT, linear steering, and instruction tuning interventions
→ Successfully generated meaningful counterfactuals while revealing unintended side-effects of common intervention techniques