
"Counterfactual Generation from Language Models"

The podcast on this paper is generated with Google's Illuminate.

Your LLM text has alternate universes - here's how to peek into them

This paper introduces a new way to understand how Language Models think by examining "what-if" scenarios. Think of it like asking: "If I changed X in the model, how would this specific output actually have changed?"

The researchers created a framework that treats Language Models as special mathematical structures called Generalized Structural-equation Models. They use something called the Gumbel-max trick to model both the original text and its "what-if" versions simultaneously.
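
To make that concrete, here is a minimal sketch of the Gumbel-max trick in NumPy; the toy logits and the seed are illustrative choices, not values from the paper:

```python
import numpy as np

# Gumbel-max trick: adding independent Gumbel(0,1) noise to the logits
# and taking the argmax samples a token with exactly the softmax
# probabilities, while cleanly separating the deterministic part
# (logits) from the randomness (noise).
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])      # deterministic logit computation
noise = rng.gumbel(size=logits.shape)   # exogenous sampling noise
token = int(np.argmax(logits + noise))  # equivalent to softmax sampling
```

Because the noise is separated out, the same noise vector can be replayed under modified logits, which is what makes a counterfactual version of the same sampling step well defined.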

https://arxiv.org/abs/2411.07180

🤔 Original Problem:

We can intervene on LLMs to modify their behavior, but existing techniques can't tell us precisely how a specific generated text would have looked under the intervention. In other words, current methods answer interventional questions, not true counterfactual ones.

-----

🛠️ Solution in this Paper:

→ Reformulates LLMs as Generalized Structural-equation Models, using the Gumbel-max trick to separate the deterministic logit computation from the stochastic sampling step

→ Develops an algorithm based on hindsight Gumbel sampling that infers the latent noise variables behind an observed string and then generates its counterfactuals (sketched in code after this list)

→ Implements a framework that models the joint distribution over original strings and their counterfactuals by reusing the same sampling noise
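
Here is a minimal sketch of the hindsight-sampling step, assuming standard Gumbel(0,1) noise per token as above; the function name, toy logits, and use of SciPy are illustrative choices, not the paper's reference implementation. It relies on two standard facts: the maximum of logits[i] + Gumbel(0,1) noise is Gumbel(logsumexp(logits))-distributed independently of which index wins, and the losing coordinates are Gumbels truncated below that maximum:

```python
import numpy as np
from scipy.special import logsumexp

def hindsight_gumbels(logits, observed_token, rng):
    """Sample Gumbel noise from its posterior, given that
    argmax(logits + noise) equals the observed token."""
    # The max of logits[i] + Gumbel(0,1) is Gumbel(logsumexp(logits)),
    # independently of which index attains it.
    top = rng.gumbel(loc=logsumexp(logits))
    # The other totals are Gumbel(logits[i]) truncated above at `top`
    # (inverse-CDF form of the truncated Gumbel).
    fresh = rng.gumbel(loc=logits)
    totals = -np.logaddexp(-fresh, -top)
    totals[observed_token] = top
    return totals - logits  # exogenous noise, reusable across models

# Counterfactual step: infer noise under the factual logits, then
# replay it under the intervened model's logits (toy numbers).
rng = np.random.default_rng(0)
logits = np.array([1.5, 0.3, -0.7])     # factual model, one token step
eps = hindsight_gumbels(logits, observed_token=0, rng=rng)
logits_cf = np.array([0.2, 1.1, -0.5])  # same step after intervention
cf_token = int(np.argmax(logits_cf + eps))
```

Running hindsight_gumbels at every position of the observed string (feeding the factual tokens back in) recovers a full noise sequence; replaying that sequence through the intervened model then yields the counterfactual string.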

-----

💡 Key Insights:

→ Common intervention techniques often have unintended side effects beyond their targeted changes

→ Even simple interventions (like steering gender-related outputs) can unexpectedly change completions that have nothing to do with gender, showing how interconnected the different aspects of a language model's behavior really are

→ Even interventions that modify only a small subset of parameters can fail to produce precisely targeted effects

-----

📊 Results:

→ Tested on GPT2-XL and LLaMA3-8b models

→ Evaluated using MEMIT, linear steering, and instruction tuning interventions

→ Successfully generated meaningful counterfactuals while revealing unintended side effects of common intervention techniques
