LLMs can predict their own mistakes by answering simple yes/no questions about their responses
This paper introduces a method to predict when LLMs will make mistakes by asking them follow-up questions about their own answers, without needing access to their internal workings.
-----
https://arxiv.org/abs/2501.01558
🤔 Original Problem:
→ Current methods to understand LLM behavior require access to internal model representations, which isn't possible with closed-source API-only models like GPT-4
→ Need reliable ways to predict when these black-box models might fail
-----
🔍 Solution in this Paper:
→ Introduces QueRE (Question Representation Elicitation), which asks LLMs follow-up questions about their outputs
→ Uses probability distributions of yes/no responses to these questions as features
→ Combines these with pre/post confidence scores and answer distribution probabilities
→ Trains simple linear predictors on these features to forecast model behavior (see the sketch below this list)
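Here is a minimal Python sketch of that pipeline. The specific elicitation questions, the `get_yes_probability` wrapper, and the choice of logistic regression as the linear predictor are illustrative assumptions about how one might implement the idea, not the paper's released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative follow-up questions; the paper uses a larger, fixed set.
ELICITATION_QUESTIONS = [
    "Are you confident in your answer?",
    "Is your answer likely to be correct?",
    "Did you find this question difficult?",
]

def get_yes_probability(question: str, answer: str, follow_up: str) -> float:
    """Placeholder for a black-box API call: return P('yes') for a follow-up
    question asked about the model's answer to the original question."""
    raise NotImplementedError("wrap your LLM API here")

def quere_features(question: str, answer: str,
                   pre_conf: float, post_conf: float,
                   answer_logprobs: list) -> np.ndarray:
    """Concatenate yes-probabilities with pre/post confidence scores
    and the answer's token log-probabilities into one feature vector."""
    yes_probs = [get_yes_probability(question, answer, q) for q in ELICITATION_QUESTIONS]
    return np.array(yes_probs + [pre_conf, post_conf] + list(answer_logprobs))

# A simple linear probe (logistic regression here, as an assumption) is trained
# on these features to predict whether the model's answer was correct.
# Synthetic data stands in for features built with quere_features on a labeled QA set.
rng = np.random.default_rng(0)
X = rng.random((200, len(ELICITATION_QUESTIONS) + 2 + 5))  # 200 examples, synthetic features
y = (X[:, 0] > 0.5).astype(int)                            # synthetic correctness labels
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict_proba(X[:3])[:, 1])                    # predicted P(model is correct)
```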
-----
💡 Key Insights:
→ Even random sequences of text, used as follow-up prompts, yield features that carry useful information about model behavior
→ Performance improves with more elicitation questions but plateaus after ~50 questions
→ The method works well even with sampling-based approximations when exact token probabilities aren't available (see the sketch after this list)
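A quick sketch of that sampling-based approximation, assuming a hypothetical `sample_response` callable that returns one sampled completion from the API:

```python
def estimate_yes_probability(prompt: str, sample_response, n_samples: int = 20) -> float:
    """Monte Carlo estimate of P('yes') for a yes/no follow-up prompt,
    used when the API does not expose token probabilities."""
    yes_count = sum(
        sample_response(prompt).strip().lower().startswith("yes")
        for _ in range(n_samples)
    )
    return yes_count / n_samples
```

More samples give a tighter estimate; per the paper, these approximate features remain useful for the downstream predictors.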
-----
📊 Results:
→ Outperforms white-box methods on open-ended QA tasks
→ Achieves 86-100% accuracy in detecting adversarially influenced models
→ Can distinguish between different model architectures with near-perfect accuracy