
"Predicting the Performance of Black-box LLMs through Self-Queries"

A podcast on this paper was generated with Google's Illuminate.

LLMs can predict their own mistakes by answering simple yes/no questions about their responses

This paper introduces a method to predict when LLMs will make mistakes by asking them follow-up questions about their own answers, without needing access to their internal workings.

-----

https://arxiv.org/abs/2501.01558

🤔 Original Problem:

→ Current methods for understanding LLM behavior require access to internal model representations, which isn't available for closed-source, API-only models like GPT-4

→ Need reliable ways to predict when these black-box models might fail

-----

🔍 Solution in this Paper:

→ Introduces QueRE (Question Representation Elicitation), which asks LLMs follow-up questions about their outputs

→ Uses probability distributions of yes/no responses to these questions as features

→ Combines these with pre/post confidence scores and answer distribution probabilities

→ Trains simple linear predictors on these features to forecast model behavior (see the sketch below)
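
The sketch below illustrates this pipeline under stated assumptions: `ask_yes_no_prob` is a hypothetical stand-in for a black-box API call that returns P("yes") for a follow-up question (via token log-probs or repeated sampling), the elicitation questions are illustrative rather than the paper's exact set, and the probe is plain scikit-learn logistic regression trained on synthetic labels so the example runs end to end.

```python
# Minimal sketch of QueRE-style feature extraction from a black-box LLM.
# NOTE: ask_yes_no_prob and the confidence stubs are hypothetical placeholders,
# not the paper's or any provider's real API.
import numpy as np
from sklearn.linear_model import LogisticRegression

ELICITATION_QUESTIONS = [
    "Are you confident in your answer?",
    "Is your answer likely to be correct?",
    "Did the question contain enough information to answer it?",
    "Would you give the same answer if asked again?",
]

def ask_yes_no_prob(question: str, answer: str, follow_up: str,
                    rng: np.random.Generator) -> float:
    """Placeholder for a black-box call returning P('yes') to a follow-up
    question about the model's own answer; random here so the sketch runs
    without an API key."""
    return float(rng.uniform())

def quere_features(question: str, answer: str,
                   rng: np.random.Generator) -> np.ndarray:
    """One P('yes') per elicitation question, plus stubbed pre/post
    confidence scores, concatenated into a feature vector."""
    probs = [ask_yes_no_prob(question, answer, q, rng)
             for q in ELICITATION_QUESTIONS]
    pre_conf = rng.uniform()   # stand-in: confidence asked before answering
    post_conf = rng.uniform()  # stand-in: confidence asked after answering
    return np.array(probs + [pre_conf, post_conf])

# Train a simple linear predictor of correctness on these features
# (synthetic labels purely for illustration).
rng = np.random.default_rng(0)
X = np.stack([quere_features(f"q{i}", f"a{i}", rng) for i in range(200)])
y = rng.integers(0, 2, size=200)  # stand-in for "was the answer correct?"
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```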

-----

💡 Key Insights:

→ Even random sequences of text used as elicitation prompts can extract useful information about model behavior

→ Performance improves with more elicitation questions but plateaus after ~50 questions

→ The method works well even with sampling-based approximations when exact output probabilities aren't available (see the sketch below)
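
As a sketch of that last point: when the API exposes no token log-probs, P("yes") can be estimated by Monte Carlo sampling. Here `sample_yes_no` is a hypothetical stand-in for one temperature>0 call to the black-box model; it returns random answers so the snippet runs standalone.

```python
# Sampling-based approximation of P("yes") when exact probabilities are unavailable.
import random

def sample_yes_no(prompt: str) -> str:
    """Placeholder for one stochastic black-box call returning 'yes' or 'no'."""
    return random.choice(["yes", "no"])

def estimate_p_yes(prompt: str, n_samples: int = 20) -> float:
    """Empirical frequency of 'yes' over repeated samples; more samples
    give a tighter estimate of the underlying probability."""
    hits = sum(sample_yes_no(prompt) == "yes" for _ in range(n_samples))
    return hits / n_samples

print(estimate_p_yes("Is your previous answer correct? Reply yes or no.",
                     n_samples=40))
```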

-----

📊 Results:

→ Outperforms white-box methods on open-ended QA tasks

→ Achieves 86-100% accuracy in detecting adversarially influenced models

→ Can distinguish between different model architectures with near-perfect accuracy
