BenTo: Benchmark Task Reduction with In-Context Transferability
BENTO, proposed in this paper, slashes LLM testing costs by finding the essential 5% of benchmark tasks
Smart task selection lets you evaluate LLMs using 95% fewer benchmark tests
Original Problem 🎯:
Evaluating LLMs requires testing on large benchmark datasets, often with 50+ diverse tasks. This process is computationally expensive and time-consuming, yet there's no clear way to reduce tasks while maintaining evaluation quality.
Solution in this Paper 🛠️:
→ BENTO measures task transferability with in-context learning: prompting the model with examples from task A and checking how well it then solves task B (see the sketch after this list)
→ Creates a transferability matrix showing how knowledge transfers between task pairs
→ Applies spectral clustering to group related tasks based on transferability
→ Uses facility location optimization to select the most representative tasks
→ Employs Laplacian Eigenmaps to enhance similarity measurements between tasks
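The transferability measurement can be sketched in a few lines. Below is a minimal illustration, assuming two hypothetical helpers, `icl_accuracy(demo_task, target_task)` and `zero_shot_accuracy(task)`, that query an LLM and return accuracy; the helper names and the exact scoring are illustrative assumptions, not the paper's API.

```python
import numpy as np

def transferability_matrix(tasks, icl_accuracy, zero_shot_accuracy):
    """Estimate pairwise in-context transferability between tasks.

    T[i, j] measures how much prompting with demonstrations from tasks[i]
    helps the model on tasks[j], relative to zero-shot performance on
    tasks[j]. Both accuracy callables are hypothetical stand-ins for
    running the LLM on the benchmark.
    """
    n = len(tasks)
    T = np.zeros((n, n))
    # Zero-shot baseline accuracy for each target task.
    base = np.array([zero_shot_accuracy(t) for t in tasks])
    for i, demo in enumerate(tasks):
        for j, target in enumerate(tasks):
            # Gain from using task i's examples as in-context demos for task j.
            T[i, j] = icl_accuracy(demo, target) - base[j]
    return T
```

Because this only requires forward passes with different prompts, no fine-tuning is involved, which is what makes computing the full pairwise matrix affordable even for benchmarks with 50+ tasks.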
Key Insights 💡:
→ Tasks naturally cluster into themed groups (like History, Science) with higher intra-cluster transferability
→ A small subset of tasks can effectively represent the full benchmark
→ In-context learning provides an efficient, training-free way to measure task relationships
→ Task selection works better with processed similarities than with raw transferability scores (a sketch of this processing step follows this list)
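As a concrete illustration of that processing step, here is a minimal sketch using scikit-learn's `SpectralEmbedding`, an implementation of Laplacian Eigenmaps. Symmetrizing the raw transferability matrix, the embedding dimension, and cosine similarity in the embedded space are all illustrative choices, not details fixed by the summary above.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding  # Laplacian Eigenmaps

def processed_similarity(T, dim=8):
    """Turn a raw transferability matrix T into a smoother task-similarity
    matrix by embedding tasks with Laplacian Eigenmaps and comparing them
    in the embedding space. `dim` is an arbitrary illustrative choice."""
    A = (T + T.T) / 2.0   # symmetrize: affinities must be symmetric
    A -= A.min()          # shift so all affinities are nonnegative
    X = SpectralEmbedding(n_components=dim, affinity="precomputed").fit_transform(A)
    # Cosine similarity between embedded tasks.
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    return X @ X.T
```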
Results 📊:
→ Reduces the MMLU benchmark to 5% of its tasks while preserving 97% evaluation accuracy
→ Outperforms GPT-4 and BM25 baselines in task selection
→ Works across multiple benchmarks: MMLU, FLAN, AGIEval, and Big-Bench Hard
→ Achieves <4% error rate when evaluating 9 different LLMs
🎯 BENTO's benchmark reduction method
BENTO treats task selection as a facility location problem: pick a subset of tasks such that every task in the benchmark is highly similar to at least one selected task. The similarity matrix is either the transferability matrix used directly or a processed version obtained via a Laplacian Eigenmaps embedding.
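The facility-location objective is submodular, so the standard greedy algorithm gives a (1 − 1/e)-approximation. A minimal sketch of that greedy selection, assuming a nonnegative task-similarity matrix `S` (raw transferability or the processed version above):

```python
import numpy as np

def facility_location_select(S, k):
    """Greedily pick k tasks maximizing sum_j max_{i in selected} S[i, j]:
    every benchmark task should be well covered by its most similar
    selected task. Assumes S is a nonnegative n x n similarity matrix."""
    n = S.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity each task has to the current subset
    for _ in range(k):
        # Marginal gain of each candidate: how much it improves total coverage.
        gains = np.maximum(S, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf  # never re-pick an already selected task
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, S[best])
    return selected
```

Evaluating the model on the k selected tasks (e.g., k ≈ 5% of the benchmark) then stands in for evaluating on the full suite.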



