BenTo: Benchmark Task Reduction with In-Context Transferability
BENTO, proposed in this paper, slashes LLM testing costs by finding the essential 5% of benchmark tasks
Smart task selection lets you evaluate LLMs using 95% fewer benchmark tests
Original Problem 🎯:
Evaluating LLMs requires testing on large benchmark suites, often spanning 50+ diverse tasks. This is computationally expensive and time-consuming, yet there has been no principled way to shrink the task set while preserving evaluation quality.
Solution in this Paper 🛠️:
→ BENTO measures task transferability via in-context learning: how well examples from Task A help the model solve Task B (sketched after this list)
→ Creates a transferability matrix showing how knowledge transfers between task pairs
→ Applies spectral clustering to group related tasks based on transferability
→ Uses facility location optimization to select the most representative tasks
→ Employs Laplacian Eigenmaps to enhance similarity measurements between tasks
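As a concrete illustration of the first two steps, here is a minimal sketch of building the transferability matrix. `eval_with_demos` is a hypothetical helper (not from the paper) standing in for one LLM evaluation: score a target task when the prompt carries few-shot demonstrations drawn from a source task.

```python
import numpy as np

def transferability_matrix(tasks, eval_with_demos):
    """Pairwise in-context transferability matrix T.

    T[i, j] estimates how well demonstrations from tasks[i]
    help the LLM solve tasks[j] (higher = better transfer).
    eval_with_demos(source, target) is a hypothetical callable
    returning target-task accuracy under source-task demos.
    """
    n = len(tasks)
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            T[i, j] = eval_with_demos(tasks[i], tasks[j])
    return T
```

The row/column structure of T is what the spectral clustering step then groups into themed task clusters.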
Key Insights 💡:
→ Tasks naturally cluster into themed groups (like History, Science) with higher intra-cluster transferability
→ A small subset of tasks can effectively represent the full benchmark
→ In-context learning provides an efficient, training-free way to measure task relationships
→ Task selection works better with processed similarities than with raw transferability scores (see the sketch below)
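One way such processing can look, as a hedged sketch: symmetrize the raw transferability matrix, embed the tasks with a Laplacian Eigenmap, and recompute similarity in the embedded space. The embedding dimension and the RBF kernel are illustrative choices here, not the paper's exact hyperparameters.

```python
import numpy as np

def eigenmap_similarity(T, dim=8):
    """Denoise task similarity via a Laplacian Eigenmaps embedding.

    T: raw (possibly asymmetric) transferability matrix.
    Returns a task-similarity matrix computed from the
    low-dimensional spectral embedding of the task graph.
    """
    W = np.clip((T + T.T) / 2, 0, None)   # symmetric, non-negative weights
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    Y = eigvecs[:, 1:dim + 1]             # skip the trivial constant eigenvector
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (sq.mean() + 1e-12))  # RBF similarity in embedded space
```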
Results 📊:
→ Reduces the MMLU benchmark to 5% of its tasks while maintaining 97% evaluation accuracy
→ Outperforms GPT-4 and BM25 task-selection baselines
→ Works across multiple benchmarks: MMLU, FLAN, AGIEval, and Big-Bench Hard
→ Achieves <4% evaluation error across 9 different LLMs
🎯 BENTO's benchmark reduction method
BENTO treats task selection as a facility location problem: pick the subset of tasks whose combined similarity to all other tasks is maximal. The similarity matrix is either the raw transferability scores or the processed version obtained via the Laplacian Eigenmaps embedding.
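A minimal greedy sketch of this facility location step, assuming a precomputed (n, n) similarity matrix S; the function name and the greedy variant are illustrative, and the paper's exact solver may differ.

```python
import numpy as np

def facility_location_select(S, k):
    """Greedily pick k tasks maximizing sum_j max_{i in chosen} S[i, j].

    The objective is monotone submodular, so the greedy rule carries
    the standard (1 - 1/e) approximation guarantee.
    """
    n = S.shape[0]
    chosen, covered = [], np.zeros(n)
    for _ in range(k):
        # Total coverage each candidate would yield if added to the chosen set
        gains = np.maximum(S, covered).sum(axis=1)
        gains[chosen] = -np.inf            # never re-pick a chosen task
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(covered, S[best])
    return chosen
```

Evaluating a new LLM on the chosen tasks alone, then aggregating, approximates its full-benchmark score at a fraction of the cost.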