A benchmark that measures LLMs' real data science capabilities, not just theoretical knowledge
FeatEng, proposed in this paper, tests whether LLMs can actually improve ML model performance through practical feature engineering
https://arxiv.org/abs/2410.23331
🎯 Original Problem:
Existing LLM benchmarks fail to evaluate practical usability, domain knowledge application, and skill integration simultaneously. They often focus on isolated capabilities rather than real-world problem-solving abilities.
-----
🔧 Solution in this Paper:
→ Created FeatEng, a benchmark evaluating LLMs' ability to write feature engineering code for data science tasks
→ Uses 103 diverse datasets across healthcare, finance, entertainment, and other domains
→ Models generate Python code to transform datasets, aiming to improve XGBoost model performance
→ Performance measured by accuracy improvement over the baseline (untransformed data); a minimal sketch of this evaluation loop follows the list
→ Incorporates practical value, domain expertise, and the integration of multiple skills
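The sketch below shows one way such an evaluation loop can be wired up. It is not the paper's exact harness: the train/test split, accuracy metric, XGBoost hyperparameters, and the `transform` signature (the LLM-generated function being graded) are illustrative assumptions, and the paper aggregates scores over 103 datasets rather than a single one.

```python
# Minimal sketch of a FeatEng-style scoring loop (assumptions noted above).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def xgb_accuracy(df: pd.DataFrame, target: str) -> float:
    """Train a fixed XGBoost classifier and return held-out accuracy."""
    X = pd.get_dummies(df.drop(columns=[target])).astype(float)
    y = pd.factorize(df[target])[0]  # encode labels as integers
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = XGBClassifier(n_estimators=200, max_depth=6, random_state=0)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

def feateng_score(df: pd.DataFrame, transform, target: str) -> float:
    """Relative improvement of the LLM-generated `transform` over raw data.

    Assumes `transform` returns a DataFrame that still contains the target column.
    """
    baseline = xgb_accuracy(df, target)
    improved = xgb_accuracy(transform(df.copy()), target)
    return (improved - baseline) / baseline
```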
-----
💡 Key Insights:
→ Domain knowledge is crucial: models with better domain understanding generated more sophisticated features
→ Strong correlation (0.878) found between FeatEng scores and Chatbot Arena rankings
→ Basic code generation skills alone proved insufficient for high performance
→ Feature engineering capabilities vary significantly across models, from basic data cleaning to complex domain-specific transformations (illustrated below)
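To make that range concrete, here is a hypothetical contrast between the two ends of the spectrum: a weaker model that only cleans the data versus a stronger model that derives domain-informed features. The column names (height_cm, weight_kg, age) are assumptions for illustration, not taken from the benchmark's datasets.

```python
# Hypothetical illustration of the capability range described above.
import pandas as pd

def basic_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """What a weaker model tends to produce: deduplicate and impute numerics."""
    return df.drop_duplicates().fillna(df.median(numeric_only=True))

def domain_aware_features(df: pd.DataFrame) -> pd.DataFrame:
    """What a stronger model may add: clinically meaningful derived features."""
    df = basic_cleaning(df)
    df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2  # body mass index
    df["is_elderly"] = (df["age"] >= 65).astype(int)            # common risk cutoff
    return df
```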
-----
📊 Results:
→ O1-Preview achieved best performance with 11%+ improvement over baseline
→ Gemini models, GPT-4, and O1-Mini formed the second tier of performers
→ AutoML baseline achieved 10.19% improvement
→ Models needed at least Mixtral 8x7B-level capability to show noticeable improvements