New benchmark to evaluate LLMs on actual enterprise data engineering workflows
Shows LLMs still struggle with the complex SQL tasks that data engineers handle daily
https://arxiv.org/abs/2411.07763
🎯 Original Problem:
Current text-to-SQL benchmarks use simplified databases with few tables and basic queries, failing to reflect real enterprise scenarios with massive schemas, diverse SQL dialects, and complex data workflows.
-----
🔧 Solution in this Paper:
→ Spider 2.0 introduces 632 real-world text-to-SQL workflow problems built on enterprise-level databases averaging 812 columns, with complex nested structures.
→ The benchmark supports multiple SQL dialects across BigQuery, Snowflake, SQLite, and other systems, requiring interaction with project codebases and documentation.
→ Tasks include data wrangling, transformation, and analytics workflows, with ground-truth SQL queries averaging 144 tokens.
→ The framework provides both a code-agent setting and a traditional text-to-SQL evaluation setting (a toy agent loop is sketched below).
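To make the code-agent setting concrete, here is a minimal toy sketch of such a loop. This is illustrative only, not the paper's actual agent framework: the `llm` function, the prompt format, and the use of SQLite are stand-in assumptions.

```python
import sqlite3

def llm(prompt: str) -> str:
    """Stand-in for a model call -- hypothetical, not the paper's framework."""
    raise NotImplementedError("plug in your model API here")

def code_agent(question: str, db_path: str, max_turns: int = 5) -> str:
    """Draft SQL, execute it, and feed any error back to the model for another try."""
    conn = sqlite3.connect(db_path)
    # Pull the CREATE statements so the model sees the (potentially huge) schema.
    schema = "\n".join(row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
    feedback = ""
    sql = ""
    for _ in range(max_turns):
        sql = llm(f"Schema:\n{schema}\n\nQuestion: {question}{feedback}\n"
                  "Write one SQL query.")
        try:
            conn.execute(sql).fetchmany(5)  # sanity-check that the query executes
            return sql  # a fuller loop would also feed sample rows back for refinement
        except sqlite3.Error as err:
            feedback += f"\nPrevious attempt failed: {err}"  # observation for next turn
    return sql
```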
-----
💡 Key Insights:
→ Real enterprise SQL tasks require handling massive schemas, multiple dialects, and complex project contexts
→ Current LLMs struggle with schema linking in large databases and with complex query planning (see the sketch after this list)
→ Project-level data transformation tasks pose significant challenges
→ Traditional text-to-SQL approaches fail on real-world enterprise scenarios
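Schema linking, one of the failure modes above, means deciding which of hundreds of columns a question actually refers to. Below is a deliberately naive sketch using token overlap; the table and column names are hypothetical, and real systems use far stronger retrieval (embeddings, value sampling).

```python
def link_schema(question: str, columns: dict[str, list[str]], top_k: int = 5):
    """Rank table.column candidates by token overlap with the question."""
    q_tokens = set(question.lower().replace("?", "").split())
    scored = []
    for table, cols in columns.items():
        for col in cols:
            col_tokens = set(col.lower().split("_"))  # split snake_case names
            overlap = len(q_tokens & col_tokens)
            if overlap:
                scored.append((overlap, f"{table}.{col}"))
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

# Hypothetical slice of an 800+ column enterprise schema:
schema = {"orders": ["order_id", "customer_id", "order_total_usd"],
          "customers": ["customer_id", "signup_date", "region"]}
print(link_schema("What is the total order value per customer region?", schema))
```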
-----
📊 Results:
→ The best code-agent framework achieved only a 17.0% success rate on Spider 2.0
→ The most advanced text-to-SQL parser reached only 5.7% execution accuracy on Spider 2.0-lite (the metric is sketched below)
→ The drop from 91.2% on Spider 1.0 to 17.0% on Spider 2.0 exposes a significant gap between benchmark performance and real enterprise workloads
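For context on the 5.7% figure: execution accuracy generally counts a prediction as correct when it returns the same result set as the gold SQL. A minimal order-insensitive sketch of that check on SQLite (the paper's exact metric may differ in details):

```python
import sqlite3
from collections import Counter

def execution_match(pred_sql: str, gold_sql: str, conn: sqlite3.Connection) -> bool:
    """True if both queries run and return the same multiset of rows."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute scores zero
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)  # ignore row order

# Tiny illustration on a throwaway in-memory table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
print(execution_match("SELECT x FROM t ORDER BY x DESC", "SELECT x FROM t", conn))  # True
```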