
"Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows"

The podcast on this paper was generated with Google's Illuminate.

New benchmark to evaluate LLMs on actual enterprise data engineering workflows

Shows that LLMs still struggle with the complex SQL tasks data engineers handle daily

https://arxiv.org/abs/2411.07763

🎯 Original Problem:

Current text-to-SQL benchmarks use simplified databases with few tables and basic queries, failing to reflect real enterprise scenarios with massive schemas, diverse SQL dialects, and complex data workflows.

-----

🔧 Solution in this Paper:

→ Spider 2.0 introduces 632 real-world text-to-SQL workflow problems built on enterprise-level databases that average 812 columns and contain complex nested structures (see the illustrative query after this list).

→ The benchmark supports multiple SQL dialects across BigQuery, Snowflake, SQLite, and other systems, requiring interaction with project codebases and documentation.

→ Tasks include data wrangling, transformation, and analytics workflows, with ground-truth SQL queries averaging 144 tokens.

→ The framework provides both code agent and traditional text-to-SQL evaluation settings.
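
To make that concrete, here is a hedged sketch of the kind of query such workflows demand. It is not taken from the benchmark: the dataset, table, and column names (`analytics.raw_events`, `items`, `event_name`) are invented, but the pattern, flattening a BigQuery nested array with UNNEST and then aggregating, matches the dialect-specific, nested-schema workloads the paper describes.

```sql
-- Hypothetical BigQuery query; all identifiers are invented for illustration.
WITH daily_purchases AS (
  SELECT
    user_id,
    DATE(event_timestamp) AS event_date,
    item.product_id,          -- fields pulled out of a nested, repeated column
    item.quantity
  FROM `analytics.raw_events`,
       UNNEST(items) AS item  -- BigQuery-specific flattening of ARRAY<STRUCT>
  WHERE event_name = 'purchase'
)
SELECT
  event_date,
  product_id,
  SUM(quantity)            AS units_sold,
  COUNT(DISTINCT user_id)  AS buyers
FROM daily_purchases
GROUP BY event_date, product_id
ORDER BY event_date, units_sold DESC;
```

Even this small example requires knowing the engine's flattening syntax and picking the right handful of columns out of hundreds; the benchmark's gold queries average 144 tokens, noticeably longer than this one.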

-----

💡 Key Insights:

→ Real enterprise SQL tasks require handling massive schemas, multiple dialects, and complex project contexts (the dialect sketch after this list shows why a single parser rarely transfers across engines)

→ Current LLMs struggle with schema linking in large databases and complex query planning

→ Project-level data transformation tasks pose significant challenges

→ Traditional text-to-SQL approaches fail on real-world enterprise scenarios
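
As one concrete illustration of the dialect problem, the same "events in the last 30 days" filter must be written differently on each engine. The table and column names below are invented; only the dialect syntax is the point.

```sql
-- BigQuery: backtick-quoted paths, DATE_SUB with INTERVAL
SELECT COUNT(*) FROM `proj.ds.events`
WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Snowflake: DATEADD with a negative offset
SELECT COUNT(*) FROM events
WHERE TO_DATE(event_ts) >= DATEADD(day, -30, CURRENT_DATE());

-- SQLite: string modifiers on DATE()
SELECT COUNT(*) FROM events
WHERE DATE(event_ts) >= DATE('now', '-30 days');
```

A model that has memorized one dialect's date functions will emit syntax errors on the others, which is part of why execution accuracy collapses on this benchmark.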

-----

📊 Results:

→ The best code agent framework achieved only a 17.0% success rate on Spider 2.0

→ The most advanced text-to-SQL parser reached only 5.7% execution accuracy on Spider 2.0-lite

→ The drop from 91.2% execution accuracy on Spider 1.0 to a 17.0% success rate on Spider 2.0 shows how far current models are from handling real enterprise workflows
