TL;DR
A new benchmark shows that models achieving 90% on Spider 1.0 drop to 10-20% on Spider 2.0's enterprise-scale databases, exposing how poorly older evaluation frameworks reflect production workloads.
Key Points
- Spider 2.0 databases average 812 columns (vs. <10 in Spider 1.0), with some exceeding 3,000 columns
- Models dominating Spider 1.0 see success rates plummet to 10-20% on the full Spider 2.0 benchmark
- Spider 2.0 introduces agentic workflows that require schema linking, handling multiple SQL dialects (Snowflake, BigQuery, T-SQL), and navigating external documentation
- Accepted to ICLR 2025 as an oral presentation; released November 2024
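To make "schema linking" concrete: with hundreds of columns per database, a system first has to narrow the schema down to the columns relevant to a question. The sketch below is a deliberately naive illustration that ranks columns by token overlap with the question. It is not the method used by Spider 2.0 or any model evaluated on it; the function names `tokenize` and `link_schema` and the example tables and columns are all hypothetical.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase and split on non-alphanumerics (this also splits snake_case)."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

def link_schema(question: str, columns: list, top_k: int = 3) -> list:
    """Rank 'table.column' names by token overlap with the question.

    Hypothetical helper for illustration only; real systems typically use
    embeddings or an LLM instead of exact token matches.
    """
    q_tokens = tokenize(question)
    scored = []
    for col in columns:
        overlap = len(q_tokens & tokenize(col))
        if overlap > 0:
            scored.append((overlap, col))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [col for _, col in scored[:top_k]]

# Made-up schema; a Spider 2.0 database would have hundreds of such entries.
columns = [
    "orders.order_id",
    "orders.customer_id",
    "orders.order_total",
    "customers.customer_id",
    "customers.region",
    "web_events.session_id",
]
selected = link_schema("total order value per customer region", columns)
print(selected)
```

Exact token matching like this breaks down quickly when column names are abbreviated or documented only in external files, which is part of why large schemas are hard.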
Why It Matters
For engineers building Text-to-SQL systems, this exposes the critical gap between academic benchmarks and production reality. A query result is either right or wrong, so 90% accuracy is industrially useless: a single hallucinated table or misinterpreted filter erodes user trust and can lead to catastrophic business decisions. Organizations must adopt Spider 2.0-level evaluation rigor to build genuinely reliable AI-driven analytics.
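The all-or-nothing nature of query correctness can be illustrated with a simplified execution-match check: run the predicted and reference queries and compare the returned rows. This is a minimal sketch using Python's built-in `sqlite3` module, not the benchmark's official scorer; the `execution_match` helper, the `orders` table, and the sample queries are assumptions made up for the example.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True only if both queries yield the same multiset of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    # Sort by repr so row order does not affect the comparison.
    return sorted(predicted, key=repr) == sorted(gold, key=repr)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, status TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, 'EU', 'paid', 100.0),
        (2, 'EU', 'refunded', 40.0),
        (3, 'US', 'paid', 70.0);
""")

gold = "SELECT region, SUM(total) FROM orders WHERE status = 'paid' GROUP BY region"
# Plausible-looking prediction that silently drops the status filter.
missing_filter = "SELECT region, SUM(total) FROM orders GROUP BY region"
# Prediction that references a table that does not exist.
hallucinated = "SELECT region, SUM(total) FROM sales GROUP BY region"

results = {
    "gold": execution_match(conn, gold, gold),
    "missing_filter": execution_match(conn, missing_filter, gold),
    "hallucinated": execution_match(conn, hallucinated, gold),
}
print(results)
```

The hallucinated table at least fails loudly. The dropped filter is the more dangerous case: it runs cleanly and returns a plausible but wrong revenue figure, and there is no partial credit for being close.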
Source: towardsdatascience.com