Spider 2.0 Exposes Text-to-SQL Accuracy Crisis in Enterprise

TL;DR

A new benchmark shows that models scoring around 90% on Spider 1.0 drop to 10-20% on Spider 2.0's enterprise-scale databases, which suggests that older evaluation frameworks overstate production readiness.

Key Points

Spider 2.0 databases average 812 columns (vs. <10 in Spider 1.0), with some exceeding 3,000 columns
Models that dominate Spider 1.0 see success rates plummet to 10-20% on the full Spider 2.0 benchmark
Spider 2.0 introduces agentic workflows that require schema linking, handling multiple SQL dialects (Snowflake, BigQuery, T-SQL), and navigating external documentation
Accepted to ICLR 2025 as an oral presentation; released in November 2024

Why It Matters

For engineers building Text-to-SQL systems, this exposes the critical gap between academic benchmarks and production reality. A database query is either right or wrong, so even 90% accuracy is of little industrial use: a single hallucinated table or misinterpreted filter erodes user trust and can lead to catastrophic business decisions. Organizations must adopt Spider 2.0-level evaluation rigor to build genuinely reliable AI-driven analytics.
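To make the all-or-nothing nature of query correctness concrete, here is a minimal sketch of an execution-match check, assuming a toy in-memory SQLite database. The `orders` table, the sample queries, and the `execution_match` helper are all hypothetical illustrations; this is not Spider 2.0's actual evaluation harness, which targets far larger schemas and other SQL dialects.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True only if the predicted query runs and returns the
    same multiset of rows as the gold query (row order is ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute (e.g. a hallucinated table)
        # simply counts as wrong; there is no partial credit.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def build_demo_db():
    """Create a tiny, hypothetical orders table for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EMEA", 100.0), (2, "EMEA", 250.0), (3, "APAC", 75.0)],
    )
    return conn


conn = build_demo_db()
gold = "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'"

# Written differently, but returns the same result as the gold query.
correct = "SELECT SUM(o.amount) FROM orders AS o WHERE o.region = 'EMEA'"
# Plausible-looking SQL with a misinterpreted filter.
wrong_filter = "SELECT SUM(amount) FROM orders WHERE region = 'APAC'"
# References a table that does not exist in the schema.
hallucinated = "SELECT SUM(amount) FROM sales WHERE region = 'EMEA'"

for name, sql in [("correct", correct),
                  ("wrong_filter", wrong_filter),
                  ("hallucinated", hallucinated)]:
    print(name, execution_match(conn, sql, gold))
```

Both failure cases score zero even though the wrong-filter query is syntactically valid and runs without error, which is why a silent mistake of this kind is hard for an end user to spot. Because the comparison ignores row order, a check like this would be too lenient for questions where ordering matters.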

Read the Spider 2.0 paper on arXiv

Source: towardsdatascience.com