Guide April 18, 2026 · 16 mins · The D23 Team

Text-to-SQL Accuracy in 2026: What Actually Works (and What Doesn't)

Real benchmarks, failure modes, and production-grade text-to-SQL strategies for 2026. What works, what doesn't, and when to use semantic layers instead.


Text-to-SQL—the ability to convert natural language into executable SQL queries—has become the holy grail of self-serve analytics. Users ask questions in plain English; the system generates the SQL; dashboards populate. No SQL expertise required. No data engineering bottleneck.

The promise is seductive. The reality is messier.

In 2026, we’re seeing a widening gap between what benchmark scores claim and what actually works in production. A model that achieves 87% accuracy on a curated test set might fail catastrophically on your schema. Semantic layers are resurfacing as a more reliable alternative for mission-critical use cases. And the real bottleneck isn’t the LLM—it’s schema design, context window limitations, and the sheer complexity of business logic.

This article cuts through the hype. We’ll walk through what text-to-SQL actually does well, where it breaks, and how to decide whether it’s right for your analytics stack—especially if you’re evaluating D23’s managed Apache Superset platform for embedded analytics or self-serve BI.

The State of Text-to-SQL Benchmarks in 2026

Let’s start with what the numbers say—and why they’re often misleading.

Benchmarks like BIRD (BIg Bench for LaRge-scale Database Grounded text-to-SQL Evaluation) and Spider 2.0 have become the de facto standard for measuring text-to-SQL accuracy. Each test case pairs a natural language question and a database schema with an expected SQL query. The model generates SQL, and if the query matches the reference or produces the same result, it’s marked correct.

Sounds objective. It’s not.

Research published in 2026, Text-to-SQL Benchmarks are Broken: An In-Depth Analysis, reveals critical flaws in these widely cited benchmarks: annotation errors in 52.8% of test cases across BIRD and Spider 2.0, meaning the “correct” answers in the test set are themselves wrong or ambiguous. When researchers corrected these errors and re-evaluated models, the reported accuracy gaps collapsed. A model claiming 82% accuracy might actually be performing at 76% on a corrected benchmark.

Why does this matter? Because teams building analytics platforms—especially those embedding text-to-SQL into products—are making architectural decisions based on flawed data. They’re assuming a level of reliability that doesn’t exist.

The second issue is the gap between benchmark and production. Benchmarks use clean, well-documented schemas. They test against a fixed set of questions. Real-world databases are messy. Business logic is implicit. Users ask questions that don’t map neatly to the schema. And they ask variations that the model has never seen.

As 43% Accuracy with Opus-4.6: Will Text-to-SQL Ever Be Good Enough? reports, Berkeley’s Data Agent Benchmark, which tests on real, unfiltered questions, shows that even frontier models like Claude Opus 4.6 achieve only 43% accuracy on genuinely novel questions. That’s a gap of more than 40 percentage points from the BIRD benchmark scores.

So when you read “text-to-SQL achieves 87% accuracy,” you’re reading a number that:

  1. May be based on a flawed benchmark
  2. Doesn’t account for semantic errors (queries that run but return wrong results)
  3. Doesn’t reflect performance on your actual schema and user questions
  4. Doesn’t measure latency, cost, or operational reliability

What Text-to-SQL Actually Gets Right

Despite the caveats, text-to-SQL isn’t a dead end. It works well in specific, bounded contexts.

Simple, unambiguous questions on well-designed schemas. If your database has clear naming conventions—customer_id, order_date, revenue—and users ask straightforward questions like “How many orders did we ship last month?”, text-to-SQL performs reliably. Models handle basic aggregations, filters, and joins with high accuracy.

Narrow domains with consistent terminology. If you’re analyzing a single product’s metrics—say, a SaaS platform’s usage data—the schema is smaller, the business logic is more consistent, and the model has less room to misinterpret context. A model trained on financial data will generate more accurate SQL for banking queries than for healthcare.

Iterative refinement with human feedback. Text-to-SQL works better when it’s not a one-shot process. If users can see the generated query, modify it, and re-run it, the system becomes more reliable. The human becomes the quality gate. This is especially effective when embedded in tools like D23’s Apache Superset dashboards, where users can inspect and adjust queries before accepting results.

Queries that don’t require complex business logic. If your question can be answered with basic SQL—SELECT, WHERE, GROUP BY, ORDER BY—text-to-SQL is solid. But if you need window functions, CTEs, recursive queries, or domain-specific calculations, accuracy drops sharply.

Research from Text-to-SQL: Comparison of LLM Accuracy comparing eight major LLMs found that accuracy varies significantly by query complexity. Simple queries achieved 85%+ accuracy across models; complex queries with multiple joins or subqueries dropped to 60-70%.

The other strength: cost and speed. Text-to-SQL is fast. A single API call generates SQL in milliseconds. Compare that to traditional BI platforms where a data engineer writes and tests queries, or where users need to learn SQL syntax. For exploratory analysis, the speed advantage is real, even if accuracy isn’t perfect.

Where Text-to-SQL Fails (and Fails Hard)

Now let’s talk about failure modes. These matter because they’re where production systems break.

Semantic errors. The most insidious failure is when the generated query is syntactically correct and runs without error—but returns the wrong answer. A model might generate:

SELECT customer_id, SUM(amount) FROM orders WHERE status = 'completed' GROUP BY customer_id

When the user actually wanted:

SELECT customer_id, SUM(amount) FROM orders WHERE status = 'completed' AND created_at >= '2026-01-01' GROUP BY customer_id

The first query runs fine. It just answers the wrong question. Benchmarks often miss these because they only check if the query syntax is valid and produces some result. They don’t validate that the result is correct for the user’s actual intent.
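
This failure mode is easy to reproduce. The sketch below, using an in-memory SQLite database with made-up rows, runs both queries: each executes cleanly, but they answer different questions.

```python
import sqlite3

# In-memory database with an illustrative orders table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        customer_id INTEGER,
        amount REAL,
        status TEXT,
        created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100.0, "completed", "2025-11-15"),  # before the intended cutoff
        (1, 50.0, "completed", "2026-02-01"),
        (2, 75.0, "completed", "2026-03-10"),
    ],
)

# The query the model generated: syntactically valid, runs without error.
generated = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "WHERE status = 'completed' GROUP BY customer_id"
).fetchall()

# The query the user intended: adds the date filter.
intended = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "WHERE status = 'completed' AND created_at >= '2026-01-01' "
    "GROUP BY customer_id"
).fetchall()

print(sorted(generated))  # [(1, 150.0), (2, 75.0)]
print(sorted(intended))   # [(1, 50.0), (2, 75.0)]
```

Customer 1’s total differs by 100.0, and nothing in the database or the query surface signals that anything went wrong.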

The study Both Ends Count! Just How Good are LLM Agents at Text-to-Big SQL extends evaluation metrics to account for this, showing that when you measure query correctness holistically, not just syntax, performance drops significantly.

Schema misunderstanding. Text-to-SQL fails when the schema is ambiguous or when business logic isn’t encoded in the schema itself. Consider a table with columns like user_id, type, and value. What does type mean? Is value revenue, count, or something else? If the model guesses wrong, the query is wrong.

This is especially problematic in legacy databases where naming conventions are inconsistent. A table might have cust_id in one place and customer_identifier in another. The model has to infer that they’re the same thing—and it often gets this wrong.

Complex business logic. Real analytics require context that isn’t in the database schema. “Revenue” might mean different things depending on whether you’re looking at bookings, billings, or cash collected. “Active users” might exclude test accounts, internal users, or trial accounts. Text-to-SQL has no way to know these rules unless they’re explicitly encoded.

This is where semantic layers shine—and where text-to-SQL struggles. A semantic layer (like dbt’s Semantic Layer) pre-defines metrics, dimensions, and relationships. The LLM doesn’t have to infer them; it just has to use them correctly.

Multi-table reasoning. Queries that require joining three or more tables, especially with non-obvious join keys, are where text-to-SQL accuracy collapses. A model might:

  • Join on the wrong column
  • Miss a necessary join entirely
  • Create a Cartesian product by joining incorrectly
  • Misunderstand the cardinality of relationships

According to Text to SQL Comparison 2026: Enterprise Solutions Evaluated, accuracy drops from 80%+ on single-table queries to 55-65% on queries requiring three or more joins.

Hallucinated columns and tables. LLMs sometimes generate SQL that references columns or tables that don’t exist. This is less common with well-tuned models, but it happens. The query fails at runtime, and the user sees an error instead of an answer.
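
One cheap guard is to check a generated query’s identifiers against the live schema before execution. The sketch below is deliberately crude: it tokenizes with a regex and carries a minimal SQL keyword list, where a real system would use a proper SQL parser. All table and column names are illustrative.

```python
import re
import sqlite3

def referenced_identifiers(sql: str) -> set:
    # Crude token extraction; a real system would use a SQL parser.
    return set(re.findall(r"[a-zA-Z_][a-zA-Z0-9_]*", sql.lower()))

def validate_sql(conn: sqlite3.Connection, sql: str) -> list:
    """Return identifiers in the query that match no known table or column."""
    # Deliberately minimal keyword list for the sketch.
    known = {"select", "from", "where", "group", "by", "and", "or", "sum",
             "count", "avg", "as", "order", "join", "on", "not", "null"}
    for (table,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ):
        known.add(table.lower())
        # Table names come from sqlite_master, so the f-string is safe here.
        for col in conn.execute(f"PRAGMA table_info({table})"):
            known.add(col[1].lower())  # col[1] is the column name
    return sorted(referenced_identifiers(sql) - known)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT, total_amount REAL)")

# 'revenue' does not exist in the schema: a hallucinated column.
print(validate_sql(
    conn, "SELECT customer_id, SUM(revenue) FROM orders GROUP BY customer_id"
))  # ['revenue']
```

Catching the hallucination before execution lets you re-prompt the model instead of surfacing a database error to the user.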

Context window constraints. Larger, more complex schemas don’t fit in a single prompt. You have to choose which tables and columns to include. Leave out important context, and the model generates incorrect queries. Include too much, and you exceed the context window or confuse the model with irrelevant information.

Semantic Layers vs. Text-to-SQL: The 2026 Reality

This is the critical decision point for teams building analytics platforms.

Semantic layers—pre-defined, curated collections of metrics, dimensions, and relationships—enable near-100% accuracy for natural-language queries. Why? Because the model isn’t inferring business logic from a schema. It’s selecting from pre-defined metrics.

Instead of asking a model to interpret a raw schema and generate SQL, you’re asking it to:

  1. Understand the user’s question
  2. Map it to pre-defined metrics and dimensions
  3. Generate a simple query using those definitions

This is fundamentally easier. The model can focus on intent, not schema archaeology.
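
The division of labor can be sketched in a few lines. The metric and dimension definitions below are hypothetical, not dbt’s actual Semantic Layer API; the point is that the SQL fragments are curated once and reused, so the model only chooses among them.

```python
# Hypothetical metric registry; names and SQL fragments are illustrative.
METRICS = {
    "revenue": "SUM(order_total_amount)",
    "order_count": "COUNT(DISTINCT order_id)",
}
DIMENSIONS = {
    "month": "strftime('%Y-%m', order_created_at)",
    "customer": "customer_id",
}

def compile_query(metric: str, dimension: str, table: str = "orders") -> str:
    """Compile a (metric, dimension) pair into SQL using curated definitions.

    The LLM only has to pick the metric and dimension; the SQL comes from
    the registry, so business logic is never inferred from the raw schema.
    """
    if metric not in METRICS or dimension not in DIMENSIONS:
        raise ValueError("unknown metric or dimension")
    return (
        f"SELECT {DIMENSIONS[dimension]} AS {dimension}, "
        f"{METRICS[metric]} AS {metric} "
        f"FROM {table} GROUP BY {dimension}"
    )

print(compile_query("revenue", "month"))
```

An unknown metric raises an error rather than producing a plausible-looking wrong query, which is exactly the failure mode raw text-to-SQL can’t rule out.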

Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update provides a comprehensive benchmark comparison. Semantic layers achieve 98%+ accuracy. Text-to-SQL reaches 64.5% on the same queries. That’s a 33+ percentage point gap.

But semantic layers have a cost: setup time. You need to define every metric, every dimension, and every relationship. For a small analytics team, this is overhead. For an enterprise with dozens of dashboards and hundreds of metrics, it’s worth it.

The hybrid approach is emerging as the 2026 standard: use text-to-SQL for exploratory, low-stakes queries where speed matters more than perfection. Use semantic layers for production metrics, KPIs, and customer-facing analytics where accuracy is non-negotiable.

The roundup Best Text to SQL Query Tools in 2026 - Comparison, Features, Benchmarks reviews leading platforms and notes that the most mature solutions now combine both approaches, with semantic layers as a fallback for critical queries.

Schema Design: The Underrated Lever

Here’s what most teams miss: text-to-SQL accuracy is less about the LLM and more about schema design.

A well-designed schema with clear naming, explicit relationships, and minimal ambiguity can push text-to-SQL accuracy to 90%+. A poorly designed schema will fail even with the best models.

On the BIRD benchmark, Text to SQL at 95% Accuracy: Do You NEED a Semantic Layer? demonstrates that LLMs can reach 95% accuracy when the data model is well named. The implication: spend time on schema design, and text-to-SQL becomes reliable.

What makes a schema text-to-SQL-friendly?

Explicit naming conventions. Use full names: customer_id, not cust_id. Use consistent prefixes: order_created_at, order_updated_at, not created, updated. Avoid abbreviations and domain-specific jargon that only humans understand.

Comments and documentation. Add SQL comments to tables and columns explaining what they contain and how they should be used. This context gets passed to the LLM and dramatically improves accuracy.

CREATE TABLE orders (
  order_id INT PRIMARY KEY COMMENT 'Unique identifier for each order',
  customer_id INT COMMENT 'Foreign key to customers table',
  created_at TIMESTAMP COMMENT 'Order creation date in UTC',
  status VARCHAR(50) COMMENT 'Order status: pending, completed, cancelled, refunded',
  total_amount DECIMAL(10,2) COMMENT 'Total order value in USD, including tax'
);

Normalized structure. Denormalization can improve query performance, but it confuses text-to-SQL models. Keep fact tables and dimension tables separate. Use foreign keys to establish relationships explicitly.

Materialized views for complex logic. If a calculation is complex—say, calculating “monthly active users” or “customer lifetime value”—create a materialized view. The model can then use that view directly instead of trying to reverse-engineer the logic.

Data validation and quality. Missing values, inconsistent enums, and data quality issues cause text-to-SQL failures. A NULL in a status column where the model expects ‘completed’ or ‘pending’ breaks assumptions. Invest in data quality, and text-to-SQL reliability improves.
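
The NULL problem is concrete: under SQL’s three-valued logic, a NULL status matches neither status = 'completed' nor status != 'completed', so rows silently vanish from both branches. A minimal SQLite demonstration with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "completed"), (2, "pending"), (3, None)],  # one NULL status
)

# A NULL status row matches neither branch of a status filter.
completed = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'completed'"
).fetchone()[0]
not_completed = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status != 'completed'"
).fetchone()[0]

# Three orders in, two orders out: one vanished from both counts.
print(completed, not_completed, completed + not_completed)  # 1 1 2
```

A generated query that assumes every row has a status will silently undercount, and no benchmark score warns you about it.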

Practical Implementation: When to Use Text-to-SQL

So when should you actually build text-to-SQL into your analytics stack?

Use text-to-SQL if:

  • You’re building exploratory analytics where speed and ease-of-use matter more than perfection
  • Your schema is small (under 50 tables) and well-designed
  • Your users are asking simple, straightforward questions
  • You can add a human review step (users see the query before results are executed)
  • You’re willing to invest in schema design and documentation
  • You have a fallback (semantic layer, manual SQL) for complex queries
  • Your use case is internal (mistakes are less costly than in customer-facing analytics)

Don’t use text-to-SQL if:

  • You need 99%+ accuracy (use semantic layers instead)
  • Your schema is large, complex, or poorly documented
  • You’re embedding analytics in a customer-facing product where errors damage trust
  • Your users are asking complex questions that require deep business logic
  • You don’t have the resources to maintain and improve the system
  • Your data quality is poor

For teams building on Apache Superset, the managed approach offers a middle ground. Rather than building text-to-SQL yourself, you get a platform that handles the infrastructure, model selection, and reliability concerns. This is especially valuable if you’re evaluating Superset against alternatives like Looker or Tableau—you get the flexibility of open-source with the reliability of a managed service.

The Role of Model Selection

Which LLM should you use for text-to-SQL?

The answer in 2026 is: it depends on your constraints.

Frontier models (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0) achieve the highest accuracy—typically 75-85% on well-designed schemas. They’re also the most expensive (per-token pricing) and have the longest latency (up to 10 seconds for complex queries).

Mid-tier models (Claude 3 Haiku, GPT-4 Turbo, Llama 3.1) achieve 65-75% accuracy at lower cost and faster latency. They’re a sweet spot for many production use cases.

Open-source models (Llama 3.1 70B, Mistral, Code Llama) can be self-hosted for cost control and latency reduction. Accuracy is 55-70%, but improving rapidly. If you’re running D23’s managed Superset, you have flexibility in model selection based on your requirements.

Specialized models trained specifically on SQL generation (like CodeLlama or Codex) outperform general-purpose models on text-to-SQL tasks. If accuracy is critical, consider a specialized model even if it’s less well-known.

The key insight: larger models don’t always mean better performance. A well-tuned mid-tier model with good prompting and schema context often outperforms a frontier model with poor prompting. This is where consulting expertise matters—teams that understand LLM behavior can squeeze 5-15% more accuracy out of the same models.

Prompting, Fine-Tuning, and In-Context Learning

The LLM is only one piece of the puzzle. How you prompt it matters as much as which model you choose.

Few-shot prompting (providing examples of good questions and corresponding SQL) improves accuracy by 10-20%. If you have 5-10 representative examples in your prompt, the model learns your schema patterns and generates better queries.

Chain-of-thought prompting (asking the model to explain its reasoning before generating SQL) improves accuracy by 5-10% and makes errors easier to debug. Instead of just outputting SQL, the model says: “The user is asking for orders from Q4. I need to filter by created_at >= ‘2025-10-01’ AND created_at < ‘2026-01-01’.”
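
Both techniques come down to prompt assembly. The builder below is a hypothetical sketch: the schema snippet, the single few-shot example, and the instruction wording are all illustrative, not a specific vendor’s format.

```python
# One illustrative few-shot example; production prompts would carry 5-10.
FEW_SHOT = [
    ("How many orders shipped last month?",
     "SELECT COUNT(*) FROM orders "
     "WHERE shipped_at >= date('now', 'start of month', '-1 month') "
     "AND shipped_at < date('now', 'start of month')"),
]

def build_prompt(schema_ddl: str, question: str) -> str:
    """Assemble schema context, few-shot examples, and a CoT instruction."""
    examples = "\n\n".join(
        f"Question: {q}\nSQL: {sql}" for q, sql in FEW_SHOT
    )
    return (
        "You translate questions into SQL for this schema:\n"
        f"{schema_ddl}\n\n"
        f"{examples}\n\n"
        # Chain-of-thought instruction: reason first, then emit SQL.
        "First explain, step by step, which tables, columns, and filters "
        "the question requires. Then output the SQL.\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt(
    "CREATE TABLE orders (order_id INT, shipped_at TEXT);",
    "How many orders shipped this week?",
)
print(prompt)
```

The few-shot pairs teach the model your schema’s date-filtering idiom; the step-by-step instruction surfaces the model’s reasoning so a reviewer can spot a bad filter before running the query.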

Schema context pruning is critical. You can’t pass your entire schema to the model—it’ll exceed context limits or confuse the model. Techniques like:

  • Embedding-based schema retrieval (find the most relevant tables for the question)
  • Keyword matching (search for tables/columns by name)
  • User history (tables the user has queried recently)

…all improve accuracy by reducing irrelevant context.
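
The keyword-matching variant is the simplest to sketch. The toy schema below is illustrative: each table is scored by token overlap with the question, and only the top-k tables go into the prompt. An embedding-based retriever would replace the score function.

```python
# Toy schema registry: table name -> column names (illustrative).
SCHEMA = {
    "orders": ["order_id", "customer_id", "created_at", "total_amount"],
    "customers": ["customer_id", "name", "signup_date"],
    "web_events": ["event_id", "page", "occurred_at"],
}

def relevant_tables(question: str, k: int = 2) -> list:
    """Rank tables by token overlap with the question; keep the top k."""
    words = set(question.lower().replace("?", "").split())

    def score(table: str) -> int:
        tokens = set(table.split("_"))
        for col in SCHEMA[table]:
            tokens.update(col.split("_"))
        return len(words & tokens)

    return sorted(SCHEMA, key=score, reverse=True)[:k]

print(relevant_tables("What is the total amount of orders per customer?"))
# ['orders', 'customers']
```

Only those two tables’ DDL would be passed to the model, keeping the prompt inside the context window and away from irrelevant tables like web_events.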

Fine-tuning on your specific schema and question patterns can improve accuracy by 10-30%, but requires significant data (hundreds to thousands of examples) and ongoing maintenance. It’s worth it for high-stakes applications; it’s overkill for exploratory analytics.

Cost and Performance Trade-offs

Text-to-SQL isn’t just about accuracy. It’s about the total cost of ownership.

A frontier model that achieves 85% accuracy might cost $0.10 per query (at typical pricing). A mid-tier model at 70% accuracy might cost $0.02. Over millions of queries, that’s a significant difference.

But accuracy also has a cost. A 70% accurate system might require human review of 30% of queries, which is expensive. An 85% accurate system requires review of 15%, which is cheaper despite higher per-query costs.
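
You can make this concrete with a back-of-the-envelope model. The $1.00 per-query human review cost below is an assumption for illustration; the model prices and accuracy figures come from the paragraphs above.

```python
# Expected cost per query = model cost + (error rate x review cost).
def cost_per_query(model_cost: float, accuracy: float,
                   review_cost: float = 1.00) -> float:
    # review_cost is an ASSUMED cost of a human fixing a wrong query.
    return model_cost + (1 - accuracy) * review_cost

frontier = cost_per_query(0.10, 0.85)  # 0.10 + 0.15 * 1.00 = 0.25
mid_tier = cost_per_query(0.02, 0.70)  # 0.02 + 0.30 * 1.00 = 0.32
print(f"frontier: ${frontier:.2f}, mid-tier: ${mid_tier:.2f}")
```

Under these assumptions the more accurate model is cheaper overall despite a 5x per-query price; with a cheaper review step, the balance flips. Run the numbers for your own review cost before choosing a tier.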

Latency is another factor. If text-to-SQL queries take 5-10 seconds, users will abandon the feature. If they take under 1 second, adoption is high. Faster models are typically cheaper and lower-latency, but less accurate.

The optimal configuration depends on your use case:

  • Exploratory analytics: Use fast, cheap models. Accept 65-70% accuracy. Users will fix incorrect queries.
  • Embedded analytics: Use accurate models. Spend more per query to ensure correctness.
  • Internal dashboards: Use mid-tier models with human review. Balance cost and accuracy.

Building Text-to-SQL Into Your Analytics Stack

If you decide text-to-SQL is right for your use case, how do you implement it?

Option 1: DIY. Build your own text-to-SQL system using an LLM API, prompt engineering, and schema context retrieval. This gives you maximum control but requires significant engineering effort. You’re responsible for model selection, prompt optimization, error handling, and reliability.

Option 2: Specialized tools. Use a dedicated text-to-SQL platform like Vanna, SQL Chat, or similar. These handle the infrastructure and provide pre-built integrations with common databases. You trade flexibility for ease of implementation.

Option 3: Managed platforms. Use a managed BI platform like D23 that includes text-to-SQL as a feature alongside dashboards, embedded analytics, and self-serve BI. You get text-to-SQL as part of a broader analytics solution, with expert support and reliability guarantees.

For teams evaluating options, the decision often comes down to: do you want to own the complexity, or do you want to outsource it? DIY gives you control but demands expertise. Managed platforms give you reliability but less customization.

The Future of Text-to-SQL (and When to Revisit This Decision)

Text-to-SQL is improving rapidly, but not in the way most people expect.

Accuracy isn’t the main bottleneck anymore. Models are good enough for many use cases. The bottlenecks are:

  1. Semantic understanding. Models still struggle with implicit business logic and domain-specific terminology.
  2. Reliability and observability. Production systems need to handle errors gracefully and provide visibility into why a query failed.
  3. Integration with existing tools. Text-to-SQL needs to work with your data warehouse, your BI platform, and your data governance tools.

The 2026 trend is toward hybrid systems that combine text-to-SQL, semantic layers, and human-in-the-loop workflows. Frontier models like GPT-4o and Claude 3.5 are being fine-tuned for specific industries and use cases, improving accuracy without requiring custom models.

AI agents that can iterate on queries, check results, and refine their approach are emerging as more reliable than one-shot text-to-SQL. Instead of generating a single query, the agent generates, executes, validates, and refines.
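
A minimal version of that loop fits in a few lines. Here llm_generate is a stub standing in for a real model call (its fix-on-error behavior is hard-coded for illustration); the generate-execute-retry structure is the point.

```python
import sqlite3
from typing import Optional

def llm_generate(question: str, error: Optional[str] = None) -> str:
    # Stub: the first attempt references a hallucinated column; on seeing
    # the database error, the "model" emits a corrected query.
    if error is None:
        return "SELECT SUM(revenue) FROM orders"
    return "SELECT SUM(total_amount) FROM orders"

def answer(conn: sqlite3.Connection, question: str, max_attempts: int = 3):
    """Generate, execute, and retry with the database error fed back."""
    error = None
    for _ in range(max_attempts):
        sql = llm_generate(question, error)
        try:
            return conn.execute(sql).fetchone()[0]
        except sqlite3.Error as exc:
            error = str(exc)  # feed the error back to the model and retry
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, total_amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])
print(answer(conn, "What is total revenue?"))  # 42.5
```

A production agent would also validate results (row counts, plausible ranges) before returning them, not just catch execution errors.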

If you’re building analytics infrastructure in 2026, plan for evolution. Your text-to-SQL approach in 2026 might be obsolete in 2027. Build systems that can swap models, add semantic layers, and incorporate new techniques without major rewrites.

Conclusion: Text-to-SQL in Context

Text-to-SQL is a powerful tool, but it’s not a silver bullet. It works well for exploratory analytics on well-designed schemas with simple questions. It fails on complex business logic, ambiguous schemas, and high-stakes use cases.

The 2026 reality: the gap between benchmark claims and production performance is real and significant. An 87% accuracy benchmark translates to 60-70% in production. Semantic layers achieve near-perfect accuracy but require upfront investment. Hybrid approaches combining text-to-SQL, semantic layers, and human review are becoming the standard.

For teams building analytics platforms—whether you’re evaluating D23’s managed Superset or building custom solutions—the decision to adopt text-to-SQL should be based on:

  1. Your schema quality and design
  2. The complexity of your business logic
  3. Your accuracy requirements
  4. Your tolerance for human review and iteration
  5. Your willingness to invest in ongoing optimization

Text-to-SQL will continue to improve. But in 2026, the real competitive advantage isn’t the model—it’s the engineering, schema design, and operational discipline around it. That’s where the best analytics platforms distinguish themselves from the rest.

Further Reading and Resources

For teams diving deeper into text-to-SQL implementation, D23’s Privacy Policy and Terms of Service outline how the managed platform handles your data and queries. If you’re evaluating text-to-SQL for embedded analytics or self-serve BI, start with a pilot on internal data before rolling it out to customers or critical workflows.