Guide April 18, 2026 · 18 mins · The D23 Team

Text-to-SQL Evals: How to Measure What's Actually Good

Learn to build production-grade eval harnesses for text-to-SQL systems. Metrics, benchmarks, and real-world strategies for measuring accuracy.

You’ve deployed a text-to-SQL system into production. Users ask questions in English. Your LLM translates them into SQL. Queries execute. Results come back. Life is good—until someone asks a slightly unusual question, the system hallucinates a JOIN, and your dashboard shows revenue numbers that are off by 40%.

This is why evals matter. Text-to-SQL is deceptively hard to evaluate. A query can be syntactically perfect, semantically reasonable, and still wrong in ways that matter to your business. This guide walks you through building a production-grade eval harness—the infrastructure and methodology you need to measure what’s actually good, catch regressions before they hit users, and know when your system is ready to scale.

Why Text-to-SQL Evals Are Different

Text-to-SQL evaluation isn’t like classifying images or grading sentiment. With image classification, you have a clear ground truth: the image either contains a cat or it doesn’t. With text-to-SQL, the problem is layered.

A single business question can be answered by multiple valid SQL queries. “How much revenue did we make last month?” could join sales and orders in different ways, aggregate at different granularities, or filter by different date ranges—all producing the same correct answer. Your eval framework needs to account for this ambiguity.

Moreover, SQL errors have real consequences that scale with your user base. When D23 helps teams embed analytics into their products, we see that a single incorrect query can poison an entire dashboard, mislead decision-makers, and erode trust in self-serve BI faster than you can deploy a fix. This is why measuring text-to-SQL quality before it reaches production isn’t optional—it’s foundational.

The stakes are higher in production environments. Unlike a chatbot that occasionally gives bad advice, an analytics system that returns incorrect numbers directly impacts business decisions. You need evals that catch errors reliably, measure quality across multiple dimensions, and give you confidence that your system is safe to scale.

The Core Problem: What Does “Correct” Even Mean?

Before you can measure anything, you need to define correctness. This is harder than it sounds.

Syntactic Correctness: Does the query parse? Does it have valid SQL syntax? This is the easiest bar to clear and almost useless on its own. A perfectly valid query can return the wrong data.

Execution Correctness: Does the query run without errors? It’s necessary but not sufficient. A query that runs successfully might still be semantically wrong—selecting the wrong columns, joining on the wrong keys, or filtering by the wrong conditions.

Semantic Correctness: Does the query answer the question that was asked? This is what you actually care about, and it’s the hardest to measure programmatically. A query might join tables correctly, filter appropriately, and aggregate properly—but if it’s answering a different question than what the user intended, it’s wrong.

Result Correctness: Does the query return the right numbers? Even if the SQL is semantically sound, execution issues (missing data, incorrect aggregations, floating-point rounding) can produce wrong results.

A robust eval framework measures all four, but with different tools and confidence levels. As Querio’s analysis of five key metrics for text-to-SQL accuracy explains, functional correctness and execution accuracy are foundational—but they’re not the whole story.

Building Your Eval Harness: The Three Layers

Production-grade text-to-SQL evals require three layers: a golden dataset, a set of metrics, and an execution environment.

Layer One: The Golden Dataset

Your eval harness lives and dies by the quality of your golden dataset. This is a collection of (question, ground-truth SQL, expected results) tuples that represent real user behavior and edge cases.

What makes a good golden dataset:

  • Coverage: Does it represent the full range of questions your users actually ask? Include simple aggregations, complex joins, window functions, subqueries, and edge cases. If your users ask about revenue by region for the last 90 days, your dataset should include that exact question—and variations like “last quarter,” “YTD,” and “last 90 rolling days.”

  • Authenticity: Are these real questions from real users, or synthetic examples that sound plausible? Real questions are better. They capture the linguistic patterns, domain language, and ambiguities that your system will actually encounter. If you’re building analytics for a SaaS company, questions should use terms like “MRR,” “churn,” and “CAC.” If you’re building for e-commerce, they should reference “SKU,” “conversion,” and “AOV.”

  • Scale: Start with 50–100 golden examples. As your system matures and you hit edge cases in production, add those cases to your golden dataset and retrain. This turns production failures into permanent improvements.

  • Diversity in difficulty: Include easy questions (“How many users signed up today?”), medium questions (“What’s the month-over-month growth in revenue by product?”), and hard questions (“Which cohorts have declining engagement, and what’s the correlation with feature releases?”). This helps you understand where your system breaks.

Building the golden dataset in practice:

Start by collecting real user queries from your application logs or analytics platform. Remove personally identifiable information. Group queries by type: aggregations, joins, filters, time-series, cohort analysis, etc. For each query, have a domain expert (ideally someone who understands both the data and the business) write the correct SQL and verify the results manually.

Store your golden dataset in a versioned format—JSONL, CSV, or a database table. Each record should include:

{
  "question": "What's our monthly recurring revenue for enterprise customers?",
  "expected_sql": "SELECT DATE_TRUNC('month', created_at) as month, SUM(mrr) FROM subscriptions WHERE plan_type = 'enterprise' GROUP BY 1 ORDER BY 1 DESC",
  "expected_result": [["2024-01-01", 125000], ["2023-12-01", 118000], ...],
  "difficulty": "medium",
  "category": "time_series",
  "notes": "Includes only active subscriptions; excludes trials"
}

Version this dataset. When you add new examples, increment the version. This lets you track how your eval set evolves and compare model performance across versions.
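
To make this practical, here is a minimal loader sketch that validates each JSONL record on the way in. The required-field set is an assumption based on the record format above; adjust it to your own schema:

```python
import json

# Fields every golden record must carry (an assumption; tune to your format).
REQUIRED_FIELDS = {"question", "expected_sql", "difficulty", "category"}

def load_golden_dataset(path: str) -> list:
    """Load and validate a JSONL golden dataset, one record per line."""
    records = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            records.append(record)
    return records
```

Failing loudly on malformed records keeps a bad merge from silently shrinking your eval coverage.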

Layer Two: The Metrics

Once you have a golden dataset, you need metrics that measure different aspects of correctness. No single metric is sufficient.

Execution Accuracy: Does the query run without errors? What percentage of generated queries execute successfully on your database?

Execution Accuracy = (Queries that execute without errors) / (Total queries)

This is a floor. If your execution accuracy is below 90%, you have a fundamental problem. But high execution accuracy doesn’t mean your results are correct.

Result Correctness: Does the query return the right numbers? Compare the generated query’s results to the ground-truth results using exact match, fuzzy match, or statistical distance.

For numeric results, you might allow a small margin of error (e.g., ±1% for floating-point calculations). For categorical results, exact match is appropriate. Defog’s guide on text-to-SQL evaluation emphasizes that result comparison is the most business-relevant metric—it’s what users actually see.

Result Correctness = (Queries returning correct results) / (Queries that executed)
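
A sketch of a result comparator implementing the ±1% numeric tolerance described above. The tolerance value and the row-by-row (order-sensitive) comparison are assumptions to tune for your own data:

```python
def values_match(a, b, rel_tol: float = 0.01) -> bool:
    """Compare two scalar values, allowing a relative tolerance for numbers."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        if a == b:
            return True
        denom = max(abs(a), abs(b))
        return denom > 0 and abs(a - b) / denom <= rel_tol
    return a == b  # exact match for strings, dates, etc.

def results_match(expected, actual, rel_tol: float = 0.01) -> bool:
    """Row-by-row comparison of two result sets with numeric tolerance."""
    if len(expected) != len(actual):
        return False
    return all(
        len(er) == len(ar) and all(values_match(e, a, rel_tol) for e, a in zip(er, ar))
        for er, ar in zip(expected, actual)
    )
```

If row order is not part of the question, sort both result sets before comparing.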

Semantic Correctness: Does the generated SQL match the intended meaning of the question? This is harder to measure automatically, but you can use a few techniques:

  • Exact Match: Does the generated SQL match the ground-truth SQL exactly? This is too strict—multiple queries can be semantically equivalent. But it’s a useful baseline.

  • Logical Equivalence: Do the two queries return the same results? Run both queries and compare results. If they match, they’re logically equivalent (at least for your data). This is more lenient than exact match but still meaningful.

  • LLM-as-Judge: Use a separate LLM (e.g., GPT-4) to evaluate whether the generated SQL correctly interprets the question. As Arize’s research on using LLMs as judges for SQL generation demonstrates, this approach can be surprisingly effective when combined with structured prompts and F1 scoring.
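
The logical-equivalence check above can be as simple as running both queries and comparing sorted result sets. A minimal sketch using Python's built-in sqlite3 for illustration (substitute your own engine and data):

```python
import sqlite3

def logically_equivalent(conn, sql_a: str, sql_b: str) -> bool:
    """Empirical equivalence: both queries return the same rows on this data.

    Sorting makes the check order-insensitive; drop the sort when row order
    is part of the question (e.g. "top 5 regions by revenue").
    """
    rows_a = sorted(conn.execute(sql_a).fetchall())
    rows_b = sorted(conn.execute(sql_b).fetchall())
    return rows_a == rows_b

# Tiny illustrative dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 100.0), ("west", 50.0), ("east", 25.0)])

# Two different-looking queries that answer the same question.
q1 = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
q2 = "SELECT o.region, SUM(o.amount) AS total FROM orders o GROUP BY o.region"
```

Remember the caveat from the bullet above: equivalence on one dataset is not a proof of equivalence in general.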

Schema Fidelity: Does the generated query use the correct tables and columns? You can measure this by parsing the SQL and checking whether referenced tables and columns exist in your schema and are semantically appropriate for the question.

Schema Fidelity = (Queries using correct tables/columns) / (Total queries)

As Snowflake’s work on text-to-SQL accuracy shows, achieving 90%+ accuracy requires careful attention to schema representation and flexible matching strategies.
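
A deliberately naive sketch of schema fidelity scoring, using a regex to pull table names after FROM/JOIN. A real implementation should use a SQL parser (e.g. sqlparse or sqlglot) to handle subqueries, quoting, aliases, and schema-qualified names:

```python
import re

def referenced_tables(sql: str) -> set:
    """Naive extraction of table names following FROM or JOIN keywords."""
    return {t.lower() for t in re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE)}

def schema_fidelity(sql: str, known_tables: set) -> float:
    """Fraction of referenced tables that actually exist in the schema."""
    refs = referenced_tables(sql)
    if not refs:
        return 0.0  # nothing recognizable referenced; treat as a failure
    return len(refs & known_tables) / len(refs)
```

Column-level fidelity follows the same pattern but needs a real parser, since column references depend on aliases and join context.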

Latency: How long does it take to generate the SQL and execute it? For production systems, this matters. If generating a single query takes 10 seconds, your users will abandon your product.

Latency = (Time to generate SQL) + (Time to execute SQL)

Track this separately for generation and execution. If generation is slow, you need a faster model or better prompt engineering. If execution is slow, you need query optimization or better database indexing.
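
A small timing helper makes it easy to track the two phases separately. The `generate_sql` and `execute_sql` names in the comment are hypothetical placeholders for your own functions:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Track each phase separately so you know whether to optimize the model
# or the database (hypothetical functions):
#   sql, generation_latency = timed(generate_sql, question)
#   rows, execution_latency = timed(execute_sql, sql)
```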

Token Efficiency: How many tokens does it take to generate the correct SQL? This affects cost and latency. Measure average tokens per query, tokens per correct query, and tokens wasted on failed attempts.

Building the Evaluation Framework

Now that you understand the layers, here’s how to implement them.

Step 1: Set Up Your Execution Environment

You need a test database that mirrors your production schema but with safe, anonymized data. This is non-negotiable. Never run evals against production data.

Create a test database with:

  • Same schema as production: Same tables, columns, data types, relationships, and constraints.
  • Representative data: Enough data to make queries meaningful. If your production database has 100M rows, your test database should have at least 1M rows—enough to test aggregations and joins without being unwieldy.
  • Known ground truth: For your golden dataset examples, you should be able to manually verify results. This might mean using a smaller dataset where you can validate results by hand, or using a separate query engine (e.g., DuckDB) to verify results.
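
Here is an illustrative setup using Python's built-in sqlite3 (substitute DuckDB or your actual engine); the point is that the data is small enough to verify the ground truth by hand:

```python
import sqlite3

# Tiny in-memory test database whose ground truth is hand-checkable;
# a real harness would load an anonymized snapshot instead.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        plan_type TEXT NOT NULL,
        mrr REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO subscriptions (plan_type, mrr) VALUES (?, ?)",
    [("enterprise", 5000.0), ("enterprise", 3000.0), ("starter", 99.0)],
)

# Ground truth verifiable by hand: 5000 + 3000 = 8000.
total = conn.execute(
    "SELECT SUM(mrr) FROM subscriptions WHERE plan_type = 'enterprise'"
).fetchone()[0]
```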

Step 2: Implement the Metrics

Build a Python module that computes each metric. Here’s a sketch:

import json
from typing import Any, Callable, Dict, List

import sqlparse

class TextToSQLEvaluator:
    def __init__(self, db_connection, golden_dataset_path: str):
        self.db = db_connection
        self.golden_dataset = self._load_golden_dataset(golden_dataset_path)
    
    def _load_golden_dataset(self, path: str) -> List[Dict[str, Any]]:
        """Load golden examples from a JSONL file, one record per line."""
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    
    def evaluate_query(self, question: str, generated_sql: str, ground_truth_sql: str) -> Dict[str, Any]:
        """Evaluate a single generated query against ground truth."""
        result = {
            "question": question,
            "generated_sql": generated_sql,
            "ground_truth_sql": ground_truth_sql,
            "metrics": {}
        }
        
        # Execution correctness
        generated_executable = self._is_executable(generated_sql)
        result["metrics"]["execution_correctness"] = 1.0 if generated_executable else 0.0
        
        if not generated_executable:
            return result
        
        # Result correctness
        try:
            generated_results = self.db.execute(generated_sql).fetchall()
            ground_truth_results = self.db.execute(ground_truth_sql).fetchall()
            result["metrics"]["result_correctness"] = 1.0 if generated_results == ground_truth_results else 0.0
        except Exception as e:
            result["metrics"]["result_correctness"] = 0.0
            result["error"] = str(e)
        
        # Semantic correctness (exact match after normalization)
        result["metrics"]["exact_match"] = 1.0 if self._normalize_sql(generated_sql) == self._normalize_sql(ground_truth_sql) else 0.0
        
        # Schema fidelity
        result["metrics"]["schema_fidelity"] = self._check_schema_fidelity(generated_sql)
        
        return result
    
    def _is_executable(self, sql: str) -> bool:
        """Check if SQL can be executed without errors.
        
        Note: this actually runs the query, so only point it at a read-only
        test database. On engines that support it, prefixing with EXPLAIN
        is a cheaper and safer check.
        """
        try:
            self.db.execute(sql)
            return True
        except Exception:
            return False
    
    def _normalize_sql(self, sql: str) -> str:
        """Normalize SQL for comparison (whitespace and keyword casing)."""
        return sqlparse.format(sql, keyword_case="lower", strip_whitespace=True).strip()
    
    def _check_schema_fidelity(self, sql: str) -> float:
        """Check if the query uses valid tables and columns.
        
        Placeholder: parse the SQL, extract table/column references, and
        score them against your schema catalog (0.0 to 1.0).
        """
        return 1.0  # TODO: implement against your schema
    
    def evaluate_dataset(self, generate_sql: Callable[[str], str]) -> Dict[str, float]:
        """Evaluate every golden example, producing SQL with `generate_sql`."""
        results = []
        for example in self.golden_dataset:
            generated_sql = generate_sql(example["question"])
            results.append(self.evaluate_query(
                example["question"],
                generated_sql,
                example["expected_sql"]
            ))
        
        # Aggregate metrics; queries that failed earlier stages count as 0.
        metric_names = ["execution_correctness", "result_correctness", "exact_match", "schema_fidelity"]
        return {
            name: sum(r["metrics"].get(name, 0.0) for r in results) / len(results)
            for name in metric_names
        }
This is a starting point. Extend it based on your needs.

Step 3: Integrate with Your CI/CD Pipeline

Run your eval harness on every code change. If a change causes metrics to degrade, fail the build.

# Example GitHub Actions workflow
name: Text-to-SQL Evals
on: [push, pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run text-to-SQL evals
        run: python eval_harness.py --golden-dataset golden_dataset.jsonl
      - name: Check thresholds
        run: |
          python check_thresholds.py \
            --execution-accuracy 0.95 \
            --result-correctness 0.90 \
            --schema-fidelity 0.98

Define minimum thresholds for each metric. If you’re at 92% result correctness and a PR drops you to 89%, you’ll know immediately.
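
One way the check_thresholds.py referenced above might look; the core is a pure function you can unit-test, with the CLI wiring left as a comment sketch:

```python
def check_thresholds(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that fall below their minimums."""
    return sorted(
        name for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    )

# CLI wiring (sketch): parse --execution-accuracy etc. with argparse,
# load the latest eval results from JSON, then:
#
#   failures = check_thresholds(metrics, thresholds)
#   if failures:
#       print(f"FAILED thresholds: {failures}")
#       sys.exit(1)   # non-zero exit fails the CI job
```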

Advanced Evaluation Techniques

As your system matures, you’ll need more sophisticated eval strategies.

Cohort Analysis

Not all queries are equally important. Break your golden dataset into cohorts by difficulty, category, or business importance. Track metrics separately for each cohort.

Example cohorts:

  • Simple aggregations (easy): Single-table queries with GROUP BY. Target: 99% result correctness.
  • Multi-table joins (medium): Queries joining 2–3 tables. Target: 95% result correctness.
  • Complex queries (hard): Queries with subqueries, window functions, or CTEs. Target: 85% result correctness.

This helps you understand where your system struggles and prioritize improvements.
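
Computing per-cohort metrics is a small grouping step over per-example results, assuming each result carries its cohort label alongside its metrics:

```python
from collections import defaultdict

def cohort_metrics(results: list) -> dict:
    """Average result correctness per cohort.

    Each result is assumed to be a dict with a "category" label and a
    "metrics" mapping, similar to the evaluator output in this guide.
    """
    by_cohort = defaultdict(list)
    for r in results:
        by_cohort[r["category"]].append(r["metrics"].get("result_correctness", 0.0))
    return {cohort: sum(scores) / len(scores) for cohort, scores in by_cohort.items()}
```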

Adversarial Testing

Generate edge cases and adversarial examples that are likely to break your system:

  • Ambiguous column names: Tables with columns that have similar names. “user_id” vs. “customer_id.”
  • Temporal edge cases: “Last month” vs. “Last 30 days.” “Q1” vs. “Jan-Mar.”
  • Null handling: Queries that should exclude NULLs vs. queries that should include them.
  • Rounding and precision: Queries that require specific precision or rounding rules.

Add these to your golden dataset and track performance separately.

Regression Testing

When you fix a bug or improve your system, add the failing case to your golden dataset. This prevents regressions. Over time, your golden dataset becomes a comprehensive test suite.
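
A helper like this turns a production failure into a permanent golden example. The field names follow the record format shown earlier in this guide; adjust them to yours:

```python
import json

def add_regression_case(path: str, question: str, expected_sql: str, notes: str = "") -> None:
    """Append a production failure to the golden dataset as a regression case."""
    record = {
        "question": question,
        "expected_sql": expected_sql,
        "difficulty": "regression",
        "category": "regression",
        "notes": notes,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```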

Production Monitoring

Your evals should extend into production. Log every query generated, every result returned, and (when possible) user feedback. Use this data to:

  • Identify failing cases: When a user reports an incorrect result, add that case to your golden dataset.
  • Monitor drift: If your production accuracy is dropping, you’ll know.
  • Prioritize improvements: Focus on the types of queries that fail most often in the real world.

Choosing the Right Metrics for Your Use Case

Different organizations need different metrics. Evaluating text-to-SQL systems requires understanding why syntax is just the beginning, and the same principle applies to choosing metrics.

If you’re embedding analytics in your product (like teams using D23’s embedded analytics capabilities), you care most about result correctness and latency. Users don’t care if the SQL is elegant—they care that the numbers are right and the dashboard loads fast.

If you’re building an internal analytics tool for a data team, you might care more about semantic correctness and explainability. Data analysts want to understand what the system did and why.

If you’re evaluating whether to adopt a managed solution like D23 for Apache Superset hosting and AI-powered analytics, look at how they measure and monitor these metrics in production. Ask about their eval harness, their golden dataset size, and their SLAs for accuracy.

Common Pitfalls and How to Avoid Them

Pitfall 1: Golden Dataset Too Small

You start with 20 examples. Your system gets 95% accuracy. You ship it. Users immediately find cases you didn’t cover. You need at least 50–100 diverse examples before you can trust your evals. As you grow, expand to 500+ examples across different cohorts.

Pitfall 2: Only Testing Happy Paths

Your golden dataset has mostly straightforward queries. Your system handles them well. But users ask weird questions, use domain-specific jargon, and request edge cases. Your evals don’t catch these because they’re not in your dataset. Actively solicit edge cases from users and add them to your golden dataset.

Pitfall 3: Metrics That Don’t Correlate with User Satisfaction

You’re optimizing for exact match (syntactic similarity). But users care about result correctness. You get high exact match scores but low result correctness. Focus on metrics that matter to your business.

Pitfall 4: Not Tracking Latency

Your text-to-SQL system is 99% accurate but takes 30 seconds to generate a query. Users hate it. You need to track latency alongside accuracy. Set SLAs: “Result correctness ≥ 90% AND latency ≤ 5 seconds.”

Pitfall 5: Evals Divorced from Production

Your evals are great, but they don’t reflect real user behavior. You’re testing against a clean database with perfect data. Production has edge cases, missing values, and data quality issues. Run evals against anonymized snapshots of production data (never production itself) and add production failures to your golden dataset.

Integrating Evals into Your Analytics Stack

When you’re building on Apache Superset with AI-powered analytics, your eval harness should integrate with your broader analytics infrastructure.

Log every query: Capture the user’s question, the generated SQL, the execution time, the result, and any errors. This becomes your production dataset for monitoring and improvement.

Track metrics over time: Store eval results in a time-series database. Plot execution accuracy, result correctness, and latency over time. This helps you spot regressions and measure the impact of improvements.

Correlate with user feedback: When users report issues, log them and correlate with your eval metrics. “This query was flagged by evals as potentially incorrect, and a user reported it as wrong.” This validates your evals.

Automate retraining: When new failures appear in production, add them to your golden dataset and retrain your model. This closes the feedback loop.

Benchmarking Against Existing Standards

Don’t start from scratch. The research community has established benchmarks and datasets for text-to-SQL evaluation. Familiarize yourself with them.

Spider: A large-scale, human-annotated dataset with roughly 10,000 questions across 200 databases spanning 138 domains. It’s the de facto standard for benchmarking text-to-SQL systems. If your system performs well on Spider, it’s likely to perform well on real-world queries.

BIRD: A more challenging dataset with 12,000 questions and 95 databases, designed to test systems on real-world complexity.

SFT Eval: Snowflake’s evaluation framework, which includes diverse query types and flexible matching strategies.

You can use these benchmarks to:

  • Baseline your system: How does your approach compare to published results?
  • Identify weak spots: If you perform poorly on a specific query type in Spider, you know where to improve.
  • Track progress: Rerun benchmarks as you improve your system.

The SQL evaluation leaderboard on Hugging Face aggregates results from multiple text-to-SQL systems on standard benchmarks, giving you a sense of the current state of the art.

The Role of Formal Verification

Emerging techniques like formal verification for text-to-SQL evaluation offer additional rigor. These approaches use theorem provers and constraint solvers to verify that a generated query is semantically equivalent to a reference query, not just empirically equivalent.

This is overkill for most production systems, but it’s worth understanding. Formal verification can catch subtle bugs that empirical testing misses, and it’s becoming more practical as tools improve.

Scaling Your Eval Harness

As your system grows, your eval harness needs to scale too.

Parallel execution: Run queries in parallel to speed up evals. Use a connection pool to avoid overwhelming your database.

Distributed testing: If you have multiple database replicas, run evals against all of them to catch database-specific issues.

Sampling: If your golden dataset grows to thousands of examples, you might not run all of them on every build. Sample strategically: always run critical examples, sample others randomly, and run the full suite nightly.
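
A sampling sketch along these lines; the "critical" flag is an assumption, so tag whichever examples must never be skipped:

```python
import random

def sample_eval_set(examples: list, sample_rate: float = 0.2, seed: int = 0) -> list:
    """Always keep critical examples; randomly sample the rest.

    Assumes each example may carry a "critical": True flag. The full
    suite should still run on a schedule (e.g. nightly).
    """
    rng = random.Random(seed)  # seeded for reproducible CI runs
    critical = [e for e in examples if e.get("critical")]
    rest = [e for e in examples if not e.get("critical")]
    k = max(1, int(len(rest) * sample_rate)) if rest else 0
    return critical + rng.sample(rest, k)
```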

Caching: Cache query results to avoid re-executing identical queries. This speeds up evals significantly.

Building an Eval Culture

Evals are only useful if your team uses them. Make evals a core part of your development process.

  • Make evals easy to run: One command should run your entire eval suite.
  • Make results visible: Display eval results in your CI/CD pipeline, in dashboards, and in team communications.
  • Celebrate improvements: When someone improves eval metrics, acknowledge it. When metrics regress, investigate and fix it.
  • Involve the whole team: Data engineers should contribute test cases. Product managers should define success metrics. Engineers should implement the harness.

When you’re working with a managed solution like D23’s AI-powered analytics platform, ask about their eval practices. How do they measure accuracy? What’s their golden dataset size? How do they catch regressions? A vendor with a rigorous eval culture is more trustworthy.

Putting It All Together: A Practical Example

Let’s say you’re building a text-to-SQL system for a SaaS company. Your golden dataset has 100 examples covering:

  • 30 simple aggregations (revenue by month, user count by region, etc.)
  • 40 multi-table joins (customer lifetime value, churn analysis, etc.)
  • 20 complex queries (cohort analysis, window functions, etc.)
  • 10 edge cases (null handling, rounding, temporal ambiguity)

You run your eval harness and get:

  • Execution correctness: 98% (2 queries have syntax errors)
  • Result correctness: 89% (11 queries return wrong results)
  • Exact match: 65% (many semantically equivalent alternatives)
  • Schema fidelity: 96% (4 queries reference wrong columns)

You dig into the failures:

  • The 2 syntax errors are both in complex queries with CTEs. You improve your prompt to emphasize CTE syntax.
  • Of the 11 result correctness failures, 6 are in the “multi-table joins” cohort. You add more join examples to your golden dataset and fine-tune your model on join patterns.
  • The 4 schema fidelity failures are all related to ambiguous column names. You improve your schema representation to include column descriptions and usage examples.

You rerun evals:

  • Execution correctness: 99%
  • Result correctness: 94%
  • Schema fidelity: 99%

You set these as your SLAs. Every PR that degrades these metrics fails CI. You’ve built a safety net.

Conclusion: Evals as Your North Star

Text-to-SQL evaluation is not a one-time activity. It’s an ongoing practice that grows with your system. Start with a small golden dataset and simple metrics. Run evals on every code change. Add production failures to your golden dataset. Iterate.

Over time, your eval harness becomes your north star. It tells you whether your system is ready to scale, where to focus improvements, and when you’ve broken something. It gives you confidence that your analytics are correct.

When you’re evaluating text-to-SQL platforms or building your own, ask hard questions about evals. How do they measure accuracy? What’s their golden dataset? How do they catch regressions? How do they monitor production? A vendor or team with rigorous eval practices is one you can trust.

For teams building on Apache Superset with D23’s managed platform, comprehensive eval practices are built in. You get production-grade monitoring, AI-powered query generation with transparent accuracy metrics, and expert consulting to help you build evals that matter. This is how you move from “does it work?” to “how good is it?” and ultimately to “can we scale it?”

Start measuring today. Your future self—and your users—will thank you.