Orchestrating Data Discovery Agents Across Multiple Warehouses
Learn how to build multi-agent data discovery systems spanning multiple warehouses. Architecture, patterns, and implementation guide for analytics leaders.
The Multi-Warehouse Data Discovery Problem
Your organization doesn’t live in a single data warehouse anymore. You’ve got Snowflake handling transactional data, BigQuery running analytics workloads, and a data lake in S3 collecting raw events. Your finance team needs to cross-reference metrics from three different sources. Your product team wants to explore customer behavior without waiting for a data analyst to write a custom query. Meanwhile, your platform is growing fast enough that you’re embedding analytics directly into your product—and your customers want to discover and query data across all your systems without knowing SQL.
This is the orchestrating data discovery agents problem. It’s not about moving all your data into one place (that’s often impractical and expensive). It’s about creating a coordinated layer of AI agents that can intelligently navigate, understand, and query across your distributed data infrastructure without requiring a PhD in your schema.
Traditional BI tools like Looker and Tableau assume a relatively static, well-documented data model sitting in one place. They work well when your data architecture is simple. But at scale—especially when you’re embedding analytics into your product or managing complex data for portfolio companies—you need something more flexible. You need agents that can discover what data exists, understand relationships across systems, route queries intelligently, and handle the operational complexity of multi-warehouse environments.
This is where orchestrated data discovery agents become essential. Unlike a single monolithic BI tool, a multi-agent system can be built on Apache Superset with AI capabilities, API-first architecture, and the flexibility to integrate with your actual data infrastructure. Let’s explore how to design, build, and operate these systems.
Understanding Data Discovery Agents and Orchestration
Before diving into architecture, let’s define what we’re actually building.
A data discovery agent is an AI-powered system component that can autonomously explore, catalog, and query data sources. Unlike a static data dictionary, an agent actively understands your schema, can reason about relationships between tables, and can respond to natural language requests. Think of it as a data analyst who knows every table in your warehouse and can instantly answer “what’s our monthly churn rate?” without you having to specify exactly which tables to join.
Orchestration means coordinating multiple agents across different data sources, ensuring they work together coherently, and managing the complexity of routing queries to the right system. In a multi-warehouse environment, orchestration is the difference between having five disconnected agents and having a unified data discovery experience.
The core challenge is this: when a user asks a question, which agent should handle it? How do you ensure consistency when the same metric might exist in multiple warehouses with different definitions? How do you manage authentication, rate limiting, and cost when agents are querying across multiple cloud providers?
Recent enterprise architecture work converges on the same blueprint: a centralized data registry that lets agents discover datasets across multi-modal sources. This isn’t theoretical—it’s the pattern that’s emerging as organizations scale from single-warehouse BI to distributed analytics.
The Architecture of Multi-Warehouse Data Discovery
A production-grade multi-warehouse data discovery system has several layers:
The Data Registry Layer
Every agent needs to know what data exists. In a single warehouse, this is straightforward—you query the information schema. Across multiple warehouses, you need a centralized registry.
This registry should contain:
- Warehouse metadata: Connection details, type (Snowflake, BigQuery, Redshift), region, access patterns
- Dataset catalogs: Table names, schemas, row counts, last-updated timestamps for each warehouse
- Semantic layer: Business-friendly definitions of metrics, dimensions, and relationships
- Data lineage: Which tables feed into which dashboards, which sources are authoritative for specific metrics
- Access control mappings: Which users can query which warehouses and datasets
The registry isn’t static. It needs to be continuously updated as schemas change, new tables are added, and data quality metrics shift. This is where data orchestration platforms coordinate flows across data warehouses and multi-system environments—they maintain the ground truth about what data exists and where.
In practice, you might implement this as a combination of:
- A metadata database (PostgreSQL, managed cloud database) storing warehouse connections and dataset definitions
- Automated schema crawlers running nightly to detect schema changes
- A semantic layer (often built into or alongside your BI tool) defining business metrics
- An API exposing this registry to your agents
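The pieces above can be sketched as a minimal metadata store. The table and column names here are illustrative assumptions, not a prescribed schema, and SQLite stands in for the PostgreSQL or managed database you would actually use:

```python
import sqlite3

# Minimal registry schema: warehouses, datasets, and metric definitions.
# Names are illustrative, not a standard.
DDL = """
CREATE TABLE warehouses (
    id TEXT PRIMARY KEY,
    type TEXT NOT NULL,               -- 'snowflake', 'bigquery', ...
    region TEXT,
    cost_per_query_estimate REAL,
    freshness_sla_minutes INTEGER
);
CREATE TABLE datasets (
    id TEXT,
    warehouse_id TEXT REFERENCES warehouses(id),
    schema_name TEXT,
    table_name TEXT,
    row_count INTEGER,
    last_updated TEXT,
    PRIMARY KEY (id, warehouse_id)    -- the same logical dataset may live in several warehouses
);
CREATE TABLE metrics (
    id TEXT PRIMARY KEY,
    definition TEXT,                  -- canonical SQL fragment for the metric
    primary_warehouse TEXT REFERENCES warehouses(id)
);
"""

def init_registry(path=":memory:"):
    """Create an empty registry database with the schema above."""
    conn = sqlite3.connect(path)
    conn.executescript(DDL)
    return conn
```

The nightly schema crawlers then become simple upserts into `datasets`, and the semantic layer reads from `metrics`.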
The Agent Dispatch Layer
When a user or application requests data, the dispatch layer decides which agent(s) should handle the query.
This isn’t a simple lookup. It requires reasoning about:
- Data availability: Does the requested dataset exist in warehouse A or warehouse B (or both)?
- Freshness requirements: If you need real-time data, can you query the transactional warehouse, or do you need to wait for the analytics warehouse to sync?
- Cost optimization: For exploratory queries, route to the cheaper warehouse. For critical reporting, route to the more reliable one.
- Query complexity: Some warehouses handle complex nested queries more gracefully than others, and pricing models (like BigQuery’s slot-based billing) affect which engine is economical for a given query shape.
- User permissions: Route only to warehouses the user has access to.
A dispatch router might look like:
User Query → Parse Intent → Identify Required Datasets → Check Availability & Permissions → Select Optimal Warehouse(s) → Route to Agent(s) → Aggregate Results
This is similar to NVIDIA’s multi-agent warehouse AI command layer architecture, which orchestrates specialized agents across warehouse systems for real-time intelligence. Each agent is specialized for a particular warehouse or data domain, and the command layer routes work intelligently.
The Query Generation and Execution Layer
Once routed to the right warehouse, the agent needs to translate the user’s intent into SQL (or the native query language).
This is where text-to-SQL capabilities shine. An AI model that’s been fine-tuned on your specific schema can understand natural language like “show me our top 10 customers by revenue this quarter” and generate the correct SQL:
SELECT
    customer_id,
    customer_name,
    SUM(order_value) AS total_revenue
FROM orders
WHERE order_date >= DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY customer_id, customer_name
ORDER BY total_revenue DESC
LIMIT 10
The key is that the model understands your schema, your naming conventions, and your business logic. A generic LLM will fail. You need domain-specific fine-tuning.
Query execution in a multi-warehouse environment requires:
- Connection pooling: Maintain efficient connections to each warehouse
- Query timeout management: Different warehouses have different timeout tolerances
- Cost tracking: Log which warehouse executed which query and how much it cost
- Error handling: If a query fails in one warehouse, should you retry in another? When?
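A thin execution wrapper can cover the timeout, cost-tracking, and fallback concerns above in one place. This is a sketch under assumptions: each warehouse client is a hypothetical object exposing `.id`, `.cost_per_query_estimate`, and a blocking `.run(sql)` method:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as QueryTimeout

def execute_with_fallback(sql, warehouses, timeout_s=30.0, cost_log=None):
    """Run `sql` against warehouses in priority order.

    Stuck queries are abandoned after `timeout_s`, failures fall
    through to the next warehouse, and cost estimates for executed
    queries are appended to `cost_log`.
    """
    errors = []
    for wh in warehouses:
        pool = ThreadPoolExecutor(max_workers=1)
        start = time.monotonic()
        future = pool.submit(wh.run, sql)
        try:
            result = future.result(timeout=timeout_s)
        except QueryTimeout:
            errors.append((wh.id, "timeout"))
            pool.shutdown(wait=False)
            continue
        except Exception as exc:  # warehouse-side failure: try the next one
            errors.append((wh.id, str(exc)))
            pool.shutdown(wait=False)
            continue
        pool.shutdown(wait=False)
        if cost_log is not None:
            cost_log.append({
                "warehouse": wh.id,
                "seconds": round(time.monotonic() - start, 3),
                "estimated_cost": wh.cost_per_query_estimate,
            })
        return result
    raise RuntimeError(f"all warehouses failed: {errors}")
```

Whether a failed query should retry on another warehouse is a policy decision; here it always falls through, which is reasonable for reads but not for anything with side effects.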
The Results Aggregation and Caching Layer
When you’re querying across multiple warehouses, you might get results from several agents. You need to:
- Deduplicate: If the same metric exists in multiple warehouses with the same definition, do you return both results or reconcile them?
- Aggregate: Can you combine results from multiple warehouses (e.g., total revenue across all regions, where each region is in a different warehouse)?
- Cache: For expensive queries that won’t change frequently, cache results to avoid repeated warehouse hits
- Validate: Ensure results are consistent with expected ranges and patterns
Caching is particularly important in multi-warehouse setups. A query that costs $5 to run in BigQuery might cost $50 if you re-run it for every user. Intelligent caching—based on query patterns, user roles, and data freshness requirements—can dramatically reduce costs.
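A minimal cache for this can key on the normalized query text plus the target warehouse, with a per-entry TTL. This standard-library sketch is illustrative; a production setup would typically back it with Redis and fold user role and freshness tier into the key:

```python
import hashlib
import time

class QueryCache:
    """In-memory result cache with a per-entry time-to-live (seconds)."""

    def __init__(self, default_ttl=300):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (expires_at, result)

    @staticmethod
    def _key(sql, warehouse_id):
        # Normalize whitespace and case so trivially different query
        # strings hit the same cache entry.
        normalized = " ".join(sql.split()).lower()
        return hashlib.sha256(f"{warehouse_id}:{normalized}".encode()).hexdigest()

    def get(self, sql, warehouse_id):
        entry = self._store.get(self._key(sql, warehouse_id))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None

    def put(self, sql, warehouse_id, result, ttl=None):
        expires = time.monotonic() + (ttl or self.default_ttl)
        self._store[self._key(sql, warehouse_id)] = (expires, result)
```

The TTL is where freshness requirements meet cost control: a dashboard tile over daily data might cache for hours, while an operational query caches for seconds or not at all.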
Building the Agent Orchestration System
Now let’s talk about how to actually build this.
Using Apache Superset as Your Foundation
D23 is a managed Apache Superset platform designed specifically for this kind of distributed analytics. Rather than building a custom orchestration layer from scratch, you can leverage Superset’s native capabilities:
- Multi-database support: Superset natively connects to Snowflake, BigQuery, Redshift, Postgres, and dozens of other sources
- Semantic layer: Define business metrics and dimensions once, use them across all warehouses
- SQL Lab: Exploratory query interface with syntax highlighting and schema browser
- Embedded analytics: Embed dashboards and charts directly into your product
- API-first design: Everything in Superset is accessible via REST API, making it easy to integrate with agents
When you add AI capabilities (text-to-SQL, natural language query generation), you get the foundation for intelligent data discovery. Pair this with MCP (Model Context Protocol) integration—which allows you to extend Superset’s capabilities with custom tools and data sources—and you have a platform ready for multi-warehouse orchestration.
Implementing the Data Registry
Start by building your data registry. This is typically a separate service that maintains metadata about all your warehouses and datasets.
A minimal registry API might expose:
GET /warehouses
[
  {
    "id": "snowflake-prod",
    "type": "snowflake",
    "region": "us-east-1",
    "cost_per_query_estimate": 0.05,
    "freshness_sla_minutes": 15
  },
  {
    "id": "bigquery-prod",
    "type": "bigquery",
    "region": "us",
    "cost_per_query_estimate": 0.10,
    "freshness_sla_minutes": 5
  }
]

GET /datasets
[
  {
    "id": "customers",
    "warehouse_id": "snowflake-prod",
    "schema": "public",
    "table_name": "customers",
    "row_count": 50000,
    "columns": [
      {
        "name": "customer_id",
        "type": "integer",
        "description": "Unique customer identifier"
      },
      {
        "name": "created_at",
        "type": "timestamp",
        "description": "Account creation date"
      }
    ],
    "last_updated": "2025-01-15T10:30:00Z"
  }
]

GET /metrics
[
  {
    "id": "monthly_revenue",
    "definition": "SUM(order_value) WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)",
    "warehouse_ids": ["snowflake-prod", "bigquery-prod"],
    "primary_warehouse": "snowflake-prod",
    "description": "Total revenue for current month"
  }
]
This registry becomes the single source of truth for your agents. When an agent needs to answer a question, it queries this registry to understand what data is available.
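On the agent side, the lookup against those payloads is straightforward. This helper operates on the parsed `/metrics` and `/warehouses` responses; the function name and return shape are assumptions for illustration, not a published SDK:

```python
def find_warehouses_for_metric(registry_metrics, registry_warehouses, metric_id):
    """Given parsed /metrics and /warehouses payloads, return the
    metric's primary warehouse plus all candidates, cheapest first."""
    metric = next((m for m in registry_metrics if m["id"] == metric_id), None)
    if metric is None:
        raise KeyError(f"unknown metric: {metric_id}")
    by_id = {w["id"]: w for w in registry_warehouses}
    candidates = [by_id[w] for w in metric["warehouse_ids"] if w in by_id]
    candidates.sort(key=lambda w: w["cost_per_query_estimate"])
    return metric["primary_warehouse"], candidates
```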
Building the Query Router
The router sits between user intent and warehouse execution. It needs to be intelligent about routing decisions.
Here’s a simplified router logic:
- Parse the query: Extract intent, required datasets, filters, and aggregations
- Identify candidate warehouses: Which warehouses have all required datasets?
- Check freshness: Do you need real-time data, or is slightly stale data acceptable?
- Estimate cost: Which warehouse will be cheapest to query?
- Check permissions: Can the user access those warehouses?
- Route: Send to the optimal warehouse (or multiple warehouses if needed)
In practice, this might be implemented as:
def route_query(user_query, user_id, max_staleness_minutes):
    # Parse intent
    intent = parse_natural_language(user_query)
    required_datasets = identify_datasets(intent)

    # Find candidate warehouses
    candidates = registry.find_warehouses_with_datasets(required_datasets)

    # Filter by permissions
    accessible = [w for w in candidates if user_has_access(user_id, w)]

    # Filter by freshness: keep warehouses whose staleness SLA is
    # within what the caller will tolerate
    fresh_enough = [w for w in accessible
                    if w.freshness_sla <= max_staleness_minutes]

    # Pick the cheapest remaining option
    return min(fresh_enough, key=lambda w: w.cost_estimate)
This is a simplified version. A production system would also consider query complexity, warehouse load, and historical performance.
Implementing Text-to-SQL with Domain-Specific Fine-Tuning
The magic of data discovery agents is their ability to translate natural language to SQL. But generic LLMs are terrible at this—they hallucinate table names, make up column aliases, and produce syntactically invalid SQL.
You need to fine-tune a model on your specific schema. Here’s the approach:
- Collect training data: Gather 100-1000 examples of natural language queries paired with correct SQL for your specific schema
- Augment with schema context: Include the relevant schema definition (table names, column names, data types) in the model’s context
- Fine-tune or prompt: Fine-tune an open model like Llama 2 or Mistral on your examples, or use a frontier model like GPT-4 with few-shot prompting over your schema
- Validate: Test against a held-out test set; ensure generated SQL is syntactically correct and produces expected results
- Iterate: As you collect more queries, continuously improve the model
A prompt for text-to-SQL might look like:
You are a SQL expert. Given the following schema and a natural language question,
generate the correct SQL query.

Schema:
- customers (customer_id INT, name VARCHAR, created_at TIMESTAMP, status VARCHAR)
- orders (order_id INT, customer_id INT, order_value DECIMAL, order_date DATE)
- order_items (order_id INT, product_id INT, quantity INT, unit_price DECIMAL)

Question: What's our top 5 customers by total spending this year?

SQL:
SELECT
    c.customer_id,
    c.name,
    SUM(o.order_value) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE YEAR(o.order_date) = YEAR(CURRENT_DATE)
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC
LIMIT 5
This approach is far more reliable than using a generic LLM. According to Meta’s engineering work on evolving data warehouses with multi-agent systems, domain-specific training is critical for production reliability.
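The validation step above can be partially automated before any query reaches a warehouse. One cheap trick, sketched here, is to rebuild the schema in an in-memory SQLite database and ask its planner to `EXPLAIN` the generated SQL. This catches misspelled tables and columns and malformed statements; warehouse-specific functions (`DATE_TRUNC`, etc.) would need a dialect-aware checker instead:

```python
import sqlite3

def validate_sql(sql, ddl_statements):
    """Cheap syntactic/semantic check for generated SQL.

    Builds the schema in an in-memory SQLite database and runs
    EXPLAIN, which fails on unknown tables/columns and bad syntax
    without executing the query. Returns (ok, error_message).
    """
    conn = sqlite3.connect(":memory:")
    for ddl in ddl_statements:
        conn.execute(ddl)
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
```

Generated SQL that fails this gate can be regenerated with the error message fed back into the model's context.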
Real-World Implementation Patterns
Pattern 1: The Hub-and-Spoke Model
One warehouse (the “hub”) is the source of truth for most queries. Other warehouses (the “spokes”) are used for specific purposes: real-time data, specialized analytics, compliance.
The router preferentially routes to the hub unless the query specifically requires data from a spoke.
Pros: Simple, predictable, easy to manage consistency
Cons: Hub becomes a bottleneck, doesn’t leverage distributed compute
Pattern 2: The Federated Model
Each warehouse is equally important. The router decides which warehouse to query based on data location and cost.
The challenge: ensuring consistency when the same metric might be calculated differently in different warehouses.
Pros: Leverages distributed compute, scales better
Cons: Complex to manage consistency, requires careful governance
Pattern 3: The Tiered Model
Data is organized by freshness and cost. Real-time data lives in a fast, expensive system (a low-latency serving store or a real-time OLAP engine). Slightly stale data lives in a cheaper warehouse (like S3 with Athena). Historical data lives in cold storage.
The router selects the appropriate tier based on freshness requirements.
Pros: Cost-optimized, meets diverse freshness requirements
Cons: Requires careful data movement orchestration
For most organizations, a hybrid approach works best. Start with a hub-and-spoke model, then gradually move toward federation as your data architecture matures.
Handling Cross-Warehouse Complexity
The Consistency Problem
When the same metric exists in multiple warehouses, they often have different definitions. Your Snowflake warehouse might calculate monthly revenue as of the last day of the month. Your BigQuery warehouse might calculate it as of today.
Solve this with:
- Metric definitions: Define metrics once in your semantic layer, with warehouse-specific SQL implementations
- Reconciliation dashboards: Build dashboards that show metric values across warehouses, making discrepancies visible
- Data validation: Run continuous validation to catch when metrics diverge unexpectedly
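The validation bullet can be made concrete with a small reconciliation check: compute the same metric in every warehouse, then flag any warehouse whose value diverges from the median by more than a tolerance. The function shape here is an illustrative sketch:

```python
import statistics

def check_metric_consistency(values_by_warehouse, tolerance=0.01):
    """Flag warehouses whose value for a metric diverges from the
    median across warehouses by more than `tolerance` (relative).

    Returns a list of (warehouse_id, value, relative_diff) tuples
    for the divergent warehouses.
    """
    baseline = statistics.median(values_by_warehouse.values())
    divergent = []
    for wh, value in values_by_warehouse.items():
        rel = abs(value - baseline) / abs(baseline) if baseline else abs(value)
        if rel > tolerance:
            divergent.append((wh, value, rel))
    return divergent
```

Run this on your critical metrics on a schedule and pipe divergences into the reconciliation dashboard, so definitional drift is caught before users notice it.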
The Latency Problem
When you’re querying across multiple warehouses, the slowest warehouse determines your overall latency. A query that takes 2 seconds in Snowflake and 15 seconds in BigQuery will feel slow to your users.
Solutions:
- Parallel execution: Query multiple warehouses in parallel, return results as they arrive (with a timeout)
- Intelligent routing: Route queries to the fastest warehouse for that type of query
- Caching: Cache results aggressively, using TTLs based on data freshness requirements
- Pre-aggregation: Pre-compute common queries and store results in a fast cache
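The parallel-execution idea is a fan-out with a deadline: query everything at once and keep whatever comes back in time. A sketch, again assuming hypothetical warehouse clients with `.id` and a blocking `.run(sql)`:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def fan_out(sql, warehouses, timeout_s=10.0):
    """Query every warehouse in parallel; keep results that arrive
    within the deadline and drop the rest.

    Returns {warehouse_id: result} for the warehouses that answered.
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=len(warehouses))
    futures = {pool.submit(wh.run, sql): wh.id for wh in warehouses}
    done, not_done = wait(futures, timeout=timeout_s)
    for fut in done:
        try:
            results[futures[fut]] = fut.result()
        except Exception:
            pass  # a failed warehouse simply contributes nothing
    for fut in not_done:
        fut.cancel()
    pool.shutdown(wait=False, cancel_futures=True)
    return results
```

The caller then decides whether a partial answer is acceptable or whether missing warehouses should trigger a retry on the cached tier.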
The Cost Problem
Multi-warehouse queries can get expensive fast. A single exploratory query that hits three warehouses could cost $50 or more.
Manage costs with:
- Query budgets: Set per-user or per-team monthly budgets
- Cost estimation: Show users the estimated cost before running expensive queries
- Query optimization: Analyze slow/expensive queries and suggest optimizations
- Tiered access: Give exploratory access only to cheaper warehouses
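The budget bullet reduces to a small guard in front of the dispatch layer. This is a minimal in-memory sketch (a real implementation would persist spend and reset it monthly):

```python
class QueryBudget:
    """Per-user monthly budget guard: refuse queries whose estimated
    cost would push the user over their limit."""

    def __init__(self, monthly_limit_usd):
        self.limit = monthly_limit_usd
        self.spent = {}  # user_id -> USD spent this month

    def allow(self, user_id, estimated_cost):
        """Check before dispatch, using the router's cost estimate."""
        return self.spent.get(user_id, 0.0) + estimated_cost <= self.limit

    def record(self, user_id, actual_cost):
        """Record after execution, using the warehouse-reported cost."""
        self.spent[user_id] = self.spent.get(user_id, 0.0) + actual_cost
```

Because `allow` uses estimates and `record` uses actuals, the guard stays honest even when the router's cost model is imperfect.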
Integration with D23 and Embedded Analytics
If you’re embedding analytics into your product, multi-warehouse orchestration becomes even more important. Your customers might have data spread across multiple systems, and they’ll expect your embedded analytics to “just work.”
D23’s API-first architecture is purpose-built for this. You can:
- Register multiple data sources: Connect your customers’ Snowflake, BigQuery, or other warehouses
- Create unified dashboards: Build dashboards that query across multiple sources
- Embed with MCP integration: Use Model Context Protocol to extend D23 with custom agents and data sources
- Programmatic access: Use D23’s REST API to query across warehouses programmatically
For example, a venture capital firm using D23 could:
- Connect each portfolio company’s data warehouse
- Build unified dashboards showing portfolio-wide KPIs
- Allow LPs to explore data across companies via natural language queries
- Embed analytics directly in their portfolio management platform
All without requiring each company to move their data or standardize their schema.
Monitoring and Operations
Once you’ve built a multi-warehouse orchestration system, you need to operate it. This means monitoring:
Query Performance
- Latency by warehouse: Which warehouses are slow? Why?
- Query success rate: What percentage of queries fail? On which warehouses?
- Slow query log: Which queries are taking the longest?
Data Quality
- Freshness: How old is the data in each warehouse?
- Completeness: Are all expected datasets present and up-to-date?
- Consistency: Are metrics consistent across warehouses?
Cost
- Cost by warehouse: Which warehouse is most expensive?
- Cost by query type: Are exploratory queries too expensive?
- Cost trends: Is cost increasing or decreasing over time?
Agent Health
- Text-to-SQL accuracy: What percentage of generated SQL is syntactically correct and produces expected results?
- Router accuracy: Is the router making good routing decisions?
- Agent latency: How long does each agent take to respond?
Build dashboards for each of these. Make them visible to your team. When something goes wrong, you want to know immediately.
Advanced: Multi-Agent RAG Systems
As you mature your orchestration system, you might move toward a Retrieval-Augmented Generation (RAG) approach. Instead of routing a query to a single warehouse, you:
- Retrieve: Query your data registry to find relevant datasets across all warehouses
- Augment: Enrich the query with metadata, previous results, and context
- Generate: Use an LLM to generate the optimal SQL
- Execute: Run the SQL and return results
Snowflake’s multi-agent RAG system with Gen2 warehouses and Cortex is a good reference implementation. The key insight is that RAG allows agents to be more intelligent about what data to query, rather than relying on a simple router.
For example, if a user asks “how does our churn rate compare to our competitors,” a RAG system could:
- Retrieve: Find all available churn metrics in your warehouses
- Retrieve: Find competitor benchmarks in your data lake
- Augment: Add context about data freshness and definitions
- Generate: Create SQL that joins your churn data with competitor data
- Execute: Run the query and return results with context
This is significantly more powerful than a simple router, but also more complex to implement.
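The retrieval step is the part that differs most from a plain router, and even a toy version conveys the shape. This sketch ranks registry datasets by word overlap between the question and each dataset's name, columns, and descriptions; a production system would use embeddings, but the interface is the same:

```python
def retrieve_datasets(question, datasets, top_k=3):
    """Toy retrieval: score each registry dataset by how many words
    from the question appear in its name, column names, or column
    descriptions; return the top_k matches."""
    q_words = set(question.lower().split())
    scored = []
    for ds in datasets:
        text = " ".join(
            [ds["table_name"]]
            + [c["name"] for c in ds.get("columns", [])]
            + [c.get("description", "") for c in ds.get("columns", [])]
        ).lower()
        score = sum(1 for w in q_words if w in text)
        if score:
            scored.append((score, ds))
    scored.sort(key=lambda pair: -pair[0])
    return [ds for _, ds in scored[:top_k]]
```

The retrieved datasets (with their schemas and metric definitions) then become the context the generation step uses to write SQL, which is what keeps the model grounded in tables that actually exist.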
Conclusion: From Static BI to Intelligent Data Discovery
The shift from single-warehouse BI tools to orchestrated multi-agent data discovery is fundamental. It reflects how modern data infrastructure actually works: distributed, heterogeneous, and constantly evolving.
Traditional tools like Looker and Tableau assume a stable, well-documented data model. They work great when you have one warehouse. But when you’re managing multiple data sources, embedding analytics into your product, or supporting portfolio companies with different data architectures, you need something more flexible.
D23’s managed Apache Superset platform combined with intelligent agent orchestration gives you that flexibility. You get the power of open-source BI (no vendor lock-in, full customization) without the operational overhead.
Start simple: build a data registry, implement a basic router, add text-to-SQL for one warehouse. As you mature, add more warehouses, implement caching, move toward RAG. The architecture scales with your needs.
The future of analytics isn’t about moving all your data into one place. It’s about building intelligent systems that can navigate your actual data architecture, understand your business logic, and answer questions without requiring a PhD in your schema. That’s what orchestrated data discovery agents deliver.
For organizations scaling analytics across multiple warehouses—whether you’re embedding BI into your product, consolidating portfolio companies, or managing complex data infrastructure—this is the pattern to understand and implement.