BigQuery Federated Queries: When to Use Them
Learn when to use BigQuery federated queries to join Cloud SQL, Spanner, and external data sources without ETL. Real-world patterns and performance trade-offs.
Understanding BigQuery Federated Queries
BigQuery federated queries represent a fundamental shift in how organizations approach multi-source analytics. Rather than extracting data from operational systems, transforming it, and loading it into a data warehouse—the traditional ETL pattern—federated queries allow you to query external data sources directly from BigQuery using standard SQL. This capability eliminates the need to replicate data across systems, reducing latency between source-of-truth updates and analytical visibility.
At its core, a federated query is a SQL statement executed in BigQuery that references tables stored outside BigQuery—typically in Cloud SQL, Cloud Spanner, or other external databases. When you execute a federated query, BigQuery connects to the external source, retrieves the requested data, and performs any necessary transformations or joins before returning results. This virtual integration pattern means your analytical layer can access live operational data without maintaining separate copies.
The Introduction to federated queries documentation from Google Cloud outlines the fundamental syntax and capabilities. However, understanding when federated queries make sense—and when they don’t—requires deeper analysis of your data architecture, query patterns, and performance requirements.
The Architecture Behind Federated Queries
Federated queries operate through a connection layer that BigQuery maintains to external databases. When you define an external data source in BigQuery, you’re creating a reference that includes authentication credentials, connection parameters, and metadata about the remote tables. BigQuery then uses this connection to establish communication with the external system at query time.
The key architectural difference between federated queries and traditional ETL is timing. With ETL, data movement happens on a schedule—nightly batch loads, hourly increments, or event-triggered syncs. With federated queries, data retrieval happens at query time. This means your analytical results always reflect the current state of the source system, but it also means every query incurs network latency and computational overhead on both BigQuery and the remote database.
Google Cloud’s exploration of new features in BigQuery federated queries highlights critical enhancements like SQL pushdown—the ability to push filter predicates and aggregations down to the remote database, reducing data transfer. When you query a Cloud SQL table with a WHERE clause, for example, SQL pushdown means the WHERE clause executes on Cloud SQL first, and only matching rows are transferred to BigQuery. This dramatically reduces network traffic and improves query performance.
When Federated Queries Make Sense
Real-Time Operational Analytics
Federated queries shine when you need analytical access to data that changes frequently and where staleness is unacceptable. Consider a SaaS platform that tracks user activity, subscription status, and billing events in a transactional database. If your analytics team needs to report on active subscribers, churn rate, or MRR (monthly recurring revenue) with minute-level accuracy, federated queries eliminate the lag inherent in batch-based data pipelines.
A practical example: a venture capital firm tracking portfolio company metrics might have each portfolio company’s financial data in a separate Cloud SQL instance—cash balance, burn rate, headcount, revenue. Rather than syncing all this data nightly to a central warehouse, federated queries allow the VC’s analytics team to query live balances across all companies in a single BigQuery dashboard. When a portfolio company closes a funding round or records a major customer win, that data is immediately available for analysis.
Compliance and Data Residency Requirements
In regulated industries—healthcare, finance, government—data residency constraints often prohibit copying sensitive information across regional or organizational boundaries. Federated queries allow you to keep operational data in its original location (a Cloud SQL instance in a specific region or within a VPC) while still enabling analytical queries from BigQuery. The query results are computed without copying the source data.
This is particularly relevant for organizations subject to GDPR, HIPAA, or industry-specific regulations. You can maintain customer PII in a Cloud SQL database within a specific region, and use federated queries to join that data with analytical datasets in BigQuery—the raw PII never leaves the operational database.
Hybrid Transactional-Analytical Workloads
Some organizations run analytics directly against their operational databases using tools like Looker, Tableau, or Metabase. This works for small-scale analytics, but as query complexity grows and concurrency increases, operational database performance degrades. Federated queries provide a middle ground: you keep operational data in its optimized transactional database, but you execute complex analytical queries in BigQuery, which has dramatically different optimization and resource allocation.
For example, an e-commerce platform might store orders, customers, and inventory in Cloud SQL (optimized for fast transactional reads and writes). But when the analytics team wants to perform a complex cohort analysis—grouping customers by acquisition channel, calculating lifetime value, identifying churn risk—that query would lock tables or degrade transaction processing if run directly on Cloud SQL. Federated queries allow that complex analysis to run in BigQuery while the source data remains in Cloud SQL.
Multi-Source Analytics Without Central Data Warehouse
Organizations with multiple operational systems—separate databases for different business units, microservices architectures, or acquired companies—often lack a centralized data warehouse. Building one requires significant engineering effort, governance decisions, and ongoing maintenance. Federated queries enable analytics across these fragmented sources without requiring a centralized hub.
Imagine a holding company with five acquired businesses, each operating independently with its own Cloud SQL instance. Rather than building a centralized data warehouse (which might take 6-12 months), federated queries allow the corporate analytics team to immediately run cross-company queries: comparing revenue by product line, analyzing customer overlap, or consolidating financial metrics. This accelerates time-to-insight post-acquisition.
When Federated Queries Create Problems
High-Volume, Repetitive Queries
Federated queries incur overhead every time they execute. If your dashboard or analytics application runs the same query thousands of times per day, that overhead compounds. Each execution involves establishing a connection to the remote database, executing a query, and transferring results back to BigQuery. At scale, this creates several problems:
Network latency accumulates. Even with optimized connections, a 50-millisecond round trip to Cloud SQL adds up when multiplied across hundreds of daily queries.
Remote database load increases. If 1,000 users are viewing a dashboard powered by federated queries, the source database experiences 1,000 concurrent query executions. This can degrade transactional performance, which is the opposite of your original goal.
Costs become unpredictable. Federated queries charge based on data scanned in BigQuery, but they also generate load on the remote database, which may have its own cost structure. A dashboard that appears cheap in BigQuery might be expensive when you account for the database load it generates.
For high-volume, repetitive queries, caching or materialization is more appropriate. Materialized views in BigQuery, scheduled queries that populate tables, or caching layers eliminate the per-query overhead.
Complex Multi-Table Joins Across Multiple Sources
Federated queries work well when you’re joining a BigQuery table with a remote table. They work less well when you need to join multiple remote tables together, or when you need to join data from three or more sources (two remote databases plus BigQuery, for example).
When you join two Cloud SQL tables using a federated query, BigQuery may pull both tables into memory, perform the join, and return results. For large tables or complex joins, this becomes expensive and slow. The Databricks glossary entry on federated queries notes that virtualization comes with performance trade-offs—you’re not materializing data in an optimized analytical format, so complex transformations can be inefficient.
Consider a scenario where you need to join customer data (Cloud SQL), order history (Cloud Spanner), and product catalog (BigQuery) to calculate customer lifetime value by product category. If you execute this as a federated query, BigQuery must coordinate across all three sources, handle the joins, and aggregate results. This is slower and more resource-intensive than materializing the customer and order data in BigQuery first, then joining with the product catalog.
Unpredictable Query Performance
Federated query performance depends on factors outside your control: remote database load, network conditions, and the remote database’s query optimizer. A query that runs in 2 seconds during off-hours might take 30 seconds during peak business hours, not because of anything BigQuery is doing, but because the remote database is handling transaction processing.
This unpredictability is problematic for user-facing analytics. If you’re embedding a dashboard in your product (as many companies do with tools like D23’s embedded analytics platform), users expect consistent, fast response times. Federated queries make that guarantee impossible.
Data Freshness vs. Performance Trade-Off
Federated queries offer real-time data freshness, but that freshness comes at a cost. If you can tolerate data that’s 1 hour old, 1 day old, or 1 week old, you can materialize that data in BigQuery and serve queries from the materialized copy—dramatically faster, cheaper, and more predictable than federated queries.
The decision between freshness and performance is not always obvious. A dashboard showing yesterday’s metrics is often sufficient for decision-making. Materializing data on a schedule (hourly, daily, weekly) often provides the right balance: fresh enough for actionable insights, but materialized enough for predictable performance.
Technical Patterns and Implementation Strategies
Setting Up Federated Queries: Connection and Authentication
Implementing federated queries requires several setup steps. First, you create an external data source in BigQuery that defines how to connect to the remote database. This includes connection parameters (host, port, database name) and authentication credentials (username, password, or service account). Google Cloud manages these credentials securely; they’re never exposed in query code.
For Cloud SQL, BigQuery can connect via public IP (less secure, simpler) or private IP (more secure, requires VPC peering or Cloud SQL Auth proxy). For Cloud Spanner, BigQuery connects using the Spanner API. For other databases, you might use JDBC connections or other protocols.
Once the external data source is created, you reference it in SQL using the syntax PROJECT.DATASET.CONNECTION_NAME.REMOTE_TABLE. For example:
SELECT c.customer_id, c.name, o.order_count
FROM bigquery_dataset.customers c
JOIN `project.dataset.cloud_sql_connection.orders` o
ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
This query joins a local BigQuery table with a remote Cloud SQL table, filtering for recent orders. The WHERE clause is pushed down to Cloud SQL (assuming SQL pushdown is supported), reducing data transfer.
SQL Pushdown and Query Optimization
SQL pushdown is the most important optimization technique for federated queries. When you include a WHERE clause, GROUP BY, or ORDER BY in a federated query, BigQuery attempts to push these operations down to the remote database. This means the remote database does the filtering, aggregation, or sorting, and BigQuery receives only the processed results.
The Google Cloud blog post on SQL pushdown in federated queries provides concrete examples. If you query a Cloud SQL table with millions of rows but filter for a specific customer, SQL pushdown means Cloud SQL returns only that customer’s rows, not all millions. This reduces network bandwidth and BigQuery processing time.
Not all operations can be pushed down. Complex custom functions, certain window functions, or operations that require BigQuery’s specific capabilities must execute in BigQuery itself. Understanding which operations push down and which don’t is crucial for writing efficient federated queries.
Joining BigQuery and External Data Efficiently
When joining BigQuery tables with external tables, the order of operations matters. If you join a large remote table with a small BigQuery table, BigQuery should retrieve the small table first, then use it to filter the remote table. This is called a broadcast join or lookup join.
For example, if you have a list of 1,000 high-value customers in BigQuery and want to find their orders in Cloud SQL, you should structure the query to filter Cloud SQL orders by customer ID first (using the broadcast join pattern), not retrieve all orders and then filter.
Materialization Strategies: When to Cache Results
For dashboards, reports, or frequently-accessed datasets, consider materializing federated query results into BigQuery tables on a schedule. A scheduled query can run a complex federated query once per hour, day, or week, and write results to a table. Users then query the materialized table instead of the federated query.
This pattern combines the benefits of both approaches: you get fresh data on a schedule, but queries are fast and predictable because they’re reading from materialized tables, not executing federated queries.
For example, a materialized view might run this federated query hourly:
CREATE OR REPLACE TABLE `project.dataset.daily_metrics` AS
SELECT
DATE(o.order_date) as order_date,
COUNT(*) as order_count,
SUM(o.amount) as total_revenue
FROM `project.dataset.cloud_sql_connection.orders` o
WHERE o.order_date >= CURRENT_DATE() - 30
GROUP BY order_date
This query runs once per hour, materializing the last 30 days of order metrics. Dashboards query the materialized table, which returns instantly.
Real-World Use Cases and Trade-Offs
Case Study: VC Portfolio Tracking
A venture capital firm manages 50 portfolio companies. Each company operates independently with its own financial systems—some use SAP, others use NetSuite, and several have custom accounting databases. The VC needs a consolidated dashboard showing cash position, burn rate, and runway across all companies.
Traditional approach: Build an ETL pipeline that extracts financial data from each system nightly, transforms it to a common schema, and loads it into a central data warehouse. This takes 3-4 months of engineering effort and requires ongoing maintenance as portfolio companies change systems.
Federated query approach: Create BigQuery external data sources for each portfolio company’s database. Write federated queries that join financial data across all sources, transforming and consolidating it in BigQuery. This is operational within days, not months.
Trade-off: Federated queries work well here because the VC’s analytics team runs reports a few times per day, not thousands of times. The portfolio company databases are optimized for transactional accounting, not high-concurrency analytics, so the federated query load is manageable. If the VC’s LPs demanded real-time dashboard updates with hundreds of concurrent users, federated queries would become problematic—materialization would be necessary.
Case Study: E-Commerce Operational Analytics
An e-commerce platform processes orders, manages inventory, and handles customer service through separate operational systems. The customer service team needs dashboards showing customer order history, returns, and refund status. This data lives in Cloud SQL and changes constantly.
Federated query approach: Create a BigQuery dashboard that queries Cloud SQL directly using federated queries. Customer service reps see live order data without any latency.
Trade-off: This works if the customer service team is small (dozens of concurrent users) and queries are simple (single-table lookups). If the platform scales to hundreds of concurrent customer service reps, each running queries, the federated query load on Cloud SQL becomes problematic. At that point, you’d materialize order data in BigQuery and synchronize it every few minutes, trading real-time freshness for scalability.
Case Study: Compliance and Data Residency
A healthcare provider must keep patient data within a specific region due to HIPAA requirements. Operational patient data lives in a regional Cloud SQL instance. The analytics team wants to perform epidemiological analysis—disease prevalence, treatment outcomes, medication patterns—without moving PII to a central warehouse.
Federated query approach: Use federated queries to join regional Cloud SQL data (containing PII) with de-identified analytical datasets in BigQuery. The PII never leaves the regional database; only analytical results are returned.
Trade-off: This is an ideal use case for federated queries. The compliance requirement (data residency) aligns perfectly with federated queries’ ability to keep source data in place. Query frequency is moderate (analysts run complex queries a few times per day), and performance is less critical than compliance. The main trade-off is that complex analytical queries must be written in BigQuery SQL, which may require more sophisticated query design than running directly against the operational database.
Comparing Federated Queries to Alternative Approaches
Federated Queries vs. ETL/ELT
ETL (Extract, Transform, Load) has been the standard approach for decades. You extract data from operational systems, transform it to a common schema, and load it into a data warehouse. This requires engineering effort but provides fast, predictable analytical query performance.
Federated queries eliminate the “E” and “L” steps—data stays in place, and you query it virtually. This saves engineering effort and keeps data fresh, but sacrifices query performance and predictability.
When to choose ETL: High-volume analytics, complex transformations, multiple data sources, or when data freshness can be hourly or daily.
When to choose federated queries: Real-time freshness required, compliance/residency constraints, or simple analytical queries.
Federated Queries vs. Change Data Capture (CDC)
Change Data Capture streams operational data changes to BigQuery in near-real-time (seconds or minutes of latency). This is faster than nightly ETL but requires infrastructure investment (CDC tools like Fivetran, Stitch, or custom solutions).
Federated queries offer similar freshness to CDC (actual real-time) but without the infrastructure. However, CDC materializes data in BigQuery, providing faster query performance. Federated queries query remotely, incurring latency on every query.
When to choose CDC: You need real-time data in BigQuery for complex analytics, and you’re willing to invest in CDC infrastructure.
When to choose federated queries: Real-time freshness is needed, but query complexity is low, or compliance constraints require keeping data in place.
Federated Queries vs. Data Virtualization Platforms
Data virtualization platforms like Denodo or Informatica create a semantic layer across multiple data sources, allowing users to query data as if it were centralized. They offer sophisticated caching, query optimization, and governance.
BigQuery federated queries are simpler and cheaper but less feature-rich. If you need sophisticated data governance, caching strategies, or support for many data sources, a data virtualization platform might be better. If you need simple cross-source queries in BigQuery, federated queries are sufficient.
Performance Benchmarks and Cost Considerations
Query Latency
Federated query latency depends on several factors:
- Network round-trip time: Typically 5-50ms to reach Cloud SQL or Cloud Spanner from BigQuery.
- Remote database query execution: Time for the remote database to execute the query (depends on data size, indexes, query complexity).
- Data transfer: Time to transfer results back to BigQuery (depends on result size).
- BigQuery processing: Time for BigQuery to aggregate, join, or further process results.
For a simple query (“SELECT COUNT(*) FROM orders WHERE customer_id = 123”), total latency might be 100-500ms. For a complex query joining multiple tables and aggregating millions of rows, latency might be 5-30 seconds.
Comparison: A materialized table query in BigQuery returns results in 1-5 seconds for similar complexity. The federated query is 2-10x slower.
Cost Structure
BigQuery federated queries charge based on data scanned in BigQuery. If your federated query scans 1GB of data in BigQuery, you pay for 1GB of scanned data (at $6.25 per TB for on-demand pricing).
However, you also incur costs on the remote database:
- Cloud SQL: Charges for compute and storage; federated queries increase compute usage.
- Cloud Spanner: Charges per processing unit; federated queries consume processing units.
- External databases: May have their own query costs or usage-based pricing.
A federated query that appears cheap in BigQuery ($0.01 for scanning 1GB) might cost $5-10 in Cloud SQL compute if it generates heavy load on the database.
Comparison: Materializing data via scheduled queries costs BigQuery storage (cheaper than scanning) and query execution (one-time, not per-user query). For high-volume analytics, materialization is usually cheaper.
Best Practices for Implementing Federated Queries
1. Start with Simple, Specific Queries
Don’t use federated queries for broad, exploratory analytics. Start with specific use cases where you know the query pattern: “I need to join customer data from Cloud SQL with product data in BigQuery to calculate LTV by product category.” Write that specific query, optimize it, and measure performance.
2. Monitor Remote Database Load
Federated queries place load on the remote database. Monitor CPU, memory, and query latency on the remote database when running federated queries. If you see degradation in transactional performance, it’s time to materialize data instead.
3. Use SQL Pushdown Aggressively
Write federated queries with WHERE clauses, GROUP BY, and ORDER BY operations that can be pushed down to the remote database. Avoid operations that require BigQuery’s specific capabilities (complex UDFs, certain window functions) in federated queries.
4. Implement Caching and Materialization for User-Facing Analytics
If your federated queries power dashboards or reports used by many users, materialize results on a schedule. Run the federated query once per hour, day, or week, and serve results from the materialized table.
5. Use Private IP Connections for Security
When connecting to Cloud SQL or Cloud Spanner, use private IP connections (via VPC peering or Cloud SQL Auth proxy) instead of public IP. This is more secure and can be faster.
6. Document Query Patterns and Performance Baselines
Maintain documentation of which federated queries are in production, their expected latency, and the remote database load they generate. This helps with capacity planning and troubleshooting.
Integrating Federated Queries with Analytics Platforms
For organizations using analytics platforms like D23, which provides managed Apache Superset with AI-powered analytics and API integration, federated queries offer a way to connect Superset dashboards to external data sources without materializing everything in BigQuery.
Superset can query BigQuery directly, and BigQuery can use federated queries to access Cloud SQL, Cloud Spanner, or other databases. This creates a flexible architecture: operational data stays in its optimized transactional database, BigQuery serves as the analytical hub, and Superset provides the visualization and user interface layer.
This architecture is particularly valuable for embedded analytics use cases, where you’re building analytics into your product. You can query live operational data via federated queries, process it in BigQuery, and surface it through Superset dashboards without the complexity of maintaining separate data pipelines.
The approach to querying across projects using BigQuery federated queries shows how this pattern scales across multiple GCP projects and databases, enabling complex multi-source analytics without ETL.
Advanced Patterns: Multi-Source Joins and Federation at Scale
Joining Across Multiple External Databases
As your data architecture grows, you might need to join data from multiple external sources. For example, joining customer data from Cloud SQL with order data from Cloud Spanner and product data from BigQuery. This is technically possible with federated queries but requires careful query design.
The key is to structure the join so that filtering happens on the external databases first. Use WHERE clauses that reference specific values (customer IDs, date ranges) that can be pushed down. Avoid joins between two large external tables; instead, materialize one and join with the other.
Handling Schema Evolution
When external database schemas change, federated queries may break. If a Cloud SQL table adds or removes a column, your federated query might fail. Implement monitoring to detect schema changes and update queries accordingly. Consider using SELECT * cautiously in federated queries; explicitly list columns you need so schema changes don’t break your queries.
Dealing with Network Failures and Timeouts
Federated queries depend on network connectivity. If the remote database is unreachable or responds slowly, queries fail or timeout. Implement retry logic and fallback strategies: if a federated query fails, fall back to cached results or a materialized table.
For mission-critical dashboards, avoid relying solely on federated queries. Combine them with materialized data for resilience.
The Future of Federated Queries and BigQuery
Google Cloud continues enhancing BigQuery’s federation capabilities. Recent improvements include SQL pushdown for more operations, support for private IP connections, and better performance for large data transfers. As these capabilities mature, federated queries become more viable for use cases that previously required materialization.
However, the fundamental trade-off remains: freshness vs. performance. Federated queries will always be slower than materialized data, and they’ll always place load on remote databases. As data volumes grow and analytics becomes more central to business operations, most organizations find themselves materializing data anyway, using federated queries only for specific, high-freshness use cases.
The Towards Data Science guide on joining Cloud SQL data with BigQuery provides practical examples of this evolution, showing how organizations start with federated queries and gradually shift toward materialization as their analytics needs grow.
Conclusion: Making the Decision
BigQuery federated queries are a powerful tool for specific use cases: real-time operational analytics, compliance-driven data residency, and simple cross-source queries. They eliminate the engineering overhead of ETL pipelines and keep data fresh without replication.
However, they’re not a replacement for materialized data warehouses. For high-volume analytics, complex transformations, or user-facing dashboards with many concurrent users, materialization provides better performance, predictability, and cost efficiency.
The decision to use federated queries should be based on your specific requirements:
- If freshness is critical and query volume is low: Use federated queries.
- If query volume is high or performance must be predictable: Materialize data.
- If compliance requires data residency: Use federated queries with private IP connections.
- If you need both freshness and performance: Combine federated queries with materialized caching—run federated queries on a schedule to refresh materialized tables.
Start small with federated queries for specific use cases. Monitor remote database load and query performance. As your analytics needs grow, plan to materialize data for high-volume use cases while keeping federated queries for real-time, low-volume analytics. This balanced approach gives you the flexibility to meet diverse analytical requirements without over-engineering your data infrastructure.
For organizations building analytics into their products, D23’s managed Apache Superset platform combined with BigQuery federated queries offers a streamlined path: query operational data in place, process it in BigQuery, and visualize it through Superset dashboards. This architecture scales efficiently and keeps your data infrastructure lean.