How D23 Handles Apache Superset Upgrades Without Downtime
Learn how D23 executes zero-downtime Apache Superset upgrades using blue-green deployments, schema migrations, and rollback strategies for production analytics.
When you’re running Apache Superset in production—especially as a managed service supporting dozens of teams and hundreds of dashboards—upgrades aren’t optional maintenance tasks. They’re critical operational events that require precision planning, tested procedures, and the ability to roll back instantly if something breaks.
At D23, we’ve built a zero-downtime upgrade strategy that keeps dashboards running, queries executing, and your analytics infrastructure available 24/7. This article walks through exactly how we do it: the architecture decisions, the deployment patterns, the database migration strategies, and the safety nets we’ve put in place.
If you’re evaluating managed Apache Superset as an alternative to Looker or Tableau, or if you’re running Superset yourself and want to understand production-grade upgrade patterns, this deep-dive will give you the concrete operational knowledge you need.
Why Upgrades Matter: The Real Cost of Downtime
Before we dig into the technical implementation, let’s be clear about what’s at stake. Apache Superset upgrades aren’t like patching a test environment. When you’re running embedded analytics, self-serve BI dashboards, or KPI reporting infrastructure that teams depend on daily, downtime isn’t just an inconvenience—it breaks workflows, delays decisions, and erodes confidence in your analytics platform.
Consider a typical scenario: you’ve got 50 dashboards embedded in your product. Users are checking conversion funnels, revenue trends, and customer cohorts. Your data team is running ad-hoc queries against your data warehouse. A critical Superset security patch is released, and you need to upgrade within days. If you take the traditional approach—stop the service, run migrations, restart—you’re looking at 15 minutes to an hour of complete unavailability. In that window, embedded dashboards go blank, API calls fail, and your team loses visibility into business metrics.
The financial impact depends on your business, but for SaaS companies, ecommerce platforms, and data-driven organizations, even 30 minutes of analytics downtime can cost thousands of dollars in lost visibility and delayed decisions.
D23’s approach eliminates this entirely. We execute upgrades while dashboards stay live, queries continue to execute, and users never see a service interruption. Here’s how.
The Architecture Foundation: Stateless Application Design
Zero-downtime upgrades start with architecture. If your Superset deployment is tightly coupled to a single server, database, or cache layer, you can’t upgrade without stopping everything. That’s why the first principle of our infrastructure is strict separation of concerns.
Our Superset deployment consists of three independent layers:
Application Layer (Stateless): Superset web servers and query executors run in containers with no local state. A user’s session isn’t pinned to a specific server. If a container goes down, the load balancer routes traffic to another. This is critical because it means we can drain traffic from old containers, spin up new ones with upgraded code, and retire the old ones—all without losing a single request.
Data Layer (Persistent): PostgreSQL (or your chosen database) stores dashboards, users, saved queries, and metadata. This layer never stops during an upgrade. We use read replicas and connection pooling to ensure database availability remains constant.
Cache Layer (Distributed): Redis handles query result caching, session storage, and temporary data. Like the database, this runs independently and survives application upgrades. We use Redis Sentinel for automatic failover, so even if a cache node fails, the system recovers without manual intervention.
This three-layer architecture means an upgrade touches only the stateless application layer. The data and cache layers keep running, serving requests from old application instances until they’re fully drained.
Blue-Green Deployments: The Upgrade Strategy
The core technique we use is called blue-green deployment. Here’s the concept: instead of upgrading in place, you run two complete, identical production environments side by side. One is “blue” (current), one is “green” (new). You upgrade green while blue serves all traffic. Once green is fully tested and healthy, you flip traffic over. If something goes wrong, you flip back instantly.
For Superset, this works like this:
Phase 1: Prepare Green Environment
We provision new Superset containers with the upgraded version. These containers connect to the same PostgreSQL database and Redis cache as the blue environment. They run schema migrations (more on that below) in a controlled, testable way. The green environment is fully operational but receives zero traffic.
Phase 2: Smoke Testing
Before we route any production traffic, we run automated tests against green:
- Load a sample of dashboards and verify they render
- Execute a set of representative queries and check results match blue
- Test the API endpoints that embedded analytics depend on
- Verify user authentication and permissions work correctly
If any test fails, green is torn down and we investigate. Blue continues serving all traffic unaffected.
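The smoke-test harness can be sketched as a small function that runs named checks against one environment and diffs green's query results against blue's. This is a minimal illustration, not D23's actual tooling; the endpoint paths and check names are hypothetical.

```python
def smoke_test(fetch, checks):
    """Run named checks against one environment.

    fetch(path) returns a response body for an endpoint path.
    checks maps a check name to (path, validator); validator(body) -> bool.
    Returns a list of (check_name, reason) failures; empty means healthy.
    """
    failures = []
    for name, (path, validate) in checks.items():
        try:
            body = fetch(path)
        except Exception as exc:
            failures.append((name, f"request failed: {exc}"))
            continue
        if not validate(body):
            failures.append((name, "validation failed"))
    return failures


def diff_against_blue(fetch_blue, fetch_green, paths):
    """Green passes only if its results match what blue currently serves."""
    return [p for p in paths if fetch_blue(p) != fetch_green(p)]
```

In practice `fetch` would wrap an authenticated HTTP client hitting the green load-balancer pool directly, bypassing the public traffic split.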
Phase 3: Gradual Traffic Shift
Once green passes smoke tests, we don’t flip 100% of traffic immediately. Instead, we use a load balancer (we use Nginx with custom routing logic) to gradually shift traffic to green. We start with 5% of requests, monitor error rates and latency, then shift to 10%, 25%, 50%, and finally 100%.
This gradual shift is crucial. If there’s a subtle bug that only manifests under production load or with specific data patterns, we catch it while 95% of traffic still flows through blue. We can roll back without affecting most users.
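The per-request routing decision behind this shift can be sketched in a few lines. In production the weighting lives in the load balancer itself (e.g. Nginx upstream weights), not application code; this Python stand-in just makes the mechanism concrete.

```python
import random


def pick_backend(green_weight, rng=random.random):
    """Weighted routing sketch: send roughly `green_weight` of requests
    to green and the rest to blue. `rng` is injectable for testing."""
    return "green" if rng() < green_weight else "blue"


# Simulate a 5% shift over many requests to sanity-check the split.
random.seed(42)
sent = [pick_backend(0.05) for _ in range(10_000)]
green_share = sent.count("green") / len(sent)
```

Raising `green_weight` through 0.05, 0.10, 0.25, 0.50, and 1.0 reproduces the shift schedule described above.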
Phase 4: Complete Cutover
Once green has handled 100% of traffic for a period without issues, we formally decommission blue. The upgrade is complete.
Phase 5: Instant Rollback (If Needed)
For 24 hours after cutover, we keep blue running in standby mode. If a critical issue emerges—say, a dashboard rendering bug that only appears in a specific configuration—we can flip traffic back to blue in seconds. This gives us a safety net without requiring a full re-upgrade.
This pattern is well-documented in the Kubernetes deployment documentation, which describes rolling updates and blue-green strategies in detail. We implement it using container orchestration, but the principles apply whether you’re using Kubernetes, Docker Compose, or traditional VMs.
Database Migrations: The Trickiest Part
Blue-green deployment works smoothly for application code, but databases are trickier. Here’s the problem: Superset upgrades often include schema changes. A new version might add columns, create indexes, or restructure tables. You can’t run two versions of the application against incompatible database schemas simultaneously.
Our solution uses a principle called backward-compatible migrations. Here’s how it works:
Step 1: Additive-Only Migrations
When we upgrade Superset, we ensure database changes are additive. We add new columns, but we don’t remove old ones immediately. We create new indexes without dropping old ones. This way, both blue (old code) and green (new code) can read and write to the same database schema.
For example, if an upgrade adds a query_timeout_seconds column to the queries table:
- We add the column with a default value
- Old code (blue) ignores the new column
- New code (green) reads and writes it
- Both versions work against the same schema
Step 2: Dual-Write During Transition
During the gradual traffic shift, green instances write to both old and new columns (if applicable). This ensures data consistency. Old code can still read the old columns if needed.
Step 3: Cleanup After Cutover
Once we’ve been running 100% on green for a period, we run cleanup migrations: dropping unused columns, removing deprecated indexes, and optimizing the schema. This happens after blue is decommissioned, so there’s no risk of incompatibility.
This approach requires careful planning. Before we upgrade, we review the Superset release notes and identify schema changes. We test migrations against a production-like copy of the database. We measure migration time and plan for it. The official Superset upgrade documentation provides migration scripts, but in production environments, we always run them in a staging environment first.
Connection Pooling and Query Continuity
During an upgrade, queries that are already executing should complete without interruption. This requires careful connection management.
Superset uses a connection pool to the data warehouse (Snowflake, BigQuery, PostgreSQL, etc.). When we upgrade, we don’t immediately close all connections. Instead:
- Drain New Connections: The load balancer routes all new requests to the green instances; the old (blue) instances stop accepting new connections.
- Let Existing Queries Complete: Old instances keep running, serving queries that are already in flight. A user who started a 5-minute query before the upgrade completes that query on the old instance.
- Graceful Shutdown: Once all in-flight queries complete (or reach a timeout), the old instance shuts down cleanly.
This is called a “drain and replace” pattern. It ensures no query is interrupted mid-execution. For long-running queries (common in data exploration), this is critical.
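The drain step can be sketched as a per-instance in-flight counter: refuse new work, then block until active queries finish or a timeout forces shutdown. This is an illustrative model, not D23's implementation; real deployments typically lean on the load balancer's connection-draining support plus the orchestrator's termination grace period.

```python
import threading
import time


class DrainingWorker:
    """Sketch of the drain-and-replace pattern for one blue instance."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False

    def start_query(self):
        """Admit a query unless the instance is draining."""
        with self._lock:
            if self._draining:
                return False  # load balancer retries against green
            self._in_flight += 1
            return True

    def finish_query(self):
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout_s=30.0, poll_s=0.01):
        """Refuse new work, then wait for in-flight queries to complete.

        Returns True on a clean drain, False if the timeout forces shutdown.
        """
        with self._lock:
            self._draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_s)
        return False
```

The timeout matters: without it, a single runaway query could block the upgrade indefinitely.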
We also use connection pooling with PgBouncer for PostgreSQL connections, which allows us to maintain connection limits while supporting many concurrent users. This prevents connection exhaustion during the upgrade window.
Caching Strategy During Upgrades
Superset caches query results aggressively. A dashboard with 10 charts might execute 10 queries, but if those queries are cached, the dashboard loads in milliseconds instead of seconds.
During an upgrade, we need to handle caching carefully. Here’s our approach:
Preserve Cache Across Versions
We use Redis for caching, and Redis persists data independently of the Superset application. When we upgrade, the cache survives. Green instances can read cached results from blue’s execution.
This has a subtle benefit: dashboards load faster immediately after upgrade because the cache is warm. Users don’t experience the “cold cache” slowdown that typically follows a deployment.
Invalidate Cache for Changed Queries
If an upgrade changes how queries are executed (e.g., a new optimization that changes the query plan), we need to invalidate the cache for affected queries. We do this by tagging cache entries with a version number. When we upgrade, we bump the version for specific query types, invalidating old entries.
This prevents stale results from being served by new code that might interpret them differently.
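Version tagging can be as simple as embedding a per-query-type version in the cache key: bumping the version makes every old entry unreachable, which amounts to invalidation without a scan. A dict stands in for Redis below, and the query-type names are illustrative.

```python
CACHE_VERSIONS = {"chart_data": 1, "dashboard": 1}


def cache_key(query_type, query_hash, versions=CACHE_VERSIONS):
    """Build a version-tagged cache key. Bumping the version for a query
    type orphans its old entries, effectively invalidating them."""
    return f"{query_type}:v{versions[query_type]}:{query_hash}"


cache = {}  # stand-in for Redis

# Written by blue before the upgrade:
cache[cache_key("chart_data", "abc123")] = [42]

# The upgrade changes how chart_data queries execute, so bump the version:
CACHE_VERSIONS["chart_data"] = 2

# Green now misses and recomputes; dashboard entries stay warm at v1.
hit = cache.get(cache_key("chart_data", "abc123"))
```

The orphaned v1 entries age out via Redis TTLs, so no explicit deletion pass is needed.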
Monitor Cache Hit Rates
During the gradual traffic shift, we monitor cache hit rates on both blue and green. If green has significantly lower hit rates, it suggests a problem. We investigate before shifting more traffic.
Monitoring and Observability During Upgrades
You can’t execute safe upgrades without visibility. We instrument every stage of the upgrade with metrics and logs.
Real-Time Dashboards
During an upgrade, we’re watching:
- Error Rates: Are 500 errors increasing? If green’s error rate spikes above blue’s, we roll back immediately.
- Query Latency: Are queries slower on green? A 20% latency increase might indicate a performance regression.
- Cache Hit Rates: Are dashboards being served from cache, or is every request hitting the database?
- Connection Pool Utilization: Are we running out of database connections?
- Traffic Distribution: Are we successfully shifting traffic from blue to green?
We display these metrics on a dedicated dashboard that the on-call engineer watches throughout the upgrade. If any metric goes red, we have a runbook for immediate rollback.
Distributed Tracing
We use distributed tracing (we instrument Superset with OpenTelemetry) to follow individual requests through the system. If a user reports that a dashboard is slow after upgrade, we can trace that request, see exactly which services it touched, and identify the bottleneck.
Alerting Thresholds
We set specific thresholds that trigger automatic rollback:
- Error rate > 1% on green (vs. < 0.1% on blue)
- Median query latency > 1.5x blue’s latency
- Database connection pool exhaustion
- Cache hit rate drop > 20%
If green triggers any of these conditions, we automatically flip traffic back to blue and page the on-call engineer.
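The threshold check itself reduces to a small predicate over the two environments' metrics. The metric names below are illustrative; in a real pipeline the values would come from the monitoring stack, and a True result would trigger the traffic flip and page.

```python
def should_rollback(green, blue):
    """Evaluate the automatic-rollback thresholds listed above.

    `green` and `blue` are dicts of current metrics for each environment.
    """
    return (
        green["error_rate"] > 0.01                                  # > 1% errors
        or green["median_latency_ms"] > 1.5 * blue["median_latency_ms"]
        or green["db_pool_free"] == 0                               # pool exhausted
        or blue["cache_hit_rate"] - green["cache_hit_rate"] > 0.20  # hit-rate drop
    )
```

Comparing against blue's live metrics, rather than fixed constants, keeps the thresholds meaningful when baseline load shifts during the upgrade window.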
Testing Before Production: The Staging Environment
Zero-downtime upgrades in production are only possible because we’ve already tested everything in a staging environment that mirrors production exactly.
Our staging setup includes:
- Identical Infrastructure: Same number of Superset instances, same database configuration, same cache setup.
- Production-Like Data: We use a copy of the production database (with sensitive data masked) so we’re testing against real schemas and real data volumes.
- Load Testing: We run synthetic traffic against staging that mimics production load patterns. We execute 100 concurrent queries, load 50 dashboards, and run API calls.
- Chaos Testing: We deliberately break things—kill a database connection, reduce memory, introduce network latency—to see how the upgrade handles failures.
Only after staging passes all tests do we schedule the production upgrade.
Rollback Procedures: The Safety Net
Despite careful planning, sometimes things go wrong. A subtle bug emerges under production load. A third-party integration breaks. A query optimization causes unexpected results.
We have multiple rollback paths:
Immediate Rollback (First 24 Hours)
We keep the blue environment running in standby for 24 hours after cutover. If we discover a critical issue, we flip traffic back to blue instantly. This takes 30 seconds. The old code is already running, so there’s no startup delay.
Full Rollback (After 24 Hours)
After blue is decommissioned, we keep a backup of the old Superset container image and a database backup from just before the upgrade. If we discover a critical issue, we:
- Restore the database from the pre-upgrade backup (this rolls back any schema changes)
- Spin up containers from the archived image of the old Superset version
- Route traffic to the restored environment
This takes a few minutes but is fully automated. We test it monthly in staging to ensure it works.
Partial Rollback
For non-critical issues, we might roll back specific features or dashboards rather than the entire upgrade. For example, if a new visualization type has a bug, we might disable it for specific users while we fix it.
The Human Element: Communication and Runbooks
Technical infrastructure is only half the story. The other half is process and communication.
Before every upgrade, we:
- Notify Stakeholders: We send a message to all teams using D23, explaining what’s changing, why it matters, and what to expect.
- Create a Runbook: The on-call engineer has a detailed, step-by-step runbook for the upgrade, including decision points (“if error rate spikes, do X”) and rollback procedures.
- Schedule Office Hours: We’re available for questions during the upgrade window.
- Post-Upgrade Report: After the upgrade completes, we share metrics, any issues encountered, and lessons learned.
This communication builds confidence. Teams trust that upgrades are planned, tested, and safe.
Real-World Example: Upgrading Superset 3.0 to 3.1
Let’s walk through a concrete example. Suppose Superset 3.1 is released with a performance improvement and a security patch. Here’s how we’d execute the upgrade:
T-7 Days: We review the release notes, identify schema changes (adding a feature_flags table), and test migrations in staging.
T-3 Days: We run load tests in staging, simulating 500 concurrent users and 1,000 dashboard loads. We verify performance improves.
T-1 Day: We prepare green infrastructure, build containers with Superset 3.1, and run smoke tests.
T-0 (Upgrade Day, 2 AM UTC): We execute the upgrade during our lowest-traffic window.
- 2:00 AM: Green environment starts receiving 5% of traffic. Error rate: 0.05%. Latency: 150ms (same as blue).
- 2:15 AM: Shift to 25%. Error rate: 0.04%. Latency: 145ms. Looks good.
- 2:30 AM: Shift to 50%. Error rate: 0.06%. Latency: 160ms. Slightly higher but within threshold.
- 2:45 AM: Shift to 100%. Error rate: 0.05%. Latency: 155ms. Stable.
- 3:00 AM: Monitor for 1 hour. All metrics normal.
- 4:00 AM: Decommission blue. Upgrade complete.
Total downtime: zero. Total time: 2 hours. Users: completely unaware anything happened.
Comparing to Competitors: Why This Matters
If you’re evaluating managed Superset alternatives, this upgrade strategy is worth understanding. Some competitors—including Preset (Superset’s commercial offering), Looker, Tableau, and Power BI—handle upgrades differently.
Preset offers cloud hosting of Superset, but their upgrade strategy varies by plan. Looker and Tableau are proprietary platforms that handle upgrades automatically but with less transparency into the process. Power BI upgrades are frequent but can impact performance.
D23’s approach is different because we’re transparent about our process, we prioritize zero downtime, and we give you control. You can see exactly how we upgrade, understand the trade-offs, and have confidence in your analytics infrastructure.
If you’re running Superset yourself, you can implement these same patterns. The Kubernetes deployment documentation covers rolling updates and related zero-downtime strategies; blue-green is a close cousin that swaps entire environments rather than replacing instances incrementally. The Docker Compose documentation shows how to manage multi-container applications. And the official Superset upgrade guide provides the migration scripts you need.
But implementing this requires significant operational expertise. You need to understand containerization, orchestration, database migration patterns, and distributed systems. You need to build monitoring and alerting. You need to test thoroughly. This is why many organizations choose managed services—the operational burden is substantial.
Performance Optimization During Upgrades
Beyond just maintaining availability, we optimize performance during upgrades. New versions of Superset often include query optimizations, caching improvements, and UI enhancements that make dashboards faster.
We follow the best practices outlined by CelerData for dashboard optimization, including load balancing strategies and caching configurations. During the gradual traffic shift, we monitor whether green actually delivers the performance improvements the upgrade promises.
If green is faster, we can shift traffic more aggressively because users benefit immediately. If green is slower, we investigate before proceeding.
We also follow Preset’s guidance on optimizing Superset dashboards, which covers query optimization and caching strategies that become especially important during upgrades when schema changes might affect query plans.
Security Considerations
Upgrades often include security patches. We prioritize these above all else, which is why we maintain the ability to upgrade quickly and safely.
Before upgrading, we review the security advisory, understand the risk, and assess whether we need to upgrade immediately (critical vulnerability) or can schedule it normally (low-risk patch).
For critical vulnerabilities, we might execute an emergency upgrade outside normal windows. Our zero-downtime process means we can do this without impacting users.
We follow security best practices for Superset deployments as outlined by enterprise deployment experts, including containerization with Docker for security isolation and Kubernetes for secure orchestration.
Continuous Improvement: Learning from Each Upgrade
Every upgrade is an opportunity to improve the process. After each upgrade, we conduct a post-mortem:
- What went well? Which monitoring alerts were most useful? Which runbook steps were unclear?
- What went wrong? Did we miss any edge cases? Did performance behave unexpectedly?
- What can we improve? Should we add more smoke tests? Adjust traffic shift percentages? Update documentation?
Over time, this continuous improvement makes upgrades faster and safer. Our first managed upgrade took 4 hours and required constant monitoring. Now, routine upgrades take 2 hours and are largely automated.
Implementation for Your Own Superset Deployment
If you’re running Superset yourself and want to implement zero-downtime upgrades, here’s the priority order:
1. Containerize Everything: Use Docker Compose or Kubernetes to run Superset, PostgreSQL, and Redis as containers. This makes blue-green deployment possible.
2. Separate Stateless and Stateful Components: Ensure your Superset application layer has no local state. Move sessions to Redis, configurations to environment variables.
3. Set Up Load Balancing: Use Nginx or a cloud load balancer to distribute traffic across multiple Superset instances.
4. Implement Database Migration Testing: Before any production upgrade, run migrations against a production-like database copy.
5. Build Monitoring: Instrument error rates, latency, cache hit rates, and connection pool utilization. Set up alerts.
6. Create Runbooks: Document the exact steps for upgrade, traffic shift, and rollback. Test them regularly.
7. Test in Staging: Mirror production exactly, run load tests, and verify the upgrade process works before touching production.
This is a significant undertaking, which is why many organizations prefer managed services. But if you have the engineering capacity, it’s absolutely doable.
Conclusion: Zero-Downtime Upgrades as Competitive Advantage
Zero-downtime upgrades aren’t a nice-to-have feature. They’re a fundamental requirement for production analytics infrastructure. Teams depend on dashboards, queries, and API endpoints staying available. Downtime erodes trust and slows decision-making.
At D23, we’ve invested heavily in the infrastructure, processes, and expertise to make zero-downtime upgrades routine. We use blue-green deployments, backward-compatible database migrations, gradual traffic shifts, comprehensive monitoring, and tested rollback procedures.
The result is that you can upgrade Apache Superset confidently, knowing that your analytics infrastructure will remain available, performant, and reliable. Whether you’re evaluating D23’s managed Superset service, running Superset yourself, or comparing options with Looker, Tableau, or other BI platforms, understanding upgrade strategy is crucial.
Zero downtime is possible. It requires planning, testing, and operational discipline. But it’s absolutely worth it.