Guide April 18, 2026 · 16 mins · The D23 Team

Apache Superset Health Checks and Alerting

Monitor Apache Superset deployments with health endpoints, Prometheus, and PagerDuty. Production-grade alerting for dashboards and analytics.


Running Apache Superset in production means more than deploying dashboards. You need visibility into system health, query performance, and data freshness. Without proper monitoring and alerting, you’ll discover critical failures from angry users, not from your infrastructure.

This guide walks you through implementing comprehensive health checks and alerting for Apache Superset deployments—from basic health endpoints to advanced multi-layer monitoring with Prometheus, PagerDuty, and Superset’s native alerting framework. Whether you’re managing a single Superset instance or orchestrating analytics across multiple teams, understanding these patterns will keep your dashboards fast, reliable, and trustworthy.

Why Health Checks and Alerting Matter for Superset

Apache Superset is a stateless, horizontally scalable application. That flexibility is powerful, but it creates blind spots. A single slow query can cascade into timeout errors across your dashboard layer. A misconfigured cache can silently serve stale data for weeks. A database connection pool exhaustion might leave your analytics infrastructure locked up while everything else appears normal.

Proactive monitoring answers three critical questions:

Is Superset running? Health endpoints tell you whether the application itself is alive and responding to requests. This is your first line of defense—if Superset is down, nothing else matters.

Is Superset healthy? Deeper checks verify that Superset can connect to its metadata database, reach your data sources, and execute queries. A running application with broken database connectivity is worse than a down application—it creates confusion and data trust issues.

Are your dashboards and queries performing? Query latency, cache hit rates, and alert execution times reveal whether your analytics infrastructure is meeting SLAs. Slow dashboards erode adoption and waste engineering time on performance debugging.

Alerting converts this visibility into action. When something breaks, the right people get notified immediately, before it impacts business decisions.

Health Endpoint Basics

Apache Superset exposes a built-in health endpoint at /health on your Superset instance. This is the foundation of any monitoring strategy.

A simple GET request returns the health status:

GET http://your-superset-instance/health

A successful response looks like this:

{
  "status": "ok"
}

If Superset can’t reach its metadata database or encounters critical initialization errors, the endpoint returns a 500 status code:

{
  "status": "error",
  "message": "Unable to connect to metadata database"
}

This endpoint is intentionally simple. It’s designed to be lightweight and fast—something you can poll frequently without adding overhead. Most infrastructure monitoring tools (Kubernetes liveness probes, Datadog, New Relic, Prometheus) can scrape this endpoint every 10–30 seconds.
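If Superset runs on Kubernetes, the liveness probe mentioned above can be sketched roughly like this in the container spec. The port and timings here are assumptions; tune them for your deployment:

```yaml
# Probe definitions for the Superset container (port 8088 assumed)
livenessProbe:
  httpGet:
    path: /health
    port: 8088
  initialDelaySeconds: 30   # give Superset time to boot before probing
  periodSeconds: 15         # poll every 15 seconds
  failureThreshold: 3       # restart after ~45s of consecutive failures
readinessProbe:
  httpGet:
    path: /health
    port: 8088
  periodSeconds: 10         # remove from service rotation quickly when unhealthy
```

The readiness probe controls traffic routing while the liveness probe controls restarts, so a slow-starting instance is taken out of rotation before it is killed.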

Configuring Superset for Health Monitoring

The health endpoint is enabled by default in modern Superset versions. If you’re running an older deployment, ensure you’re on Superset 1.5 or later. No additional configuration is required to expose /health, but you should verify it’s accessible from your monitoring infrastructure.

If you’re running Superset behind a load balancer or reverse proxy, ensure the health endpoint bypasses authentication. This is critical—your monitoring system needs to reach /health without credentials, otherwise a credential rotation or SSO outage will blind you to the actual problem.

For Nginx, add this block before your standard Superset location:

location /health {
    proxy_pass http://superset_backend;
    access_log off;
}

For Apache httpd:

<Location /health>
    ProxyPass http://superset_backend/health
    ProxyPassReverse http://superset_backend/health
</Location>

This ensures monitoring traffic doesn’t authenticate through your SSO provider, which could cause cascading failures if your identity provider goes down.

Prometheus Integration for Metrics Collection

The /health endpoint tells you if Superset is alive, but it doesn’t expose performance metrics. For deeper visibility, integrate Superset with Prometheus, the industry-standard metrics collection system.

Superset exposes Prometheus metrics at /metrics when the PROMETHEUS_EXPORTER feature flag is enabled. This endpoint provides granular data on query execution times, cache performance, and request counts.

Enabling Prometheus Metrics

Add this to your Superset configuration file (superset_config.py):

FEATURE_FLAGS = {
    "PROMETHEUS_EXPORTER": True,
}

Restart Superset and verify metrics are exposed:

GET http://your-superset-instance/metrics

You’ll see output like:

# HELP superset_request_latency_seconds Request latency in seconds
# TYPE superset_request_latency_seconds histogram
superset_request_latency_seconds_bucket{endpoint="/api/v1/chart/data",le="0.1"} 45
superset_request_latency_seconds_bucket{endpoint="/api/v1/chart/data",le="0.5"} 128
superset_request_latency_seconds_bucket{endpoint="/api/v1/chart/data",le="1.0"} 156

# HELP superset_query_execution_time_seconds Query execution time
# TYPE superset_query_execution_time_seconds histogram
superset_query_execution_time_seconds_bucket{database="production",le="1"} 234
superset_query_execution_time_seconds_bucket{database="production",le="5"} 567

Key metrics to monitor:

  • superset_request_latency_seconds: How long HTTP requests take to complete. High values indicate slow dashboards.
  • superset_query_execution_time_seconds: Time spent executing database queries. Spikes suggest data source performance degradation.
  • superset_cache_hits_total / superset_cache_misses_total: Counters for cache hits and misses. A low hit ratio means queries bypass the cache, which increases database load.
  • superset_active_connections: Number of concurrent connections to Superset. Sustained high values might indicate connection leaks.
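The latency metrics above are histograms, so percentiles come from histogram_quantile applied to the per-bucket counters. Assuming the metric names shown earlier, a p95 request-latency query might look like:

```promql
# 95th-percentile request latency per endpoint, over a 5-minute window
histogram_quantile(
  0.95,
  sum by (endpoint, le) (rate(superset_request_latency_seconds_bucket[5m]))
)
```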

Setting Up Prometheus Scraping

Create a prometheus.yml configuration file that scrapes your Superset instance:

global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'superset'
    static_configs:
      - targets: ['your-superset-instance:8088']
    metrics_path: '/metrics'
    scrape_interval: 30s
    scrape_timeout: 10s

Start Prometheus and verify it’s collecting metrics. Data begins accumulating within a couple of scrape intervals, and Prometheus retains it for 15 days by default (configurable via --storage.tsdb.retention.time), giving you a rolling history of query performance, request latency, and system health.

Prometheus is self-hosted and lightweight—a single instance can monitor dozens of Superset deployments. If you’re already running Prometheus for infrastructure monitoring, adding Superset is trivial.
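Prometheus keeps 15 days of data by default. If you want a longer window, set the retention flag at startup (file paths here are illustrative):

```shell
# Start Prometheus with 30 days of retention
prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.retention.time=30d
```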

Native Alerts and Reports in Superset

Apache Superset includes a native alerting system designed specifically for analytics use cases. Unlike infrastructure monitoring (which watches Superset itself), Superset alerts watch your data.

You can create alerts that trigger when:

  • A metric crosses a threshold (revenue drops below target, error rate spikes)
  • A query returns unexpected results (no records when records are expected)
  • Data freshness degrades (a table hasn’t been updated in 24 hours)
  • A dashboard query exceeds a latency SLA

Configuring Superset alerts requires a message broker (Redis or RabbitMQ), a Celery worker to execute the alerts asynchronously, and a Celery beat scheduler to fire them on their configured schedules. This decouples alert evaluation from the main Superset application, preventing slow alerts from blocking dashboard loads.

Setting Up Alerts Infrastructure

First, ensure you have a message broker. Redis is the easiest starting point:

docker run -d -p 6379:6379 redis:7-alpine

Next, configure Superset to use Redis as the message broker. Add this to superset_config.py:

# Message broker for Celery
CELERY_BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"

# Enable alerts feature
FEATURE_FLAGS = {
    "ALERT_REPORTS": True,
}

# SMTP configuration for email alerts
SMTP_HOST = "smtp.your-email-provider.com"
SMTP_PORT = 587
SMTP_STARTTLS = True
SMTP_USERNAME = "your-email@example.com"
SMTP_PASSWORD = "your-app-password"
SMTP_FROM_ADDRESS = "alerts@example.com"

Start a Celery worker to process alerts, plus a Celery beat scheduler to trigger them on schedule:

celery --app=superset.tasks.celery_app:app worker --loglevel=info
celery --app=superset.tasks.celery_app:app beat --loglevel=info

Verify the worker is running and connected to Redis. You should see log output confirming the connection.

Creating Your First Alert

In Superset, navigate to a chart and click the alert icon. Create a new alert with:

  • Trigger Condition: The SQL that produces a value, plus the comparison applied to it (e.g., the query SELECT COUNT(*) FROM orders WHERE status = 'failed' with a condition of > 100)
  • Frequency: How often to evaluate the condition (every hour, daily, etc.)
  • Recipients: Email addresses or Slack channels that receive notifications
  • Notification Format: Include the query result, threshold, and timestamp

Superset will evaluate this condition on your specified schedule and send notifications when triggered.
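Conceptually, the evaluation is simple: run the SQL, take the scalar it returns, and compare it to the configured threshold with the chosen operator. A minimal sketch of that logic (the names here are illustrative, not Superset internals):

```python
# Sketch of alert-condition evaluation: compare the query's scalar result
# against a configured threshold. Not Superset's actual implementation.
import operator

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

def should_trigger(query_value: float, op: str, threshold: float) -> bool:
    """Return True when the alert condition is met."""
    return OPERATORS[op](query_value, threshold)

# Example: alert when failed orders exceed 100
print(should_trigger(150, ">", 100))  # True
print(should_trigger(80, ">", 100))   # False
```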

For detailed configuration instructions, the official Apache Superset alerts documentation covers feature flags, Celery setup, and executor configuration. The Preset blog provides practical examples of setting up alerts for common monitoring scenarios.

Integrating PagerDuty for On-Call Escalation

Superset’s native alerts are great for notifying teams about data issues, but production systems need escalation paths. If an alert fires at 2 AM and no one acknowledges it within 5 minutes, it should page the on-call engineer.

PagerDuty integrates with Superset through webhook notifications. When an alert triggers, Superset can POST to a PagerDuty integration endpoint, creating an incident that escalates according to your on-call schedule.

Setting Up PagerDuty Integration

  1. In PagerDuty, create a new service for your analytics platform
  2. Under “Integrations,” select “Events API v2”
  3. Copy the integration key
  4. In Superset, create an alert with a webhook notification
  5. Configure the webhook to POST to PagerDuty’s events endpoint

A typical webhook payload looks like:

{
  "routing_key": "YOUR_PAGERDUTY_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "superset-alert-{{ alert_id }}",
  "payload": {
    "summary": "Superset Alert: {{ alert_name }}",
    "severity": "critical",
    "source": "Superset Analytics",
    "custom_details": {
      "alert_name": "{{ alert_name }}",
      "threshold": "{{ threshold }}",
      "value": "{{ value }}",
      "timestamp": "{{ timestamp }}"
    }
  }
}

When this webhook fires, PagerDuty creates an incident and notifies the on-call engineer according to your escalation policy. If the engineer doesn’t acknowledge within 5 minutes, it escalates to the next level. This ensures critical data issues get immediate attention.
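A small helper that assembles the payload above makes the structure explicit. The field names follow the Events API v2 schema; the actual POST is commented out because it requires a real integration key:

```python
# Sketch of building a PagerDuty Events API v2 event for a Superset alert.
# The send step is commented out; substitute a real integration key to use it.
import json
# import requests  # uncomment to actually send

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_pagerduty_event(routing_key, alert_id, alert_name, severity="critical"):
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # dedup_key lets PagerDuty merge repeat firings into one incident
        "dedup_key": f"superset-alert-{alert_id}",
        "payload": {
            "summary": f"Superset Alert: {alert_name}",
            "severity": severity,
            "source": "Superset Analytics",
        },
    }

event = build_pagerduty_event("YOUR_PAGERDUTY_INTEGRATION_KEY", 42, "Failed orders spike")
print(json.dumps(event, indent=2))
# requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
```

The dedup_key is worth setting deliberately: without it, every evaluation that fires opens a fresh incident instead of updating the existing one.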

Multi-Layer Monitoring Architecture

Production Superset deployments typically combine multiple monitoring layers:

Layer 1: Infrastructure Health — Kubernetes liveness/readiness probes or load balancer health checks monitor the /health endpoint every 10–30 seconds. If Superset stops responding, the orchestration layer automatically restarts the container or removes it from the load balancer.

Layer 2: Application Metrics — Prometheus scrapes /metrics every 30 seconds, collecting query latency, cache performance, and connection pool metrics. Grafana dashboards visualize these metrics in real-time.

Layer 3: Data Quality Alerts — Superset’s native alerting system evaluates data conditions (thresholds, freshness, anomalies) on a schedule you define. Alerts notify teams via email, Slack, or PagerDuty.

Layer 4: Synthetic Monitoring — Automated tests run queries against key dashboards every 5 minutes, verifying they return expected results within SLA latency. This catches issues that don’t trigger infrastructure or data alerts.

Each layer monitors different aspects of the system. Together, they provide comprehensive visibility.

Configuring Alerting for Query Performance

One of the most common Superset monitoring gaps is query performance. A query that takes 45 seconds to execute will timeout on most dashboards, but it won’t trigger infrastructure alerts—Superset is still running fine.

Superset has no built-in slow-query alert setting, but you can cap how long queries are allowed to run in superset_config.py:

# Upper bounds on query runtime (seconds)
SUPERSET_WEBSERVER_TIMEOUT = 60      # web request timeout
SQLLAB_ASYNC_TIME_LIMIT_SEC = 600    # async SQL Lab query limit

Alternatively, use Prometheus metrics to create alerts in your monitoring system:

# Prometheus alert rule
groups:
  - name: superset
    rules:
      - alert: SupersetSlowQueries
        expr: histogram_quantile(0.95, rate(superset_query_execution_time_seconds_bucket[5m])) > 30
        for: 5m
        annotations:
          summary: "95th percentile query latency exceeds 30 seconds"
          description: "Superset queries are slow. Check database performance and query optimization."

This alert fires when the 95th percentile of query execution time exceeds 30 seconds for 5 consecutive minutes, preventing noise from occasional slow queries.

Health Checks for Superset Dependencies

Superset depends on several external systems:

  • Metadata Database (PostgreSQL, MySQL): Stores dashboard definitions, user permissions, and alert configurations
  • Data Sources (data warehouses, databases): The systems Superset queries
  • Message Broker (Redis, RabbitMQ): Powers async alerts and background jobs
  • Cache Layer (Redis, Memcached): Caches query results

A failure in any of these systems will degrade Superset, but the /health endpoint might still return 200 OK if the metadata database is reachable.

Create comprehensive health checks that verify all dependencies:

# Custom health check endpoint (import paths can vary across Superset versions)
from flask import jsonify
from sqlalchemy import text
from superset.extensions import db, cache
import redis

@app.route("/health/comprehensive")
def comprehensive_health():
    health = {"status": "ok", "checks": {}}

    # Metadata database
    try:
        db.session.execute(text("SELECT 1"))
        health["checks"]["metadata_db"] = "ok"
    except Exception as e:
        health["checks"]["metadata_db"] = f"error: {e}"
        health["status"] = "degraded"

    # Cache layer
    try:
        cache.get("health_check")
        health["checks"]["cache"] = "ok"
    except Exception as e:
        health["checks"]["cache"] = f"error: {e}"
        health["status"] = "degraded"

    # Message broker (CELERY_BROKER_URL comes from superset_config.py)
    try:
        r = redis.from_url(CELERY_BROKER_URL)
        r.ping()
        health["checks"]["message_broker"] = "ok"
    except Exception as e:
        health["checks"]["message_broker"] = f"error: {e}"
        health["status"] = "degraded"

    # Data source connectivity (sample check; get_engine is a placeholder
    # for however you obtain a SQLAlchemy engine for your primary source)
    try:
        db_engine = get_engine(database_id=1)
        with db_engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        health["checks"]["data_source"] = "ok"
    except Exception as e:
        health["checks"]["data_source"] = f"error: {e}"
        health["status"] = "degraded"

    status_code = 200 if health["status"] == "ok" else 503
    return jsonify(health), status_code

This endpoint returns detailed information about each dependency. If any check fails, the HTTP status code is 503 (Service Unavailable), which your monitoring system can use to trigger alerts.

Alerting on Data Freshness

Stale data is a silent killer. Dashboards might load quickly and show no errors, but if the underlying data hasn’t been refreshed in days, business decisions are based on outdated information.

Create alerts that verify data freshness by checking when tables were last updated:

-- Alert if the orders table hasn't been updated in 24 hours
SELECT
    'orders' AS table_name,
    EXTRACT(EPOCH FROM (NOW() - MAX(updated_at))) / 3600 AS hours_since_update
FROM orders
HAVING EXTRACT(EPOCH FROM (NOW() - MAX(updated_at))) / 3600 > 24

Set this as an alert condition in Superset with a daily evaluation schedule. If any table is older than 24 hours, the alert fires and notifies your data engineering team.

For more sophisticated freshness monitoring, integrate with your data pipeline orchestration tool (Airflow, dbt, Dagster). These tools have native alerting that can notify Superset when data loads complete or fail.

Monitoring Cache Performance

Superset’s query result cache is critical for dashboard performance. When cache is working well, repeated queries return in milliseconds. When cache is misconfigured or overwhelmed, every query hits the database, causing latency spikes.

Monitor cache performance with these Prometheus queries:

# Cache hit ratio (should be >80% for stable dashboards)
rate(superset_cache_hits_total[5m]) / (rate(superset_cache_hits_total[5m]) + rate(superset_cache_misses_total[5m]))

# Cache size (track growth over time)
superset_cache_size_bytes

# Cache eviction rate (high evictions indicate cache is too small)
rate(superset_cache_evictions_total[5m])

Create alerts for:

  • Cache hit ratio dropping below 60% (indicates misconfiguration or insufficient cache size)
  • Cache size growing unbounded (indicates a memory leak or missing TTL configuration)
  • Cache eviction rate spiking (indicates cache is too small for your workload)

When cache performance degrades, investigate:

  1. Are dashboards caching results? Check the cache TTL settings on each chart
  2. Is the cache layer running out of memory? Monitor Redis/Memcached memory usage
  3. Are queries changing frequently? If dashboard filters change every request, cache won’t help
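The hit-ratio math behind the PromQL expression above is just hits over total lookups, with a guard for the zero-traffic case:

```python
# Cache hit ratio: hits / (hits + misses), guarding against zero traffic.
def cache_hit_ratio(hits: int, misses: int) -> float:
    total = hits + misses
    return hits / total if total else 0.0

# A healthy dashboard workload should sit above ~0.8
print(cache_hit_ratio(820, 180))  # 0.82
```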

Implementing Synthetic Monitoring

Synthetic monitoring runs automated tests against your dashboards to catch issues before users do. Unlike passive monitoring (which waits for problems to occur), synthetic monitoring actively verifies functionality.

Create a simple synthetic test that loads your critical dashboards and verifies they return data within SLA:

import time

import requests
import schedule  # pip install schedule

SUPERSET_API_TOKEN = "replace-with-a-real-token"  # token for a monitoring user

def alert(message):
    # Placeholder: wire this to PagerDuty, Slack, etc.
    print(f"ALERT: {message}")

def log(message):
    print(message)

def synthetic_dashboard_test():
    dashboards = [
        {"id": 1, "name": "Revenue Dashboard", "sla_seconds": 5},
        {"id": 2, "name": "Operations Dashboard", "sla_seconds": 10},
    ]

    for dashboard in dashboards:
        start_time = time.time()

        try:
            response = requests.get(
                f"http://superset/api/v1/dashboard/{dashboard['id']}",
                headers={"Authorization": f"Bearer {SUPERSET_API_TOKEN}"},
                timeout=dashboard["sla_seconds"] + 5,
            )
            elapsed = time.time() - start_time

            if response.status_code != 200:
                alert(f"{dashboard['name']} returned {response.status_code}")
            elif elapsed > dashboard["sla_seconds"]:
                alert(f"{dashboard['name']} took {elapsed:.2f}s (SLA: {dashboard['sla_seconds']}s)")
            else:
                log(f"{dashboard['name']} OK ({elapsed:.2f}s)")

        except Exception as e:
            alert(f"{dashboard['name']} failed: {e}")

# Run every 5 minutes
schedule.every(5).minutes.do(synthetic_dashboard_test)

while True:
    schedule.run_pending()
    time.sleep(1)

Run this test from an external location (not your Superset server) to catch network issues and slow responses. If a dashboard fails the synthetic test, create a PagerDuty incident immediately.

Logging and Tracing for Debugging

When something goes wrong, you need detailed logs to understand what happened. Configure Superset to log at DEBUG level in production (with sampling to avoid overwhelming your log aggregation system):

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {
            "format": "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "class": "logging.StreamHandler",
            "formatter": "standard",
        },
        "superset": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",
            "formatter": "standard",
        },
    },
    "loggers": {
        "superset": {
            "handlers": ["superset"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

Send logs to a centralized aggregation system (ELK, Splunk, Datadog) where you can search and correlate events across your infrastructure.

For distributed tracing, integrate Superset with OpenTelemetry to track requests across services:

from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

FlaskInstrumentor().instrument_app(app)
SQLAlchemyInstrumentor().instrument()

This creates end-to-end traces showing exactly where time is spent—whether it’s in Superset, the database, or the network.

Best Practices for Superset Alerting

1. Alert on outcomes, not events. Don’t alert every time a query is slow; alert when the 95th percentile latency exceeds your SLA for 5 minutes. This reduces noise and prevents alert fatigue.

2. Include context in alert messages. When an alert fires, include the metric value, threshold, and timestamp. Make it easy for the on-call engineer to understand what’s wrong without digging through logs.

3. Test your alerts. Regularly verify that alerts fire when they should and don’t fire when they shouldn’t. A broken alert is worse than no alert.

4. Escalate appropriately. Not every alert needs to page the on-call engineer at 3 AM. Use severity levels—critical alerts page immediately, warnings notify via Slack, info goes to logs.

5. Monitor your monitoring. If your alerting system fails, you’re blind. Monitor the health of Prometheus, PagerDuty, and your message broker as carefully as you monitor Superset.

6. Document runbooks. When an alert fires, on-call engineers need to know what to do. Create runbooks that explain what each alert means and the troubleshooting steps.
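The severity-based escalation in practice 4 boils down to a routing table. A minimal sketch (the channel names are illustrative placeholders, not a real API):

```python
# Route alerts to a channel by severity: critical pages, warning goes to
# chat, info is only logged. Channel names are illustrative placeholders.
ROUTES = {
    "critical": "pagerduty",
    "warning": "slack",
    "info": "log",
}

def route_alert(severity: str) -> str:
    # Unknown severities fall back to the log channel rather than paging
    return ROUTES.get(severity, "log")

print(route_alert("critical"))  # pagerduty
print(route_alert("debug"))     # log
```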

Choosing Between Managed and Self-Hosted Monitoring

If you’re running Superset on your own infrastructure, you have two paths:

Self-hosted monitoring (Prometheus + Grafana + PagerDuty) gives you complete control and integrates seamlessly with your existing infrastructure. It’s the right choice if you already have Prometheus running or if you need highly customized alerting logic.

Managed monitoring (Datadog, New Relic, Splunk) abstracts away infrastructure management. You get pre-built dashboards, intelligent alerting, and enterprise support. It’s the right choice if you want to minimize operational overhead.

For teams evaluating managed Apache Superset hosting, many providers (including D23) handle health monitoring and alerting as part of the managed service. This means you get production-grade monitoring without managing Prometheus or PagerDuty integrations yourself.

Integrating Health Checks with Your CI/CD Pipeline

Before deploying a new version of Superset, verify that health checks pass:

#!/bin/bash

# Deploy new Superset version
docker pull apache/superset:latest
docker-compose up -d

# Wait for Superset to come up (poll for up to 60 seconds)
for i in $(seq 1 12); do
    curl -sf http://localhost:8088/health >/dev/null && break
    sleep 5
done

# Check health endpoint
HEALTH=$(curl -s http://localhost:8088/health | jq -r '.status')

if [ "$HEALTH" != "ok" ]; then
    echo "Health check failed after deployment"
    docker-compose down
    exit 1
fi

# Run comprehensive health checks
COMPREHENSIVE=$(curl -s http://localhost:8088/health/comprehensive | jq -r '.status')

if [ "$COMPREHENSIVE" != "ok" ]; then
    echo "Comprehensive health check failed"
    docker-compose down
    exit 1
fi

echo "Deployment successful"

This ensures every deployment is verified before it reaches production. If health checks fail, the deployment automatically rolls back.

Conclusion: From Reactive to Proactive Monitoring

Healthy Superset deployments don’t happen by accident. They require a combination of infrastructure monitoring (health endpoints, Prometheus metrics), data-aware alerting (Superset’s native alerts), and escalation paths (PagerDuty) that ensure the right people are notified when things break.

Start with the basics: expose the /health endpoint and set up basic load balancer health checks. As your deployment grows, add Prometheus metrics collection and create alerts for query performance. Finally, implement native Superset alerts for data quality and freshness.

The goal isn’t to eliminate all alerts—it’s to eliminate surprises. When your dashboards are slow, you should know before your stakeholders do. When data becomes stale, your team should be notified automatically. When Superset fails, on-call engineers should be paged immediately.

For teams running Superset at scale, this level of monitoring is non-negotiable. If you’re managing multiple Superset instances across teams or embedding analytics in your product, consider whether self-hosted monitoring is worth the operational overhead. Managed platforms like D23 bundle health monitoring, alerting, and expert support, letting you focus on analytics instead of infrastructure.

The references above—including the official Apache Superset alerts documentation, the GitHub discussion on SSO-only setups, and tutorials from Preset, dbt, Astronomer, and DataCamp—provide deep dives into specific configuration scenarios and best practices.

Your dashboards are only as reliable as your monitoring. Build it right from the start.