Embedded Analytics SLAs: What to Promise Your Customers
When you embed analytics into your product, you’re making a promise to your customers. That promise isn’t just about pretty dashboards or clever queries—it’s about uptime, speed, and data freshness. The moment you put analytics in the critical path of your customer’s workflow, you’ve entered SLA territory.
The problem: most teams building embedded analytics don’t think about SLAs until something breaks. By then, you’re scrambling to explain why a dashboard went dark, why queries take 45 seconds, or why yesterday’s numbers don’t match today’s. This article walks you through what embedded analytics SLAs actually are, why they matter, and how to set them realistically without crippling your infrastructure.
Understanding Embedded Analytics SLAs
An SLA—Service Level Agreement—is a contract between you and your customer about what they can expect from your service. In the context of embedded analytics, that means commitments around three core dimensions: availability (is the dashboard up?), latency (how fast are queries?), and freshness (how recent is the data?).
Embedded analytics differs fundamentally from standalone BI tools. When a customer opens Tableau or Looker, they’re aware they’re using a BI tool. They expect occasional downtime, refresh delays, and slowness. But when analytics are embedded directly into your product—say, a revenue dashboard in your SaaS platform or a performance report in your mobile app—customers don’t think of it as “BI.” They think of it as part of your core product. They expect it to work like the rest of your application.
That’s the core tension: embedded analytics sit at the intersection of operational systems and analytical systems. Operational systems are built for speed and reliability. Analytical systems are built for flexibility and complex queries. Your SLA needs to reflect that reality.
When you’re using managed Apache Superset or building on open-source BI, you have direct control over infrastructure, caching, and query optimization. That’s powerful—it means you can make deliberate trade-offs. But it also means the SLA is on you.
The Three Pillars of Embedded Analytics SLAs
Availability: The Uptime Commitment
Availability is the simplest pillar to understand but the hardest to get right. It answers the question: “Is the analytics dashboard accessible right now?”
Availability is usually expressed as a percentage over a time period. “99.9% availability” means the service can be down for about 43 minutes per month. “99.99%” means about 4 minutes per month. For embedded analytics, most teams aim for 99.5% to 99.9%.
Here’s what matters:
The infrastructure stack. If your analytics platform depends on a single database, a single application server, and a single network path, your availability is limited by the weakest link. Three components at 99% availability each combine to roughly 99% × 99% × 99% ≈ 97%. That is, three two-nines components together deliver less than two nines. This is why managed platforms often outperform self-hosted setups—they distribute load, implement redundancy, and fail over automatically.
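The multiplication above is easy to sketch in a few lines (this assumes independent component failures, which is optimistic for a real stack where failures are often correlated):

```python
# Sketch: combined availability of serially dependent components,
# assuming independent failures (real stacks often fail together).
def combined_availability(*components: float) -> float:
    total = 1.0
    for a in components:
        total *= a
    return total

# One database, one app server, one network path, each at 99%:
stack = combined_availability(0.99, 0.99, 0.99)
print(f"{stack:.4%}")  # 97.0299%
```

Adding a redundant replica changes the math: two databases at 99% each, where either one suffices, fail together only 0.01 × 0.01 = 0.01% of the time.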
Scheduled maintenance. Most SLAs exclude scheduled maintenance windows. If you say “99.9% availability excluding scheduled maintenance,” you’re buying yourself maintenance windows. A typical SLA might allow 4 hours per month of scheduled downtime. Be explicit about when maintenance happens. Sunday 2 AM UTC might work for your US-based customers but devastate your Asia-Pacific users.
What counts as “down.” Does a single customer seeing a 500 error count as downtime for the entire service? Or only if 10% of customers can’t access dashboards? Most SLAs define a threshold—typically, the service is considered down if more than 5% of requests fail or if a specific region becomes unreachable. Be specific in your SLA.
Graceful degradation. In practice, you rarely achieve true binary up/down. More often, you have partial degradation: some queries run fast, others time out. Some dashboards load, others don’t. Your SLA should account for this. You might commit to “95% of queries complete within 30 seconds” rather than “all queries complete within 30 seconds.” This is more realistic and more defensible.
For embedded analytics specifically, availability often matters more than in standalone BI because it’s part of your core product experience. A Looker dashboard going down for an hour is annoying. Your embedded revenue dashboard going down for an hour might cost you customer trust.
Latency: The Speed Commitment
Latency is how long it takes for a query to return results. In embedded analytics, latency directly impacts user experience. A dashboard that takes 15 seconds to load feels broken, even if it technically works.
Latency SLAs are usually expressed as percentiles. “p50 latency < 2 seconds” means 50% of queries finish in under 2 seconds. “p95 latency < 10 seconds” means 95% of queries finish in under 10 seconds. “p99 latency < 30 seconds” means even the slowest 1% of queries finish within 30 seconds.
Why percentiles? Because averages lie. If 99 queries take 1 second and 1 query takes 100 seconds, the average is 1.99 seconds. But your customer sees that 100-second query and thinks your system is broken. Percentiles force you to think about the tail.
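The effect is easy to demonstrate. The sketch below uses 95 fast queries and 5 slow ones, and the nearest-rank percentile definition (one common convention; monitoring systems may interpolate differently):

```python
import math

# 95 queries at 1 second, 5 pathological queries at 100 seconds.
latencies = [1.0] * 95 + [100.0] * 5

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies) / len(latencies)
print(mean)                        # 5.95 -- looks acceptable
print(percentile(latencies, 50))   # 1.0
print(percentile(latencies, 95))   # 1.0
print(percentile(latencies, 99))   # 100.0 -- the query your customer remembers
```

The average suggests everything is fine; the p99 exposes the tail your slowest customers actually experience.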
Latency depends on several factors:
Query complexity. A simple “count of events today” query might run in 100ms. A query joining five tables, filtering by 20 conditions, and aggregating across billions of rows might take 30 seconds. Your SLA needs to account for this range. You might commit to different latencies for different dashboard types: “standard dashboards < 5 seconds, custom reports < 30 seconds.”
Data volume. As your customers’ data grows, queries slow down. A query that runs in 2 seconds on 10 million rows might take 20 seconds on 1 billion rows. This is why it’s critical to understand your customers’ data volumes when setting SLAs. If you promise “all queries < 5 seconds” but your customer has 50 billion events, you’re setting yourself up for failure.
Caching strategy. This is where embedded analytics shine. Unlike ad-hoc BI tools where every query is unique, embedded dashboards often show the same visualizations repeatedly. You can pre-compute results, cache them, and serve cached results instantly. A well-designed caching layer can reduce p95 latency from 20 seconds to 2 seconds. But caching introduces staleness—which brings us to freshness.
Concurrency. When multiple customers query simultaneously, database load increases and latency degrades. Your SLA should specify latency under normal load (e.g., “p95 < 5 seconds at 100 concurrent users”) or peak load (“p95 < 10 seconds at 1000 concurrent users”).
For embedded analytics, latency SLAs are critical. Users expect embedded experiences to feel snappy. If your embedded dashboard takes 10 seconds to load, users will perceive your entire product as slow, even if the rest of your application is fast.
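The caching layer described above can be sketched as a simple TTL cache in front of the query layer. The function names (`cached_query`, `run_query`) are illustrative, not a real API:

```python
import time

# Sketch: a minimal TTL cache in front of a query layer. The TTL bounds
# staleness, which is exactly the latency-vs-freshness trade-off.
_cache: dict[str, tuple[float, object]] = {}

def cached_query(sql: str, run_query, ttl_seconds: float = 300):
    """Serve a cached result if it is fresher than ttl_seconds."""
    now = time.monotonic()
    hit = _cache.get(sql)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]            # fast path: no database round trip
    result = run_query(sql)      # slow path: hit the warehouse
    _cache[sql] = (now, result)
    return result
```

A production version would add per-tenant cache keys, size limits, and invalidation on data refresh, but the principle is the same: repeated dashboard loads should not repeat warehouse work.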
Freshness: The Data Recency Commitment
Freshness answers: “How old is the data in this dashboard?” It’s measured as the time between when an event occurs and when it appears in analytics.
Freshness is often the most contentious SLA dimension because it directly conflicts with latency and cost. Real-time freshness (events visible within about a second) is expensive. It requires streaming infrastructure, complex event processing, and careful orchestration. Near-real-time (under 5 minutes) is more reasonable. Daily batches are cheap but stale.
Freshness depends on your data pipeline architecture:
Batch ETL. Data is extracted, transformed, and loaded on a schedule—typically daily or hourly. Freshness is determined by how often you run the job. If you run daily at midnight UTC, data is up to 24 hours old at the start of the day. This is simple and cheap but stale. Best practices for reliable pipelines emphasize that batch freshness is predictable—you know exactly when data will refresh.
Streaming ingestion. Events flow into your data warehouse in real-time or near-real-time. Freshness is seconds or minutes. This is more expensive (streaming infrastructure, schema management, exactly-once semantics) but much fresher. Building SLAs for real-time dashboards with AI-ETL provides detailed guidance on committing to real-time freshness.
Hybrid approaches. Many teams use a combination: real-time streaming for critical metrics (revenue, user activity) and daily batches for less critical data (customer demographics, historical trends). Your SLA can reflect this: “core metrics updated every 5 minutes, supporting data updated daily.”
Freshness also depends on your analytics platform. If you’re using managed Apache Superset, you control when data refreshes. You can implement smart caching that serves fresh data for recent time periods and cached data for historical periods. You can refresh different datasets on different schedules.
Here’s the key insight: freshness, latency, and cost form a triangle. Pick two, and the third suffers. Real-time + fast = expensive. Real-time + cheap = slow. Fast + cheap = stale. Your SLA should reflect this trade-off explicitly.
Setting Realistic SLA Targets
Now that you understand the three pillars, how do you actually set targets? The answer depends on your customers, your infrastructure, and your business model.
Understanding Your Customer’s Needs
Different customers have different requirements. A venture capital firm tracking portfolio performance doesn’t need real-time data—daily or weekly updates are fine. A SaaS platform showing customers their usage metrics needs data fresh within the hour. A trading platform needs sub-second latency.
Before setting SLAs, ask your customers:
- How fresh does the data need to be? Is daily sufficient, or do you need hourly? Minute-level?
- How fast should queries run? Is 5 seconds acceptable, or do you need sub-second responses?
- How much downtime can you tolerate? Is 99% acceptable, or do you need 99.9%?
You’ll likely get a range of answers. That’s okay. You can tier your SLAs: “standard tier: 99.5% availability, p95 latency < 10 seconds, daily data refresh. Premium tier: 99.9% availability, p95 latency < 5 seconds, hourly data refresh.”
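Tiered targets like these are easiest to keep honest when they live in one machine-readable place that enforcement, billing, and status pages all read from. A minimal sketch (tier names and numbers mirror the example above):

```python
# Sketch: tiered SLA targets as data, one source of truth for
# enforcement, alerting, and customer-facing status pages.
SLA_TIERS = {
    "standard": {
        "availability": 0.995,        # 99.5%
        "p95_latency_seconds": 10,
        "refresh_interval_hours": 24, # daily
    },
    "premium": {
        "availability": 0.999,        # 99.9%
        "p95_latency_seconds": 5,
        "refresh_interval_hours": 1,  # hourly
    },
}

def meets_sla(tier: str, availability: float, p95_seconds: float) -> bool:
    target = SLA_TIERS[tier]
    return (availability >= target["availability"]
            and p95_seconds <= target["p95_latency_seconds"])
```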
Benchmarking Against Competitors
Look at what Looker, Tableau, Power BI, and other competitors promise. Most traditional BI platforms don’t publish detailed SLAs—they’re vague about latency and freshness. That’s because they can’t control these dimensions; they depend on customer infrastructure.
Managed platforms like Preset (the commercial Superset offering) typically commit to 99.5% availability. Most cloud BI platforms commit to 99.9% for premium tiers.
For latency, Looker and Tableau don’t typically commit to specific numbers—they say “it depends on your data.” That’s honest but unhelpful. Managed platforms are more specific because they control the infrastructure.
For freshness, traditional BI platforms don’t commit to anything. They assume you’ll sync your data warehouse on your own schedule. Managed platforms can be more specific because they often manage the data pipeline.
Understanding Your Infrastructure Limits
Your SLA is only as good as your infrastructure. Before committing to anything, map out your actual capabilities:
Availability. What’s your current uptime? If you’re running on a single database server, your availability is probably 99% at best. If you’re running on managed cloud infrastructure with multi-region failover, you might hit 99.99%. Be honest about what you can actually deliver.
Latency. Run load tests. How fast do queries actually run at peak load? If p95 is currently 8 seconds, don’t promise 5 seconds. Promise 8 seconds, then work on optimization.
Freshness. What’s your current data pipeline? If you’re running daily batch jobs, you can’t promise hourly freshness without major changes. Understand the cost of each improvement: moving from daily to hourly might require 3x infrastructure investment. Moving from hourly to real-time might require 10x.
The SLA Ladder
A useful framework is the SLA ladder: start conservative, then improve. This is especially important for new products.
Year 1: 99% availability, p95 latency < 15 seconds, daily data refresh. You’re learning, your infrastructure is simple, your customer base is small.
Year 2: 99.5% availability, p95 latency < 10 seconds, 6-hourly data refresh. You’ve optimized your database, implemented caching, added redundancy.
Year 3: 99.9% availability, p95 latency < 5 seconds, hourly data refresh. You’ve invested in multi-region infrastructure, sophisticated query optimization, streaming data pipelines.
This ladder gives you room to grow without overpromising. It also gives you a roadmap for infrastructure investment.
Implementing SLA Observability
Setting an SLA is one thing. Measuring it is another. You need visibility into whether you’re meeting your commitments.
Monitoring Availability
Availability monitoring is straightforward: ping your dashboard endpoint every 30 seconds from multiple geographic locations. If it responds with a 200 status, it’s up. If it doesn’t, it’s down.
But this is too simplistic for embedded analytics. You need to monitor:
- API availability. Is your query API responding?
- Dashboard rendering. Can users actually see dashboards, or do they get blank pages?
- Data freshness. Is the data actually being refreshed on schedule?
A dashboard might return a 200 status but show stale data or failed queries. That’s not really “up.”
Monitoring Latency
For latency, you need to track actual query performance in production. Instrument your query layer to record:
- Query execution time
- Query type (simple, complex, etc.)
- Data volume queried
- Concurrent queries at the time
- User/customer ID
Aggregate this data to compute percentiles. “p95 latency is currently 8 seconds, up from 5 seconds yesterday” tells you something is wrong.
Visualize latency over time. Create alerts: “if p95 latency exceeds 10 seconds for 5 minutes, alert on-call engineer.”
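The instrumentation and alert check described above can be sketched with a rolling window. The window size, the 10-second threshold, and the helper names are illustrative:

```python
import math
import time
from collections import deque

# Sketch: time every query, keep a rolling window, alert on p95.
recent = deque(maxlen=1000)  # durations of the last 1,000 queries

def timed_query(run_query, sql: str):
    start = time.monotonic()
    try:
        return run_query(sql)
    finally:
        recent.append(time.monotonic() - start)

def rolling_p95() -> float:
    ordered = sorted(recent)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def should_alert(threshold_seconds: float = 10.0) -> bool:
    # Require a minimum sample count so one slow query can't page anyone.
    return len(recent) >= 100 and rolling_p95() > threshold_seconds
```

In production you would also tag each sample with query type, customer ID, and concurrency, and ship the samples to your metrics system rather than holding them in memory.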
Monitoring Freshness
Freshness monitoring is often overlooked. You need to track:
- Data arrival latency. When does data actually arrive in your warehouse after an event occurs?
- Pipeline latency. How long does it take from data arrival to availability in dashboards?
- Refresh timeliness. Are scheduled refreshes actually happening on time?
Implement data freshness checks: “verify that today’s data is present by 9 AM UTC.” If the check fails, alert.
Publishing SLA Metrics
Make your SLA metrics visible to customers. Many platforms publish a status page showing current availability, latency, and freshness. This builds trust and reduces support burden—customers can see that you’re tracking these metrics seriously.
Include historical data: “availability last 30 days: 99.92%.” This shows you’re consistently meeting commitments.
Handling SLA Breaches
Eventually, you’ll breach an SLA. A query times out. A dashboard goes down. Data doesn’t refresh on schedule. What then?
The Incident Response Protocol
Have a clear process:
- Detect. Monitoring alerts fire. On-call engineer investigates.
- Communicate. Notify affected customers immediately. “We’re experiencing elevated latency. ETA for resolution: 30 minutes.”
- Remediate. Fix the underlying issue. Scale up database. Clear cache. Restart service.
- Verify. Confirm that the issue is resolved and SLA is being met again.
- Analyze. Post-incident, understand root cause. Was it a capacity issue? A bug? A dependency failure?
- Improve. Make changes to prevent recurrence. Add capacity. Improve monitoring. Implement circuit breakers.
SLA Credits
Many SLAs include credits for breaches. “If we miss 99.5% availability in a month, we’ll credit 10% of your monthly fee.”
Credits incentivize you to take SLAs seriously. They also compensate customers for the impact of your failure.
Be specific about credit calculation:
- Availability 99.0-99.4%: 10% credit
- Availability 98.0-98.9%: 25% credit
- Availability < 98%: 50% credit
Credits are usually capped at 100% of monthly fees (you can’t owe more than the customer paid).
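Expressing the credit schedule as a function keeps billing and incident reporting consistent. This sketch mirrors the tiers listed above; adapt the bands to your own contract:

```python
# Sketch: the credit schedule above as a function, assuming a 99.5%
# availability SLA. Returns a percentage of the monthly fee.
def sla_credit_percent(monthly_availability: float) -> int:
    if monthly_availability >= 0.995:
        return 0    # SLA met, no credit owed
    if monthly_availability >= 0.99:
        return 10
    if monthly_availability >= 0.98:
        return 25
    return 50       # capped; credits never exceed 100% of the monthly fee
```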
Communicating Breaches
When you breach an SLA, communicate clearly:
- What happened? (“Database connection pool exhausted due to unexpected traffic spike”)
- When did it happen? (“3:15-3:47 AM UTC on March 15”)
- What was the impact? (“Queries took 45 seconds on average; 2% of requests timed out”)
- What are you doing about it? (“We’ve scaled database from 4 to 8 cores and implemented connection pooling”)
- How are you making it right? (“We’re crediting 10% of your monthly fee”)
This transparency builds trust. Customers understand that systems fail; they respect teams that handle failures well.
SLAs for Different Embedded Analytics Use Cases
Different use cases have different SLA requirements. Here are some examples:
Executive Dashboards and KPI Reporting
Executives check dashboards daily or weekly. They don’t need real-time data, but they do need high availability and numbers they can trust.
Recommended SLA:
- Availability: 99.9%
- Latency: p95 < 5 seconds (executives expect snappy dashboards)
- Freshness: Daily (refreshed overnight)
Executive dashboards are often high-stakes. A wrong number in a board meeting is expensive. Prioritize accuracy and availability over freshness.
Operational Dashboards (Real-Time Monitoring)
Operational dashboards show current system state: server status, customer activity, revenue, etc. Teams rely on them to make decisions. They need fresher data.
Recommended SLA:
- Availability: 99.95%
- Latency: p95 < 3 seconds (operations teams need quick feedback)
- Freshness: 5-15 minutes (near real-time)
Operational dashboards are often in the critical path of incident response. If your operations dashboard goes down during an outage, you’ve made the situation worse.
Customer-Facing Analytics (Embedded in SaaS Products)
Customers see these dashboards regularly. They expect them to work like any other part of your product. They need good availability and reasonable latency.
Recommended SLA:
- Availability: 99.5%
- Latency: p95 < 5 seconds
- Freshness: Hourly (most customers accept hourly delays)
Customer-facing analytics are part of your product experience. A slow or broken dashboard reflects poorly on your entire product.
Self-Service Analytics (Data Exploration)
Users run ad-hoc queries, exploring data. Query complexity varies wildly. Availability and latency expectations are lower.
Recommended SLA:
- Availability: 99%
- Latency: p95 < 30 seconds (users expect exploration to take time)
- Freshness: Daily or hourly (depends on use case)
Self-service analytics are less mission-critical. Users understand that complex queries take time. You have more flexibility here.
Advanced SLA Considerations
Multi-Tenant SLA Isolation
When you serve multiple customers, one customer’s heavy query shouldn’t impact another customer’s performance. This requires:
- Query queuing. Limit concurrent queries per customer.
- Resource allocation. Allocate CPU, memory, and I/O per customer.
- Query timeouts. Kill long-running queries to prevent resource hogging.
Your SLA might be: “p95 latency < 5 seconds for standard queries, subject to per-customer concurrency limits.”
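Per-customer concurrency limits can be sketched with one semaphore per tenant. The limit of 5 and the helper names here are illustrative:

```python
import threading
from collections import defaultdict

# Sketch: per-tenant concurrency limits so one customer's heavy
# queries can't starve the others. Limit of 5 is illustrative.
MAX_CONCURRENT_PER_TENANT = 5
_limits: dict[str, threading.Semaphore] = defaultdict(
    lambda: threading.Semaphore(MAX_CONCURRENT_PER_TENANT))

def run_for_tenant(customer_id: str, run_query, sql: str,
                   wait_timeout: float = 30.0):
    sem = _limits[customer_id]
    if not sem.acquire(timeout=wait_timeout):
        raise TimeoutError(f"{customer_id} hit its concurrency limit")
    try:
        return run_query(sql)  # wrap with a per-query timeout in practice
    finally:
        sem.release()
```

In a real deployment the limit would come from the customer’s tier, and the query itself would carry a statement timeout so a single runaway query can’t hold a slot for minutes.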
Seasonal Variations
Some analytics workloads are seasonal. A retail company’s analytics are heavy during holiday season. A school’s analytics are heavy during registration periods.
Your SLA might account for this: “99.5% availability during normal periods, 99% during peak periods (defined as Q4 for retail, August-September for education).”
Data Quality SLAs
Beyond availability, latency, and freshness, consider data quality. Data SLAs for reliable pipelines emphasize that data completeness and accuracy are as important as freshness.
You might commit to: “99.9% of data is complete and accurate within 24 hours of collection.”
This requires data validation, anomaly detection, and data quality monitoring.
Dependency SLAs
Your analytics depend on upstream systems: data warehouses, APIs, ETL tools. If your data warehouse is down, your analytics are down. But you didn’t cause the outage.
Most SLAs exclude dependency failures: “99.9% availability, excluding outages of third-party services like Snowflake or BigQuery.”
But you should still monitor and communicate dependency issues. If your SLA is being breached because of a dependency, customers need to know.
Communicating SLAs to Customers
Your SLA is only valuable if customers know about it. Include it in:
- Service agreements. Embed SLA terms in your standard contracts.
- Documentation. Publish SLA targets in your docs.
- Status pages. Show current and historical SLA metrics.
- Onboarding. Discuss SLA expectations during customer onboarding.
Be clear about what’s included and excluded:
- “99.9% availability, measured monthly, excluding scheduled maintenance windows (up to 4 hours per month, scheduled on Sundays 2-6 AM UTC).”
- “p95 query latency < 5 seconds for dashboards with < 1 billion rows. Dashboards with > 1 billion rows may have higher latency.”
- “Data freshness: core metrics updated hourly, supporting data updated daily.”
This clarity prevents misunderstandings and sets realistic expectations.
Building SLA Culture
SLAs aren’t just legal documents. They’re commitments that shape how your team works.
Making SLAs Operational
Internalize SLAs in your engineering culture:
- On-call rotations. Assign engineers to respond to SLA breaches.
- SLA budgets. Allocate “error budget”—if you have 99.9% availability, you can afford 43 minutes of downtime per month. Track this budget. When you’re approaching the limit, prioritize stability over new features.
- Blameless postmortems. When SLAs are breached, analyze without blame. Focus on systems and processes, not individuals.
- Continuous improvement. Use SLA metrics to drive infrastructure improvements. “p95 latency is 8 seconds; our goal is 5 seconds. What infrastructure changes would help?”
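The error-budget arithmetic above is simple enough to encode directly (43,200 minutes approximates a 30-day month):

```python
# Sketch: error-budget accounting for a monthly availability target.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def error_budget_minutes(target_availability: float) -> float:
    return MINUTES_PER_MONTH * (1 - target_availability)

def budget_remaining(target_availability: float,
                     downtime_minutes_so_far: float) -> float:
    return error_budget_minutes(target_availability) - downtime_minutes_so_far

# A 99.9% target gives a 43.2-minute monthly budget; a 30-minute
# incident leaves 13.2 minutes before stability work should take
# priority over new features.
```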
Aligning SLAs with Business Goals
Different customers have different SLA requirements. Understanding how SLAs depend on trustworthy analytics and BI makes one thing clear: your commitments should align with business impact.
A customer paying $10k/month might require 99.9% availability. A customer paying $1k/month might accept 99%. Your SLA structure should reflect this.
Conclusion: SLAs as Product Strategy
Embedded analytics SLAs aren’t just operational commitments—they’re product strategy. They communicate what you value: availability, speed, or freshness. They shape your infrastructure decisions. They influence your pricing.
Start conservative. Measure relentlessly. Improve systematically. Over time, you’ll build embedded analytics that customers trust.
When you’re ready to implement embedded analytics with strong SLA foundations, D23 provides managed Apache Superset with the infrastructure and expertise to meet ambitious SLA targets. Whether you’re building executive dashboards, operational analytics, or customer-facing analytics, we help you set realistic SLAs and deliver on them consistently.
The key is being honest about what you can deliver, measuring whether you’re delivering it, and continuously improving. That’s how you build embedded analytics that customers rely on.