Published: January 2, 2025 • Updated: January 2, 2025
You've probably seen SLOs like this:
❌ "99.9% uptime"
❌ "P99 latency under 200ms"
❌ "Zero errors"
These SLO statements are hard to act on: they name a target but no measurement window, no error budget, and no alerting strategy, so in practice they aren't measurable or actionable and they generate alert fatigue.
Good SLOs have three properties: they're measurable, they're actionable, and they're backed by an error budget. To define one, you need four building blocks:
SLI (Service Level Indicator): A metric that measures a specific aspect of service quality. Examples: request success rate, P99 latency, availability percentage.
SLO (Service Level Objective): A target value for an SLI over a time period. Example: "99.9% of requests must succeed over a 30-day window."
SLA (Service Level Agreement): A contractual promise to customers, often with penalties. Example: "We guarantee 99.9% uptime or you get a refund." SLAs should be looser than your internal SLOs.
Error budget: The amount of failure allowed before the SLO is violated. If the SLO is 99.9%, the error budget is 0.1% (about 43 minutes per month). This budget can be "spent" on deployments, experiments, and incidents.
Availability. Definition: Percentage of HTTP requests that return non-5xx status codes.
SLO Target: 99.9% success rate over 30 days
Error Budget: 0.1% = 43.2 minutes of 5xx errors per month
Calculation: 30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes
# SLI: Success rate (ratio of good requests to total requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# This gives you a value between 0.0 (0% success) and 1.0 (100% success)
# prometheus-rules.yaml
groups:
  - name: availability-slo
    rules:
      # Fast burn: page immediately (burning ~2% of the monthly budget per hour)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Availability SLO is burning too fast"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1.44%). At this rate, the monthly error budget will be exhausted in about 2 days."

      # Slow burn: create a ticket (consuming 10% of the budget in 24 hours)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[24h]))
                 / sum(rate(http_requests_total[24h])))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Availability SLO budget at risk"
          description: "Error rate over 24h is {{ $value | humanizePercentage }}. Review error trends."
Why these numbers? A fast burn (14.4× the allowed error rate) would exhaust the monthly budget in roughly two days if nothing is fixed, so it warrants a page. A slow burn (3× the allowed rate) gives you about 10 days before exhaustion: time to investigate without panic.
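If you also want to watch burn rates on a dashboard instead of only inside alert expressions, you can precompute them with recording rules. A minimal sketch (the rule names slo:availability:burn_rate1h and slo:availability:burn_rate24h are illustrative, not a standard):

groups:
  - name: availability-slo-recordings
    rules:
      # Burn rate = observed error ratio / error budget (0.001 for a 99.9% SLO).
      # 1 means the budget is consumed exactly at the allowed pace;
      # 14.4 means the 30-day budget would be gone in 720 / 14.4 = 50 hours.
      - record: slo:availability:burn_rate1h
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) / 0.001
      - record: slo:availability:burn_rate24h
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[24h]))
                 / sum(rate(http_requests_total[24h])))
          ) / 0.001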
Latency. Definition: Percentage of requests that complete within a target duration.
SLO Target: 99% of requests complete in under 500ms over 30 days
Error Budget: 1% of requests can be slower than 500ms
Note: Use P99 (99th percentile), not average. Average hides outliers.
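To see why, compare the two queries below against the same histogram used in this section (assuming the standard http_request_duration_seconds histogram metrics):

# P99 latency: the value that 99% of requests stay under
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency: a handful of very slow requests barely move this number
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))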
# SLI: Percentage of requests faster than 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Requires histogram metrics (http_request_duration_seconds_bucket)
# Alert: Latency SLO burn rate
groups:
  - name: latency-slo
    rules:
      - alert: LatencySLOFastBurn
        expr: |
          (
            1 - (sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
                 / sum(rate(http_request_duration_seconds_count[1h])))
          ) > (14.4 * 0.01)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Latency SLO burning too fast"
          description: "{{ $value | humanizePercentage }} of requests are slower than 500ms (threshold: 14.4%). P99 latency is likely impacted."
Error Rate. Definition: Percentage of requests that return 4xx/5xx errors.
SLO Target: 99.5% of requests return 2xx/3xx over 30 days
Error Budget: 0.5% = 216 minutes of errors per month (30 days × 24 hours × 60 minutes × 0.005 = 216 minutes)
# SLI: Percentage of non-error responses
sum(rate(http_requests_total{status=~"[23].."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Or the inverse (error rate):
sum(rate(http_requests_total{status=~"[45].."}[5m]))
  / sum(rate(http_requests_total[5m]))
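The burn-rate alerting pattern from the availability SLO applies here as well. A hedged sketch for the 0.5% budget (the alert name and the 14.4 multiplier are illustrative; tune them to your own policy):

- alert: ErrorRateSLOFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"[45].."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.005)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error-rate SLO is burning too fast"
    description: "Error ratio over 1h is {{ $value | humanizePercentage }} (threshold: 7.2%)."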
Throughput / Capacity. Definition: The system handles expected peak traffic without degradation.
SLO Target: System handles 10,000 RPS with P99 latency under 500ms
Why this matters: Prevents cascading failures during traffic spikes
# SLI: Request rate is within capacity AND latency is acceptable
(
  sum(rate(http_requests_total[5m])) < 10000
)
and
(
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) < 0.5
)

# Alert when approaching capacity
- alert: ApproachingTrafficCapacity
  expr: sum(rate(http_requests_total[5m])) > 8000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic approaching capacity limit"
    description: "Current RPS: {{ $value }}. Capacity limit: 10,000 RPS. Consider scaling."
Data Durability. Definition: Percentage of successfully written records that remain retrievable.
SLO Target: 99.999% data durability (five 9s)
Error Budget: 0.001% = about 26 seconds of data loss per month (30 days × 24 hours × 3,600 seconds × 0.00001 ≈ 26 seconds)
Implementation: Synthetic checks that write/read test data
# SLI: Synthetic data availability check
sum(rate(synthetic_data_check_success[5m]))
  / sum(rate(synthetic_data_check_total[5m]))

# Example: Write a unique test record every minute, then try to read it back.
# If the read fails, data durability is violated.
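Because a five-nines target leaves almost no error budget (26 seconds a month), alerting on this SLI is usually binary rather than burn-rate based. An illustrative sketch using the synthetic-check metrics above (the alert name and windows are assumptions):

- alert: DataDurabilityCheckFailing
  expr: |
    (
      sum(rate(synthetic_data_check_success[10m]))
      / sum(rate(synthetic_data_check_total[10m]))
    ) < 1
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Synthetic durability checks are failing"
    description: "A written test record could not be read back. With a 99.999% target, treat this as an incident immediately."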
┌──────────┬──────────────┬─────────────────┬────────────────┐
│ SLO      │ Error Budget │ Downtime/Month  │ Downtime/Year  │
├──────────┼──────────────┼─────────────────┼────────────────┤
│ 90%      │ 10%          │ 3 days          │ 36.5 days      │
│ 95%      │ 5%           │ 1.5 days        │ 18.25 days     │
│ 99%      │ 1%           │ 7.2 hours       │ 3.65 days      │
│ 99.5%    │ 0.5%         │ 3.6 hours       │ 1.83 days      │
│ 99.9%    │ 0.1%         │ 43 minutes      │ 8.76 hours     │
│ 99.95%   │ 0.05%        │ 21 minutes      │ 4.38 hours     │
│ 99.99%   │ 0.01%        │ 4.3 minutes     │ 52.6 minutes   │
│ 99.999%  │ 0.001%       │ 26 seconds      │ 5.26 minutes   │
└──────────┴──────────────┴─────────────────┴────────────────┘
Why not 100%? Every "9" you add doubles your operational cost. 99.9% → 99.99% means you need multi-region failover, chaos engineering, extensive testing. Most SaaS can deliver great UX at 99.9%.
Error budgets aren't just numbers—they're decision-making tools. Here's how to use them:
Error budget remaining above 50% → Status: Healthy
Actions: Ship new features aggressively, experiment with risky changes, schedule maintenance windows
Error budget remaining between 10% and 50% → Status: At Risk
Actions: Slow down feature velocity, focus on reliability improvements, require extra review for risky changes
Error budget remaining below 10% → Status: FREEZE
Actions: Stop all feature releases, only deploy critical bug fixes, focus 100% on reliability, post-mortem required
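To make this policy enforceable rather than aspirational, alert on the same thresholds. A sketch that reuses the 30-day budget expression from the dashboard panels below (the alert names are illustrative):

- alert: ErrorBudgetAtRisk
  expr: |
    (
      0.001 - (sum(increase(http_requests_total{status=~"5.."}[30d]))
               / sum(increase(http_requests_total[30d])))
    ) / 0.001 < 0.5
  labels:
    severity: ticket
  annotations:
    summary: "Less than 50% of the monthly error budget remains"

- alert: ErrorBudgetFreeze
  expr: |
    (
      0.001 - (sum(increase(http_requests_total{status=~"5.."}[30d]))
               / sum(increase(http_requests_total[30d])))
    ) / 0.001 < 0.1
  labels:
    severity: page
  annotations:
    summary: "Less than 10% of the monthly error budget remains: freeze feature releases"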
Create a dashboard with these panels:
# Panel 1: SLO Compliance (Last 30 days)
# Shows the current SLI value vs the SLO target
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
)
# Display as a percentage with an SLO threshold line at 0.999

# Panel 2: Error Budget Remaining
# Shows the percentage of error budget left
(
  0.001 - (1 - (sum(increase(http_requests_total{status!~"5.."}[30d]))
                / sum(increase(http_requests_total[30d]))))
) / 0.001 * 100
# Display as a gauge: Green (>50%), Yellow (10-50%), Red (<10%)

# Panel 3: Error Budget Burn Rate
# Shows the current burn rate (1.0 = consuming budget exactly at the allowed rate)
(
  1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
       / sum(rate(http_requests_total[1h])))
) / 0.001
# Display as a graph over time. Values > 1 mean you are burning faster than allowed

# Panel 4: Time to Budget Exhaustion
# Hours until the budget runs out at the current burn rate:
# remaining budget fraction / current error ratio, scaled by the 720 hours in a 30-day window
(
  (
    0.001 - (1 - (sum(increase(http_requests_total{status!~"5.."}[30d]))
                  / sum(increase(http_requests_total[30d]))))
  )
  /
  (
    1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
         / sum(rate(http_requests_total[1h])))
  )
) * 720
# Display as a stat panel with an alert threshold at 24 hours
Don't alert only when the SLO drops below 99.9%. By then, the damage is already done.
✅ Solution: Alert on error budget burn rate. Fast burn = page now, slow burn = investigate later.
Having 20 SLOs means none of them matter. You'll drown in alerts.
✅ Solution: Start with 3-5 SLOs max. Focus on user-facing metrics (availability, latency, errors).
"CPU < 80%" is not an SLO. Users don't care about CPU, they care if the app works.
✅ Solution: Only use user-facing metrics. Ask: "Would a user notice if this SLI is violated?"
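As a purely illustrative contrast, the first rule below fires on an internal resource signal that users may never notice, while the second fires only when requests are actually failing:

# Infrastructure-centric: node CPU above 80% (standard node_exporter metric)
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 10m
  labels:
    severity: warning

# User-centric: more than 1% of requests are failing
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: page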
Hand-written Prometheus rules (the DIY approach shown throughout this post). Pros: Full control, free, flexible
Cons: Manual setup, requires PromQL expertise
Sloth generates Prometheus rules from SLO definitions:
# slo.yaml
version: "prometheus/v1"
service: "api"
slos:
  - name: "availability"
    objective: 99.9
    description: "API availability SLO"
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: AvailabilitySLO
      labels:
        category: "availability"
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

# Generate Prometheus rules
sloth generate -i slo.yaml -o prometheus-slo-rules.yaml
Google Cloud's built-in SLO monitoring. Pros: Fully managed, integrates with GCP monitoring
Cons: Vendor lock-in, requires GCP
Good SLOs aren't just operational metrics; they're a product decision about how much reliability your users actually need and how much you're willing to spend to provide it.
Start simple: Pick 1-2 SLOs (availability + latency). Implement burn rate alerts. Track error budget in your weekly standup. Once that's working, expand to the full 5 SLOs above.
We design and implement production SLO monitoring: Prometheus setup, error budget dashboards, burn rate alerts, Grafana configuration, and SRE best practices.