
SLOs That Actually Work: 5 SLOs for SaaS

Production-ready Service Level Objectives with Prometheus queries, error budget calculations, and alert rules designed to minimize alert fatigue while catching real issues.

Published: January 2, 2025

Updated: January 2, 2025

Why Most SLOs Fail (And How to Fix Them)

You've probably seen SLOs like this:

❌ "99.9% uptime"
❌ "P99 latency under 200ms"
❌ "Zero errors"

These SLOs fail in practice because they don't say how or over what window they're measured, they aren't tied to an error budget, and they generate alert fatigue instead of action.

Good SLOs have three properties:

  1. User-centric: Measure what users experience, not internal metrics
  2. Error budget-based: Allow for controlled failure (100% is impossible and wasteful)
  3. Burn rate alerts: Page when SLO is at risk, not on every error

What You'll Get
  • 5 Essential SLOs for SaaS platforms (availability, latency, error rate, throughput, data durability)
  • Prometheus queries for each SLI (Service Level Indicator)
  • Error budget math (e.g., 99.9% = 43 minutes downtime/month)
  • Multi-window burn rate alerts (fast burn: page immediately, slow burn: ticket)
  • Grafana dashboards for visualizing SLO compliance

SLO Terminology: SLI, SLO, SLA, Error Budget

SLI (Service Level Indicator)

A metric that measures a specific aspect of service quality. Examples: request success rate, P99 latency, availability percentage.

SLO (Service Level Objective)

A target value for an SLI over a time period. Example: "99.9% of requests must succeed over a 30-day window."

SLA (Service Level Agreement)

A contractual promise to customers, often with penalties. Example: "We guarantee 99.9% uptime or you get a refund." SLAs should be looser than internal SLOs.

Error Budget

The allowed amount of failure before violating an SLO. If SLO is 99.9%, error budget is 0.1% (43 minutes/month). This budget can be "spent" on deployments, experiments, incidents.

The 5 Essential SLOs for SaaS

SLO #1: Availability (Request Success Rate)

Definition: Percentage of HTTP requests that return non-5xx status codes.

SLO Target: 99.9% success rate over 30 days

Error Budget: 0.1% = 43.2 minutes of 5xx errors per month
Calculation: 30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes

Prometheus SLI Query

# SLI: Success rate (ratio of good requests to total requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# This gives you a value between 0.0 (0% success) and 1.0 (100% success)
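If this ratio feeds several alerts and dashboard panels, a recording rule keeps the definition in one place. A minimal sketch using the same metric; the rule name is illustrative:

groups:
  - name: availability-sli
    rules:
      # Precompute the 5m success ratio so alerts and dashboards share one definition
      - record: sli:http_request_success:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))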

Burn Rate Alerts

# prometheus-rules.yaml
groups:
  - name: availability-slo
    rules:
      # Fast burn: Page immediately (consuming ~2% of monthly budget per hour)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Availability SLO is burning too fast"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1.44%). At this rate, the monthly error budget will be exhausted in about 2 days."

      # Slow burn: Create ticket (consuming 10% of budget in 24 hours)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[24h]))
              /
              sum(rate(http_requests_total[24h]))
            )
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Availability SLO budget at risk"
          description: "Error rate over 24h is {{ $value | humanizePercentage }}. Review error trends."

Why these numbers? A fast burn of 14.4x the allowed error rate consumes about 2% of the monthly budget every hour and would exhaust it in roughly two days if left unchecked, so it pages immediately. A slow burn of 3x gives you about 10 days before exhaustion: time to investigate without panic.
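A common refinement of these rules, in line with the multi-window approach mentioned earlier, is to require a short window to agree with the long one, so the page clears quickly once the error rate recovers. A sketch of the fast-burn rule with an added 5m window; the short-window choice is an assumption, not part of the rules above:

# Sketch: fast-burn alert gated on both a 1h and a 5m window.
# The 5m condition stops paging soon after the error rate recovers.
- alert: AvailabilitySLOFastBurnMultiWindow
  expr: |
    (
      1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
           / sum(rate(http_requests_total[1h])))
    ) > (14.4 * 0.001)
    and
    (
      1 - (sum(rate(http_requests_total{status!~"5.."}[5m]))
           / sum(rate(http_requests_total[5m])))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page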

SLO #2: Latency (Request Duration)

Definition: Percentage of requests that complete within a target duration.

SLO Target: 99% of requests complete in under 500ms over 30 days

Error Budget: 1% of requests can be slower than 500ms
Note: Use P99 (99th percentile), not average. Average hides outliers.

# SLI: Percentage of requests faster than 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# Requires histogram metrics (http_request_duration_seconds_bucket)
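To see why the note above prefers P99 over the average, plot both from the same histogram. A quick comparison, assuming the standard _bucket, _sum, and _count series exist for this metric:

# P99 latency derived from the histogram buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Mean latency over the same window; tail outliers disappear into it
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))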

# Alert: Latency SLO burn rate
groups:
  - name: latency-slo
    rules:
      - alert: LatencySLOFastBurn
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
              /
              sum(rate(http_request_duration_seconds_count[1h]))
            )
          ) > (14.4 * 0.01)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Latency SLO burning too fast"
          description: "{{ $value | humanizePercentage }} of requests are slower than 500ms (threshold: 14.4%). P99 latency is likely impacted."

SLO #3: Error Rate (Application Errors)

Definition: Percentage of requests that complete without a 4xx/5xx error response.

SLO Target: 99.5% of requests return 2xx/3xx over 30 days

Error Budget: 0.5% = 216 minutes of errors per month

# SLI: Percentage of non-error responses
sum(rate(http_requests_total{status=~"[23].."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Or inverse (error rate):
sum(rate(http_requests_total{status=~"[45].."}[5m]))
/
sum(rate(http_requests_total[5m]))
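The burn-rate pattern from SLO #1 carries over directly; only the budget constant changes from 0.001 to 0.005 for the 99.5% target. A sketch of the corresponding fast-burn rule:

- alert: ErrorRateSLOFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"[45].."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.005)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error-rate SLO burning too fast"
    description: "4xx/5xx ratio over 1h is {{ $value | humanizePercentage }} (threshold: 7.2%)."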

SLO #4: Throughput (Traffic Capacity)

Definition: System handles expected peak traffic without degradation.

SLO Target: System handles 10,000 RPS with P99 latency under 500ms

Why this matters: Prevents cascading failures during traffic spikes

# SLI: Request rate is within capacity AND latency is acceptable
(
  sum(rate(http_requests_total[5m])) < 10000
)
and
(
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
  ) < 0.5
)

# Alert when approaching capacity
- alert: ApproachingTrafficCapacity
  expr: sum(rate(http_requests_total[5m])) > 8000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic approaching capacity limit"
    description: "Current RPS: {{ $value }}. Capacity limit: 10,000 RPS. Consider scaling."
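To watch headroom over time rather than only at the 8,000 RPS alert threshold, a small recording rule can track utilization against the 10,000 RPS target above. A sketch; the rule name is illustrative:

groups:
  - name: throughput-sli
    rules:
      # Fraction of the 10,000 RPS capacity currently in use (1.0 = at the limit)
      - record: sli:traffic_capacity:utilization_ratio
        expr: sum(rate(http_requests_total[5m])) / 10000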

SLO #5: Data Durability (For Storage/Database Services)

Definition: Percentage of successful write operations that are retrievable.

SLO Target: 99.999% data durability (five 9s)

Error Budget: 0.001% = 26 seconds of data loss per month
Implementation: Synthetic checks that write/read test data

# SLI: Synthetic data availability check
sum(rate(synthetic_data_check_success[5m]))
/
sum(rate(synthetic_data_check_total[5m]))

# Example: Write a unique test record every minute, try to read it back
# If read fails, data durability is violated
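A matching alert, reusing the synthetic check metrics from the query above. Because five nines leaves only seconds of budget per month, any sustained check failure is worth paging on; the 10m window and strict threshold are assumptions, not from the original:

- alert: DataDurabilityCheckFailing
  expr: |
    (
      sum(rate(synthetic_data_check_success[10m]))
      /
      sum(rate(synthetic_data_check_total[10m]))
    ) < 1
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Synthetic durability checks are failing"
    description: "Some test records written in the last 10 minutes could not be read back."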

Error Budget Math: The Complete Guide

┌──────────┬──────────────┬─────────────────┬────────────────┐
│ SLO      │ Error Budget │ Downtime/Month  │ Downtime/Year  │
├──────────┼──────────────┼─────────────────┼────────────────┤
│ 90%      │ 10%          │ 3 days          │ 36.5 days      │
│ 95%      │ 5%           │ 1.5 days        │ 18.25 days     │
│ 99%      │ 1%           │ 7.2 hours       │ 3.65 days      │
│ 99.5%    │ 0.5%         │ 3.6 hours       │ 1.83 days      │
│ 99.9%    │ 0.1%         │ 43 minutes      │ 8.76 hours     │
│ 99.95%   │ 0.05%        │ 21 minutes      │ 4.38 hours     │
│ 99.99%   │ 0.01%        │ 4.3 minutes     │ 52.6 minutes   │
│ 99.999%  │ 0.001%       │ 26 seconds      │ 5.26 minutes   │
└──────────┴──────────────┴─────────────────┴────────────────┘
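Every row in this table comes from the same arithmetic, shown here for the 99.99% row (assuming a 30-day month, as elsewhere in this article):

Downtime/month = (1 − SLO) × 30 × 24 × 60 minutes = (1 − 0.9999) × 43,200 minutes = 4.32 minutes
Downtime/year = (1 − 0.9999) × 365 × 24 × 60 minutes = 52.6 minutes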

Why not 100%? Each additional "9" raises operational cost sharply: moving from 99.9% to 99.99% means multi-region failover, chaos engineering, and extensive testing. Most SaaS products can deliver a great UX at 99.9%.

Implementing Error Budget Policies

Error budgets aren't just numbers—they're decision-making tools. Here's how to use them:

Error Budget > 50% remaining

Status: Healthy
Actions: Ship new features aggressively, experiment with risky changes, schedule maintenance windows

Error Budget: 10-50% remaining

Status: At Risk
Actions: Slow down feature velocity, focus on reliability improvements, require extra review for risky changes

Error Budget < 10% remaining or exhausted

Status: FREEZE
Actions: Stop all feature releases, only deploy critical bug fixes, focus 100% on reliability, post-mortem required
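These thresholds are easier to enforce if they page or ticket on their own, rather than relying on someone to check a dashboard. A sketch that reuses the "Error Budget Remaining" expression from the Grafana section below, opening a ticket when less than 10% of the budget is left; the alert name is illustrative:

- alert: ErrorBudgetNearlyExhausted
  expr: |
    (
      0.001 - (
        1 - (
          sum(increase(http_requests_total{status!~"5.."}[30d]))
          /
          sum(increase(http_requests_total[30d]))
        )
      )
    ) / 0.001 < 0.10
  for: 1h
  labels:
    severity: ticket
  annotations:
    summary: "Less than 10% of the monthly error budget remains"
    description: "Per the error budget policy, freeze feature releases and prioritize reliability work."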

Grafana Dashboard for SLO Tracking

Create a dashboard with these panels:

# Panel 1: SLO Compliance (Last 30 days)
# Shows current SLI value vs SLO target
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)
# Display as percentage with SLO threshold line at 0.999

# Panel 2: Error Budget Remaining
# Shows percentage of error budget left
(
  (0.001) - (
    1 - (
      sum(increase(http_requests_total{status!~"5.."}[30d]))
      /
      sum(increase(http_requests_total[30d]))
    )
  )
) / (0.001) * 100
# Display as gauge: Green (>50%), Yellow (10-50%), Red (<10%)

# Panel 3: Error Budget Burn Rate
# Shows current burn rate (1.0 = consuming budget at target rate)
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  )
) / (0.001)
# Display as graph over time. Values >1 mean burning faster than allowed

# Panel 4: Time to Budget Exhaustion
# Hours until the budget runs out at the current 1h burn rate
# (assumes roughly constant traffic; 720 = hours in the 30-day window)
(
  (
    (0.001) - (
      1 - (
        sum(increase(http_requests_total{status!~"5.."}[30d]))
        /
        sum(increase(http_requests_total[30d]))
      )
    )
  ) / (
    1 - (
      sum(rate(http_requests_total{status!~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    )
  )
) * 720
# Display as stat panel with alert threshold at 24 hours
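One practical note: the increase(...[30d]) expressions above scan a lot of raw data on every dashboard refresh. A common workaround is to record the short-window good/total rates and build the 30-day view from those precomputed series instead. A minimal sketch with illustrative rule names; the sum_over_time approach is an approximation that works well when scrape intervals are uniform:

groups:
  - name: slo-dashboard-precompute
    rules:
      # Record cheap 5m rates once, reuse them in every panel
      - record: sli:requests_good:rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))
      - record: sli:requests_total:rate5m
        expr: sum(rate(http_requests_total[5m]))

# Approximate 30-day SLI built from the recorded series:
# sum_over_time(sli:requests_good:rate5m[30d]) / sum_over_time(sli:requests_total:rate5m[30d])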

Common Mistakes & How to Avoid Them

❌ Mistake #1: Alerting on SLO Violations

Don't alert when SLO drops below 99.9%. By then, damage is done.

✅ Solution: Alert on error budget burn rate. Fast burn = page now, slow burn = investigate later.

❌ Mistake #2: Too Many SLOs

Having 20 SLOs means none of them matter. You'll drown in alerts.

✅ Solution: Start with 3-5 SLOs max. Focus on user-facing metrics (availability, latency, errors).

❌ Mistake #3: Internal Metrics as SLIs

"CPU < 80%" is not an SLO. Users don't care about CPU, they care if the app works.

✅ Solution: Only use user-facing metrics. Ask: "Would a user notice if this SLI is violated?"

Tooling: Implementing SLOs in Production

Option 1: Prometheus + Alertmanager + Grafana (DIY)

Pros: Full control, free, flexible
Cons: Manual setup, requires PromQL expertise

Option 2: Sloth (SLO Generator)

Sloth generates Prometheus rules from SLO definitions:

# slo.yaml
version: "prometheus/v1"
service: "api"
slos:
  - name: "availability"
    objective: 99.9
    description: "API availability SLO"
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: AvailabilitySLO
      labels:
        category: "availability"
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

# Generate Prometheus rules
sloth generate -i slo.yaml -o prometheus-slo-rules.yaml
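The generated rules file is then loaded like any other Prometheus rule file:

# prometheus.yml (excerpt)
rule_files:
  - prometheus-slo-rules.yaml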

Option 3: Google Cloud Monitoring SLOs (GCP Only)

Pros: Fully managed, integrates with GCP monitoring
Cons: Vendor lock-in, requires GCP

Conclusion: SLOs as a Product Discipline

Good SLOs aren't just operational metrics—they're a product decision. They answer:

  • What level of reliability do users need? (Not want—need. 100% is wasteful.)
  • How much can we invest in reliability? (Error budget = innovation budget.)
  • When should we stop shipping features and fix bugs? (When budget is exhausted.)

Start simple: Pick 1-2 SLOs (availability + latency). Implement burn rate alerts. Track error budget in your weekly standup. Once that's working, expand to the full 5 SLOs above.

Need Help Implementing SLOs?

We design and implement production SLO monitoring: Prometheus setup, error budget dashboards, burn rate alerts, Grafana configuration, and SRE best practices.

