Published: January 2, 2025 • Updated: January 2, 2025
You've probably seen SLOs like this:
❌ "99.9% uptime"
❌ "P99 latency under 200ms"
❌ "Zero errors"
These SLO statements are hard to act on: they name a target but no measurement window, no error budget, and no alerting strategy, so in practice they aren't measurable or actionable and they generate alert fatigue.
Good SLOs have three properties: they're measurable, they're actionable, and they're backed by an error budget. To define one, you need four building blocks:
SLI (Service Level Indicator): A metric that measures a specific aspect of service quality. Examples: request success rate, P99 latency, availability percentage.
SLO (Service Level Objective): A target value for an SLI over a time period. Example: "99.9% of requests must succeed over a 30-day window."
SLA (Service Level Agreement): A contractual promise to customers, often with penalties. Example: "We guarantee 99.9% uptime or you get a refund." SLAs should be looser than your internal SLOs.
Error budget: The amount of failure allowed before the SLO is violated. If the SLO is 99.9%, the error budget is 0.1% (about 43 minutes per month). This budget can be "spent" on deployments, experiments, and incidents.
Availability. Definition: Percentage of HTTP requests that return non-5xx status codes.
SLO Target: 99.9% success rate over 30 days
Error Budget: 0.1% = 43.2 minutes of 5xx errors per month
Calculation: 30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes
# SLI: Success rate (ratio of good requests to total requests)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# This gives you a value between 0.0 (0% success) and 1.0 (100% success)
# prometheus-rules.yaml
groups:
  - name: availability-slo
    rules:
      # Fast burn: page immediately (burning ~2% of the monthly budget per hour)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Availability SLO is burning too fast"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1.44%). At this rate, the monthly error budget will be exhausted in about 2 days."

      # Slow burn: create a ticket (consuming 10% of the budget in 24 hours)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[24h]))
                 / sum(rate(http_requests_total[24h])))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Availability SLO budget at risk"
          description: "Error rate over 24h is {{ $value | humanizePercentage }}. Review error trends."
Why these numbers? A fast burn (14.4× the allowed error rate) would exhaust the monthly budget in roughly two days if nothing is fixed, so it warrants a page. A slow burn (3× the allowed rate) gives you about 10 days before exhaustion: time to investigate without panic.
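If you also want to watch burn rates on a dashboard instead of only inside alert expressions, you can precompute them with recording rules. A minimal sketch (the rule names slo:availability:burn_rate1h and slo:availability:burn_rate24h are illustrative, not a standard):

groups:
  - name: availability-slo-recordings
    rules:
      # Burn rate = observed error ratio / error budget (0.001 for a 99.9% SLO).
      # 1 means the budget is consumed exactly at the allowed pace;
      # 14.4 means the 30-day budget would be gone in 720 / 14.4 = 50 hours.
      - record: slo:availability:burn_rate1h
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) / 0.001
      - record: slo:availability:burn_rate24h
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[24h]))
                 / sum(rate(http_requests_total[24h])))
          ) / 0.001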
Latency. Definition: Percentage of requests that complete within a target duration.
SLO Target: 99% of requests complete in under 500ms over 30 days
Error Budget: 1% of requests can be slower than 500ms
Note: Use P99 (99th percentile), not average. Average hides outliers.
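To see why, compare the two queries below against the same histogram used in this section (assuming the standard http_request_duration_seconds histogram metrics):

# P99 latency: the value that 99% of requests stay under
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency: a handful of very slow requests barely move this number
sum(rate(http_request_duration_seconds_sum[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))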
# SLI: Percentage of requests faster than 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Requires histogram metrics (http_request_duration_seconds_bucket)
# Alert: Latency SLO burn rate
groups:
  - name: latency-slo
    rules:
      - alert: LatencySLOFastBurn
        expr: |
          (
            1 - (sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
                 / sum(rate(http_request_duration_seconds_count[1h])))
          ) > (14.4 * 0.01)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Latency SLO burning too fast"
          description: "{{ $value | humanizePercentage }} of requests are slower than 500ms (threshold: 14.4%). P99 latency is likely impacted."
Error Rate. Definition: Percentage of requests that return 4xx/5xx errors.
SLO Target: 99.5% of requests return 2xx/3xx over 30 days
Error Budget: 0.5% = 216 minutes of errors per month (30 days × 24 hours × 60 minutes × 0.005 = 216 minutes)
# SLI: Percentage of non-error responses
sum(rate(http_requests_total{status=~"[23].."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Or the inverse (error rate):
sum(rate(http_requests_total{status=~"[45].."}[5m]))
  / sum(rate(http_requests_total[5m]))
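The burn-rate alerting pattern from the availability SLO applies here as well. A hedged sketch for the 0.5% budget (the alert name and the 14.4 multiplier are illustrative; tune them to your own policy):

- alert: ErrorRateSLOFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"[45].."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.005)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error-rate SLO is burning too fast"
    description: "Error ratio over 1h is {{ $value | humanizePercentage }} (threshold: 7.2%)."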
Throughput / Capacity. Definition: The system handles expected peak traffic without degradation.
SLO Target: System handles 10,000 RPS with P99 latency under 500ms
Why this matters: Prevents cascading failures during traffic spikes
# SLI: Request rate is within capacity AND latency is acceptable
(
  sum(rate(http_requests_total[5m])) < 10000
)
and
(
  histogram_quantile(0.99,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) < 0.5
)

# Alert when approaching capacity
- alert: ApproachingTrafficCapacity
  expr: sum(rate(http_requests_total[5m])) > 8000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic approaching capacity limit"
    description: "Current RPS: {{ $value }}. Capacity limit: 10,000 RPS. Consider scaling."
Data Durability. Definition: Percentage of successfully written records that remain retrievable.
SLO Target: 99.999% data durability (five 9s)
Error Budget: 0.001% = about 26 seconds of data loss per month (30 days × 24 hours × 3,600 seconds × 0.00001 ≈ 26 seconds)
Implementation: Synthetic checks that write/read test data
# SLI: Synthetic data availability check
sum(rate(synthetic_data_check_success[5m]))
  / sum(rate(synthetic_data_check_total[5m]))

# Example: Write a unique test record every minute, then try to read it back.
# If the read fails, data durability is violated.
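Because a five-nines target leaves almost no error budget (26 seconds a month), alerting on this SLI is usually binary rather than burn-rate based. An illustrative sketch using the synthetic-check metrics above (the alert name and windows are assumptions):

- alert: DataDurabilityCheckFailing
  expr: |
    (
      sum(rate(synthetic_data_check_success[10m]))
      / sum(rate(synthetic_data_check_total[10m]))
    ) < 1
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Synthetic durability checks are failing"
    description: "A written test record could not be read back. With a 99.999% target, treat this as an incident immediately."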
┌──────────┬──────────────┬─────────────────┬────────────────┐
│ SLO      │ Error Budget │ Downtime/Month  │ Downtime/Year  │
├──────────┼──────────────┼─────────────────┼────────────────┤
│ 90%      │ 10%          │ 3 days          │ 36.5 days      │
│ 95%      │ 5%           │ 1.5 days        │ 18.25 days     │
│ 99%      │ 1%           │ 7.2 hours       │ 3.65 days      │
│ 99.5%    │ 0.5%         │ 3.6 hours       │ 1.83 days      │
│ 99.9%    │ 0.1%         │ 43 minutes      │ 8.76 hours     │
│ 99.95%   │ 0.05%        │ 21 minutes      │ 4.38 hours     │
│ 99.99%   │ 0.01%        │ 4.3 minutes     │ 52.6 minutes   │
│ 99.999%  │ 0.001%       │ 26 seconds      │ 5.26 minutes   │
└──────────┴──────────────┴─────────────────┴────────────────┘
Why not 100%? Every "9" you add doubles your operational cost. 99.9% → 99.99% means you need multi-region failover, chaos engineering, extensive testing. Most SaaS can deliver great UX at 99.9%.
Error budgets aren't just numbers—they're decision-making tools. Here's how to use them:
Error budget remaining above 50% → Status: Healthy
Actions: Ship new features aggressively, experiment with risky changes, schedule maintenance windows
Error budget remaining between 10% and 50% → Status: At Risk
Actions: Slow down feature velocity, focus on reliability improvements, require extra review for risky changes
Error budget remaining below 10% → Status: FREEZE
Actions: Stop all feature releases, only deploy critical bug fixes, focus 100% on reliability, post-mortem required
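To make this policy enforceable rather than aspirational, alert on the same thresholds. A sketch that reuses the 30-day budget expression from the dashboard panels below (the alert names are illustrative):

- alert: ErrorBudgetAtRisk
  expr: |
    (
      0.001 - (sum(increase(http_requests_total{status=~"5.."}[30d]))
               / sum(increase(http_requests_total[30d])))
    ) / 0.001 < 0.5
  labels:
    severity: ticket
  annotations:
    summary: "Less than 50% of the monthly error budget remains"

- alert: ErrorBudgetFreeze
  expr: |
    (
      0.001 - (sum(increase(http_requests_total{status=~"5.."}[30d]))
               / sum(increase(http_requests_total[30d])))
    ) / 0.001 < 0.1
  labels:
    severity: page
  annotations:
    summary: "Less than 10% of the monthly error budget remains: freeze feature releases"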
Create a dashboard with these panels:
# Panel 1: SLO Compliance (Last 30 days)
# Shows the current SLI value vs the SLO target
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
)
# Display as a percentage with an SLO threshold line at 0.999

# Panel 2: Error Budget Remaining
# Shows the percentage of error budget left
(
  0.001 - (1 - (sum(increase(http_requests_total{status!~"5.."}[30d]))
                / sum(increase(http_requests_total[30d]))))
) / 0.001 * 100
# Display as a gauge: Green (>50%), Yellow (10-50%), Red (<10%)

# Panel 3: Error Budget Burn Rate
# Shows the current burn rate (1.0 = consuming budget exactly at the allowed rate)
(
  1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
       / sum(rate(http_requests_total[1h])))
) / 0.001
# Display as a graph over time. Values > 1 mean you are burning faster than allowed

# Panel 4: Time to Budget Exhaustion
# Hours until the budget runs out at the current burn rate:
# remaining budget fraction / current error ratio, scaled by the 720 hours in a 30-day window
(
  (
    0.001 - (1 - (sum(increase(http_requests_total{status!~"5.."}[30d]))
                  / sum(increase(http_requests_total[30d]))))
  )
  /
  (
    1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
         / sum(rate(http_requests_total[1h])))
  )
) * 720
# Display as a stat panel with an alert threshold at 24 hours
Don't alert only when the SLO drops below 99.9%. By then, the damage is already done.
✅ Solution: Alert on error budget burn rate. Fast burn = page now, slow burn = investigate later.
Having 20 SLOs means none of them matter. You'll drown in alerts.
✅ Solution: Start with 3-5 SLOs max. Focus on user-facing metrics (availability, latency, errors).
"CPU < 80%" is not an SLO. Users don't care about CPU, they care if the app works.
✅ Solution: Only use user-facing metrics. Ask: "Would a user notice if this SLI is violated?"
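As a purely illustrative contrast, the first rule below fires on an internal resource signal that users may never notice, while the second fires only when requests are actually failing:

# Infrastructure-centric: node CPU above 80% (standard node_exporter metric)
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 10m
  labels:
    severity: warning

# User-centric: more than 1% of requests are failing
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: page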
Hand-written Prometheus rules (the DIY approach shown throughout this post). Pros: Full control, free, flexible
Cons: Manual setup, requires PromQL expertise
Sloth generates Prometheus rules from SLO definitions:
# slo.yaml
version: "prometheus/v1"
service: "api"
slos:
  - name: "availability"
    objective: 99.9
    description: "API availability SLO"
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: AvailabilitySLO
      labels:
        category: "availability"
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket

# Generate Prometheus rules
sloth generate -i slo.yaml -o prometheus-slo-rules.yaml
Google Cloud's built-in SLO monitoring. Pros: Fully managed, integrates with GCP monitoring
Cons: Vendor lock-in, requires GCP
Good SLOs aren't just operational metrics; they're a product decision about how much reliability your users actually need and how much you're willing to spend to provide it.
Start simple: Pick 1-2 SLOs (availability + latency). Implement burn rate alerts. Track error budget in your weekly standup. Once that's working, expand to the full 5 SLOs above.
We design and implement production SLO monitoring: Prometheus setup, error budget dashboards, burn rate alerts, Grafana configuration, and SRE best practices.