Before diving into specifics, let's clarify the terminology:
SLI (Service Level Indicator): A metric that measures service performance. Example: "Request success rate = 99.95%"
SLO (Service Level Objective): Internal target for an SLI. Example: "Maintain 99.9% uptime" (not contractual)
SLA (Service Level Agreement): Contractual commitment with penalties. Example: "Guarantee 99.9% uptime or 10% monthly credit"
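To make the distinction concrete, here is a minimal Python sketch that computes a request success rate (an SLI) and compares it with an internal target (an SLO). The function names `success_rate` and `meets_slo` and the request counts are illustrative assumptions, not part of any particular monitoring tool.

```python
def success_rate(successful: int, total: int) -> float:
    """Return the request success rate as a percentage (the SLI)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


# Hypothetical counts chosen to reproduce the 99.95% example above.
sli = success_rate(successful=999_500, total=1_000_000)
print(f"SLI: {sli:.2f}%")
print("Meets 99.9% SLO:", meets_slo(sli, 99.9))
```

An SLA would wrap the same comparison in a contract: the threshold is written into the agreement and a miss triggers a credit rather than an internal alert.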
P0
Definition: Complete service outage affecting all customers
Examples: Production environment down, all API requests failing, database unreachable
Response: 15 minutes (24/7)
Mitigation: 1 hour
Resolution: 4 hours
P1
Definition: Critical feature unavailable or severely degraded
Examples: Payment processing down, auth service intermittent, 50%+ error rate
Response: 30 minutes (24/7)
Mitigation: 2 hours
Resolution: 8 hours
P2
Definition: Non-critical feature degraded, workaround available
Examples: Slow queries, intermittent errors (<10%), non-prod environment issues
Response: 4 hours (business hours)
Mitigation: 8 hours
Resolution: 24 hours
P3
Definition: Minor bug, cosmetic issue, feature request
Examples: UI glitch, documentation update, configuration tweak
Response: 24 hours (business hours)
Mitigation: N/A
Resolution: Best effort (7-14 days)
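The four severity levels above map naturally onto a lookup table. The sketch below is one possible encoding, assuming the levels are labeled P0 through P3 in the order listed. The `Targets` class and `deadlines` function are hypothetical names, and the sketch deliberately treats business-hours targets as wall-clock time for simplicity, recording only a flag for them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Targets:
    response: timedelta
    mitigation: Optional[timedelta]   # None where the level lists "N/A"
    resolution: Optional[timedelta]   # None where the level is "best effort"
    business_hours_only: bool


# Values copied from the severity definitions above.
TARGETS = {
    "P0": Targets(timedelta(minutes=15), timedelta(hours=1), timedelta(hours=4), False),
    "P1": Targets(timedelta(minutes=30), timedelta(hours=2), timedelta(hours=8), False),
    "P2": Targets(timedelta(hours=4), timedelta(hours=8), timedelta(hours=24), True),
    "P3": Targets(timedelta(hours=24), None, None, True),
}


def deadlines(priority: str, opened_at: datetime) -> dict:
    """Return the due time for each phase, or None where no target applies.

    Simplification: business-hours targets are counted as wall-clock time.
    """
    t = TARGETS[priority]

    def due(delta: Optional[timedelta]) -> Optional[datetime]:
        return None if delta is None else opened_at + delta

    return {
        "response": due(t.response),
        "mitigation": due(t.mitigation),
        "resolution": due(t.resolution),
    }


opened = datetime(2025, 1, 15, 9, 0)
for phase, when in deadlines("P0", opened).items():
    print(phase, when)
```

A production version would need a business-hours calendar for P2 and P3; the flag is kept on each entry so that logic can be added without changing the table.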
┌────────────┬────────────────────────────────────────────────────────┐
│ Phase      │ Definition & Actions                                   │
├────────────┼────────────────────────────────────────────────────────┤
│ Response   │ Time until on-call engineer acknowledges & starts work │
│            │ - PagerDuty alert acknowledged                         │
│            │ - Initial triage begins                                │
│            │ - Status update posted to customer                     │
│            │                                                        │
│ Mitigation │ Time until impact is reduced (not fully fixed)         │
│            │ - Failover to backup system                            │
│            │ - Scale up resources                                   │
│            │ - Apply temporary workaround                           │
│            │ - Customer impact reduced to <10%                      │
│            │                                                        │
│ Resolution │ Time until root cause fixed & verified stable          │
│            │ - Root cause identified & fixed                        │
│            │ - Monitoring shows stable for 30+ minutes              │
│            │ - Postmortem initiated                                 │
│            │ - All systems back to normal                           │
└────────────┴────────────────────────────────────────────────────────┘
┌────────────┬───────────────┬────────────────┬────────────────┐
│ Tier       │ Uptime Target │ Downtime/Month │ SLA Credit     │
├────────────┼───────────────┼────────────────┼────────────────┤
│ Startup    │ 99.5%         │ 3.6 hours      │ 10% if <99.5%  │
│            │               │                │ 25% if <99.0%  │
│            │               │                │                │
│ Growth     │ 99.9%         │ 43 minutes     │ 10% if <99.9%  │
│            │               │                │ 25% if <99.5%  │
│            │               │                │ 50% if <99.0%  │
│            │               │                │                │
│ Enterprise │ 99.95%        │ 21 minutes     │ 10% if <99.95% │
│            │               │                │ 25% if <99.9%  │
│            │               │                │ 50% if <99.5%  │
│            │               │                │ 100% if <99.0% │
└────────────┴───────────────┴────────────────┴────────────────┘

Notes:
- Downtime = P0 incidents only
- Scheduled maintenance excluded (with 7 days notice)
- Credits automatically applied to next invoice
- Max credit: 100% of monthly fee
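The credit schedule can be expressed as a small lookup, which also makes it easy to check the downtime figures. The sketch below is illustrative: `CREDIT_SCHEDULE`, `sla_credit_percent`, and `downtime_budget_minutes` are hypothetical names, and the downtime budget assumes a 30-day month, which is what the table's figures correspond to.

```python
# Tier -> list of (uptime threshold %, credit %), ordered from the lowest
# threshold (largest credit) to the highest. Values copied from the table.
CREDIT_SCHEDULE = {
    "startup": [(99.0, 25), (99.5, 10)],
    "growth": [(99.0, 50), (99.5, 25), (99.9, 10)],
    "enterprise": [(99.0, 100), (99.5, 50), (99.9, 25), (99.95, 10)],
}


def sla_credit_percent(tier: str, uptime_percent: float) -> int:
    """Return the credit (% of monthly fee) owed for a measured uptime."""
    for threshold, credit in CREDIT_SCHEDULE[tier]:
        # The first threshold the uptime falls below carries the largest
        # applicable credit, because thresholds are sorted ascending.
        if uptime_percent < threshold:
            return credit
    return 0


def downtime_budget_minutes(target_percent: float, days: int = 30) -> float:
    """Allowed downtime per month for a given uptime target."""
    return days * 24 * 60 * (100.0 - target_percent) / 100.0


print(sla_credit_percent("growth", 99.85))
print(sla_credit_percent("enterprise", 98.7))
print(round(downtime_budget_minutes(99.9), 1), "minutes allowed at 99.9%")
```

The comparisons are strict, matching the "if <99.9%" wording: an uptime of exactly 99.9% on the Growth tier earns no credit.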
SLAs typically exclude downtime caused by:
# Month: January 2025
# Total minutes: 44,640 (31 days × 24 hours × 60 minutes)

## Incidents:
1. P0: Database failure - 25 minutes downtime
2. P1: API degradation - 15 minutes partial outage (counts as 50% = 7.5 min)
3. Scheduled maintenance - 120 minutes (excluded from SLA)

## Calculation:
Total downtime = 25 + 7.5 = 32.5 minutes
Uptime = (44,640 - 32.5) / 44,640 = 99.927%

## SLA Compliance:
Target: 99.9% (Growth tier)
Actual: 99.927%
Status: ✅ PASS (exceeds target)
Credit: None

## If it had been 99.85%:
Actual: 99.85%
Status: ❌ FAIL (below 99.9%)
Credit: 10% of monthly fee automatically applied
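The same arithmetic can be reproduced in a few lines of Python. This is a sketch under the worked example's own assumptions: each incident carries a weight (1.0 for a full outage, 0.5 for the partial outage), and excluded events such as scheduled maintenance are simply left out of the list. The function name `uptime_percent` is hypothetical. A contract that counts only P0 incidents would set the partial-outage weight to 0.

```python
def uptime_percent(total_minutes: float, incidents: list) -> float:
    """Compute uptime from a list of (duration_minutes, weight) incidents."""
    downtime = sum(minutes * weight for minutes, weight in incidents)
    return 100.0 * (total_minutes - downtime) / total_minutes


total = 31 * 24 * 60          # January: 44,640 minutes
incidents = [
    (25, 1.0),                # P0 database failure, full outage
    (15, 0.5),                # P1 API degradation, counted at 50%
    # Scheduled maintenance (120 min) is excluded, so it is not listed.
]

uptime = uptime_percent(total, incidents)
target = 99.9                 # Growth tier
print(f"Uptime: {uptime:.3f}%")
print("PASS" if uptime >= target else "FAIL")
```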
Bottom line: A good SLA is specific, measurable, and backed by financial commitments. If it's vague or has no teeth, it's marketing, not a real SLA.
Copyright © 2025 HostingX IL. All Rights Reserved.