What's in a Managed Platform SLA

A complete breakdown of managed platform service level agreements (SLAs): P0-P3 incident definitions, response times, resolution targets, mitigation strategies, and reporting requirements.

Understanding SLAs vs SLOs vs SLIs

Before diving into specifics, let's clarify the terminology:

SLI (Service Level Indicator)

A metric that measures service performance. Example: "Request success rate = 99.95%"

SLO (Service Level Objective)

Internal target for an SLI. Example: "Maintain 99.9% uptime" (not contractual)

SLA (Service Level Agreement)

Contractual commitment with penalties. Example: "Guarantee 99.9% uptime or 10% monthly credit"
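To make the distinction concrete, here is a minimal Python sketch that computes a request success rate (the SLI) and compares it against two thresholds. The `RequestCounts` type, the function names, and the request counts are hypothetical and chosen only for illustration; the 99.95% internal objective is likewise an assumed value, while the 99.9% contractual figure comes from the SLA example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical request tallies for one measurement window."""
    total: int
    successful: int


def success_rate_sli(counts: RequestCounts) -> float:
    """SLI: percentage of requests that succeeded in the window."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully successful.
        return 100.0
    return 100.0 * counts.successful / counts.total


def meets_target(sli_percent: float, target_percent: float) -> bool:
    """SLOs and SLAs are checked with the same comparison.

    They differ in consequences (internal alarm vs. contractual credit),
    not in arithmetic.
    """
    return sli_percent >= target_percent


SLO_TARGET = 99.95  # assumed internal objective, stricter than the SLA
SLA_TARGET = 99.9   # contractual commitment from the example above

counts = RequestCounts(total=200_000, successful=199_900)
sli = success_rate_sli(counts)

print(f"SLI: {sli:.2f}%")
print("SLO met:", meets_target(sli, SLO_TARGET))
print("SLA met:", meets_target(sli, SLA_TARGET))
```

Setting the internal SLO tighter than the SLA is a common pattern: the team gets a warning before the contractual threshold is crossed.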

Incident Severity Definitions

P0: Critical (Total Outage)

Definition: Complete service outage affecting all customers
Examples: Production environment down, all API requests failing, database unreachable

Response: 15 minutes, 24/7
Mitigation: 1 hour
Resolution: 4 hours

P1: High (Major Functionality Loss)

Definition: Critical feature unavailable or severely degraded
Examples: Payment processing down, auth service intermittent, 50%+ error rate

Response: 30 minutes, 24/7
Mitigation: 2 hours
Resolution: 8 hours

P2: Medium (Partial Degradation)

Definition: Non-critical feature degraded, workaround available
Examples: Slow queries, intermittent errors (<10%), non-prod environment issues

Response: 4 hours (business hours)
Mitigation: 8 hours
Resolution: 24 hours

P3: Low (Minor Issues)

Definition: Minor bug, cosmetic issue, feature request
Examples: UI glitch, documentation update, configuration tweak

Response: 24 hours (business hours)
Mitigation: N/A
Resolution: Best effort (7-14 days)
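The targets above can be captured as data and turned into concrete deadlines. The following is a sketch under stated assumptions: the names `Targets`, `SEVERITY_TARGETS`, and `deadlines` are hypothetical, every clock is assumed to start when the incident is opened, and the business-hours qualifier on P2 and P3 is ignored (all targets are treated as wall-clock time).

```python
from datetime import datetime, timedelta
from typing import Dict, NamedTuple, Optional


class Targets(NamedTuple):
    response: timedelta
    mitigation: Optional[timedelta]  # None where the definition says N/A
    resolution: Optional[timedelta]  # None where resolution is best effort


# Values copied from the severity definitions above.
SEVERITY_TARGETS: Dict[str, Targets] = {
    "P0": Targets(timedelta(minutes=15), timedelta(hours=1), timedelta(hours=4)),
    "P1": Targets(timedelta(minutes=30), timedelta(hours=2), timedelta(hours=8)),
    "P2": Targets(timedelta(hours=4), timedelta(hours=8), timedelta(hours=24)),
    "P3": Targets(timedelta(hours=24), None, None),
}


def deadlines(severity: str, opened_at: datetime) -> Dict[str, Optional[datetime]]:
    """Return the deadline for each phase, or None where no target applies.

    Assumption: all targets are wall-clock durations measured from
    opened_at; business-hours handling for P2/P3 is not modeled.
    """
    targets = SEVERITY_TARGETS[severity]
    return {
        phase: (opened_at + limit if limit is not None else None)
        for phase, limit in targets._asdict().items()
    }


opened = datetime(2025, 1, 10, 9, 0)
for phase, due in deadlines("P0", opened).items():
    print(f"P0 {phase} due by: {due}")
```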

Response vs Mitigation vs Resolution

┌────────────┬────────────────────────────────────────────────────────┐
│ Phase      │ Definition & Actions                                   │
├────────────┼────────────────────────────────────────────────────────┤
│ Response   │ Time until on-call engineer acknowledges & starts work │
│            │ - PagerDuty alert acknowledged                         │
│            │ - Initial triage begins                                │
│            │ - Status update posted to customer                     │
│            │                                                        │
│ Mitigation │ Time until impact is reduced (not fully fixed)         │
│            │ - Failover to backup system                            │
│            │ - Scale up resources                                   │
│            │ - Apply temporary workaround                           │
│            │ - Customer impact reduced to <10%                      │
│            │                                                        │
│ Resolution │ Time until root cause fixed & verified stable          │
│            │ - Root cause identified & fixed                        │
│            │ - Monitoring shows stable for 30+ minutes              │
│            │ - Postmortem initiated                                 │
│            │ - All systems back to normal                           │
└────────────┴────────────────────────────────────────────────────────┘
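As an illustration of how the three phases are measured, the sketch below takes the timestamps of a single incident and reports which targets were missed. The function names and the sample timestamps are hypothetical, and measuring every phase from the moment the incident was opened is an assumption; a real contract should state when each clock starts.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def phase_durations(
    opened: datetime,
    acknowledged: datetime,
    mitigated: datetime,
    resolved: datetime,
) -> Dict[str, timedelta]:
    """Elapsed time for each phase, measured from incident open (assumption)."""
    return {
        "response": acknowledged - opened,
        "mitigation": mitigated - opened,
        "resolution": resolved - opened,
    }


def breached_phases(
    durations: Dict[str, timedelta], targets: Dict[str, timedelta]
) -> List[str]:
    """Names of the phases whose elapsed time exceeded the target."""
    return [phase for phase, limit in targets.items() if durations[phase] > limit]


# P1 targets from the severity definitions: 30 min / 2 h / 8 h.
P1_TARGETS = {
    "response": timedelta(minutes=30),
    "mitigation": timedelta(hours=2),
    "resolution": timedelta(hours=8),
}

# Made-up timestamps for a single P1 incident.
durations = phase_durations(
    opened=datetime(2025, 1, 10, 9, 0),
    acknowledged=datetime(2025, 1, 10, 9, 20),   # 20 min: within target
    mitigated=datetime(2025, 1, 10, 11, 30),     # 2 h 30 min: over target
    resolved=datetime(2025, 1, 10, 15, 0),       # 6 h: within target
)
breaches = breached_phases(durations, P1_TARGETS)
print("Breached phases:", breaches)
```

In this made-up incident only the mitigation target is missed, which shows why the three clocks are tracked separately: a fast acknowledgement and a timely fix do not hide a slow mitigation.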

Availability SLA Tiers

┌────────────┬───────────────┬────────────────┬────────────────┐
│ Tier       │ Uptime Target │ Downtime/Month │ SLA Credit     │
├────────────┼───────────────┼────────────────┼────────────────┤
│ Startup    │ 99.5%         │ 3.6 hours      │ 10% if <99.5%  │
│            │               │                │ 25% if <99.0%  │
│            │               │                │                │
│ Growth     │ 99.9%         │ 43.2 minutes   │ 10% if <99.9%  │
│            │               │                │ 25% if <99.5%  │
│            │               │                │ 50% if <99.0%  │
│            │               │                │                │
│ Enterprise │ 99.95%        │ 21.6 minutes   │ 10% if <99.95% │
│            │               │                │ 25% if <99.9%  │
│            │               │                │ 50% if <99.5%  │
│            │               │                │ 100% if <99.0% │
└────────────┴───────────────┴────────────────┴────────────────┘

Notes:
- Downtime/Month is the maximum allowed downtime in a 30-day month
- Downtime = P0 incidents only
- Scheduled maintenance excluded (with 7 days notice)
- Credits automatically applied to next invoice
- Max credit: 100% of monthly fee
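The credit schedule is a simple threshold lookup, and the downtime allowance follows directly from the uptime target. The sketch below encodes both; the names `CREDIT_SCHEDULES`, `sla_credit_percent`, and `allowed_downtime_minutes` are hypothetical, and the 30-day default is an assumption about the billing month.

```python
from typing import Dict, List, Tuple

# (uptime threshold %, credit %) pairs per tier, most severe threshold first.
CREDIT_SCHEDULES: Dict[str, List[Tuple[float, int]]] = {
    "Startup": [(99.0, 25), (99.5, 10)],
    "Growth": [(99.0, 50), (99.5, 25), (99.9, 10)],
    "Enterprise": [(99.0, 100), (99.5, 50), (99.9, 25), (99.95, 10)],
}


def sla_credit_percent(tier: str, uptime_percent: float) -> int:
    """Credit owed as a percentage of the monthly fee (0 if the SLA was met)."""
    for threshold, credit in CREDIT_SCHEDULES[tier]:
        # Thresholds ascend, so the first match is the largest applicable credit.
        if uptime_percent < threshold:
            return credit
    return 0


def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Downtime budget implied by an uptime target over a month of `days` days."""
    return days * 24 * 60 * (100 - target_percent) / 100


print(sla_credit_percent("Growth", 99.85))         # 10
print(sla_credit_percent("Enterprise", 99.2))      # 50
print(round(allowed_downtime_minutes(99.9), 1))    # 43.2
```

Note that the comparison is strict: uptime exactly equal to the target earns no credit, matching the "<" in the table.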

Communication & Reporting Requirements

During Incidents

P0: Updates every 30 minutes via Slack + status page
P1: Updates every 60 minutes via Slack
P2: Initial update, then updates at major milestones
All: Final resolution summary within 24 hours of resolution
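The update cadence above can be expressed as a small scheduling helper. This is a sketch only: the names are hypothetical, P2 is modeled as having no fixed interval because its updates are milestone-driven, and the 24-hour summary window is assumed to start at the time of resolution.

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed update intervals; P2 has none because updates follow milestones.
UPDATE_INTERVALS = {
    "P0": timedelta(minutes=30),
    "P1": timedelta(minutes=60),
}

# Assumption: the summary window is counted from the resolution time.
SUMMARY_WINDOW = timedelta(hours=24)


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """When the next customer update is due, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVALS.get(severity)
    return last_update + interval if interval is not None else None


def summary_due(resolved_at: datetime) -> datetime:
    """Deadline for the final resolution summary."""
    return resolved_at + SUMMARY_WINDOW


last = datetime(2025, 1, 10, 9, 0)
print("P0 next update:", next_update_due("P0", last))
print("P2 next update:", next_update_due("P2", last))
print("Summary due:", summary_due(datetime(2025, 1, 10, 13, 0)))
```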

Monthly Reporting

Availability Report: Uptime %, downtime breakdown, SLA compliance
Incident Summary: All P0/P1/P2 incidents with MTTR
Performance Metrics: Latency P50/P95/P99, error rates
Capacity Report: Resource utilization, scaling recommendations
Cost Report: Cloud spend breakdown, optimization opportunities

Exclusions from SLA

SLAs typically exclude downtime caused by:

Scheduled Maintenance: With 7 days advance notice (max 4 hours/month)
Customer-Caused Issues: Application bugs, configuration errors, resource exhaustion
Third-Party Failures: AWS outages, DNS provider issues, payment gateway down
Force Majeure: Natural disasters, war, government action
Security Incidents: DDoS attacks, intrusions (mitigation best-effort)

Example SLA Calculation

# Month: January 2025
# Total minutes: 44,640 (31 days × 24 hours × 60 minutes)

## Incidents:
1. P0: Database failure - 25 minutes downtime
2. P1: API degradation - 15 minutes partial outage (counts as 50% = 7.5 min)
3. Scheduled maintenance - 120 minutes (excluded from SLA)

Note: weighting the P1 partial outage at 50% is this example's convention. Under a strict "downtime = P0 incidents only" rule, only the 25 minutes would count (99.944% uptime), and the result would still be a pass.

## Calculation:
Total downtime = 25 + 7.5 = 32.5 minutes
Uptime = (44,640 - 32.5) / 44,640 = 99.927%

## SLA Compliance:
Target: 99.9% (Growth tier)
Actual: 99.927%
Status: ✅ PASS (exceeds target)
Credit: None

## If it had been 99.85%:
Actual: 99.85%
Status: ❌ FAIL (below 99.9%)
Credit: 10% of monthly fee automatically applied
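The same calculation can be reproduced in a few lines of Python. The incident tuples and the `uptime_percent` helper are hypothetical names used for illustration; the weights (100% for the P0 outage, 50% for the P1 partial outage, 0% for excluded maintenance) follow the example's own convention rather than any universal rule.

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime as a percentage of the total minutes in the period."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


TOTAL_MINUTES = 31 * 24 * 60  # January: 44,640 minutes

# (description, duration in minutes, weight applied to the duration)
incidents = [
    ("P0 database failure", 25, 1.0),    # full outage counts in full
    ("P1 API degradation", 15, 0.5),     # partial outage counted at 50%
    ("Scheduled maintenance", 120, 0.0), # excluded from the SLA
]

downtime = sum(minutes * weight for _, minutes, weight in incidents)
uptime = uptime_percent(TOTAL_MINUTES, downtime)

TARGET = 99.9  # Growth tier
print(f"Counted downtime: {downtime} minutes")
print(f"Uptime: {uptime:.3f}%")
print("Status:", "PASS" if uptime >= TARGET else "FAIL")
```

Keeping the weight next to each incident makes the contract's counting rules explicit and easy to audit when a customer disputes a monthly report.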

What to Look for in a Managed SLA

✅ Good SLA Indicators

Clear severity definitions with examples
Specific response times (not "as soon as possible")
Financial penalties (SLA credits) for non-compliance
Transparent reporting with public status page
Reasonable exclusions (not everything excluded)

🚩 Red Flags

"Best effort" language with no specific commitments
No financial penalties for SLA violations
Overly broad exclusions ("any third-party failure")
Vague definitions ("critical issues resolved quickly")
No reporting or transparency into actual uptime

Bottom line: A good SLA is specific, measurable, and backed by financial commitments. If it's vague or has no teeth, it's marketing, not a real SLA.