Incident Response Automation: Runbooks as Code

Complete guide to automating incident response with executable runbooks, ChatOps workflows, PagerDuty/Slack integration, and automated remediation. Reduce MTTR by 70%.

Published: January 2, 2025

The Incident Response Problem

It's 3 AM. PagerDuty fires. Your on-call engineer wakes up, reads the alert, and remembers there's a runbook somewhere in Confluence. They search for 10 minutes, find an outdated doc from 2022, try the steps, and discover the commands no longer work. They escalate to a senior engineer, who fixes the issue manually.

Mean Time To Resolution (MTTR): about 45 minutes. Most of that went to searching for runbooks and figuring out what to do.

With automated runbooks: Alert fires → Slack bot suggests remediation → Engineer clicks "Run" → Fix runs in about 3 minutes. MTTR drops by 70% or more (the phase-by-phase breakdown at the end of this guide shows 47 minutes falling to 9).

What You'll Implement

Runbooks as Code: Executable scripts in Git, not wiki pages
ChatOps Integration: Slack/MS Teams bot for incident commands
Automated Remediation: Self-healing for common issues (disk full, pod restart)
Context Gathering: Auto-fetch logs, metrics, traces on alert
Postmortem Automation: Generate timeline from chat logs

Runbooks as Code: The Foundation

Traditional runbooks are wiki pages in Confluence. They rot quickly because nothing tests them. Runbooks as code are executable scripts that live in Git, get tested in CI, and can run automatically.

```bash
#!/bin/bash
# runbooks/restart-api-pod.sh
set -euo pipefail

# Metadata for automation
# @trigger alert:api_high_memory
# @severity P2
# @approval_required false

echo "🔍 Checking API pod health..."

# Get the name of the first running API pod
POD=$(kubectl get pods -n production -l app=api \
  --field-selector=status.phase=Running \
  -o jsonpath='{.items[0].metadata.name}')

# Stop early instead of running kubectl delete with an empty name
if [ -z "$POD" ]; then
  echo "❌ No running API pod found" >&2
  exit 1
fi

echo "📊 Current memory usage:"
kubectl top pod "$POD" -n production

echo "🔄 Restarting pod $POD..."
kubectl delete pod "$POD" -n production

echo "⏳ Waiting for new pod to be ready..."
kubectl wait --for=condition=ready pod -l app=api -n production --timeout=60s

echo "✅ Pod restarted successfully"
kubectl get pods -n production -l app=api
```
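The `@trigger`, `@severity`, and `@approval_required` comments in the script header are only useful if tooling can read them. The following is a minimal sketch of such a reader, assuming each metadata entry sits on its own comment line in the form `# @key value`; the function name `parse_runbook_metadata` is hypothetical and not part of any existing tool.

```python
import re

# Matches header lines such as "# @severity P2": the key, then the rest of the line.
METADATA_LINE = re.compile(r"^#\s*@(\w+)\s+(.+?)\s*$")


def parse_runbook_metadata(script_text: str) -> dict:
    """Collect @key value pairs from the comment lines of a runbook script."""
    metadata = {}
    for line in script_text.splitlines():
        match = METADATA_LINE.match(line.strip())
        if match:
            key, value = match.groups()
            metadata[key] = value
    return metadata


if __name__ == "__main__":
    example = (
        "#!/bin/bash\n"
        "# @trigger alert:api_high_memory\n"
        "# @severity P2\n"
        "# @approval_required false\n"
        "echo restarting\n"
    )
    print(parse_runbook_metadata(example))
```

A dispatcher could use the parsed `trigger` value to map alerts to scripts and the `approval_required` value to decide whether a human has to confirm the run.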

Runbook Repository Structure

runbooks/
├── README.md                    # Index of all runbooks
├── common/
│   ├── gather-context.sh        # Fetch logs, metrics, traces
│   └── notify-slack.sh          # Post to incident channel
├── api/
│   ├── high-memory.sh           # Restart pod when OOM
│   ├── rate-limited.sh          # Scale up replicas
│   └── database-connection.sh   # Reset connection pool
├── database/
│   ├── high-cpu.sh              # Analyze slow queries
│   ├── replication-lag.sh       # Force sync replica
│   └── disk-space.sh            # Archive old logs
└── network/
    ├── dns-resolution.sh        # Flush DNS cache
    └── ssl-certificate.sh       # Renew expiring cert

Each runbook:
1. Is executable (chmod +x)
2. Has metadata comments (@trigger, @severity)
3. Outputs human-readable status
4. Returns exit code 0 on success
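The first two conventions in that list can be checked mechanically in CI. Here is a minimal sketch of such a check, assuming the metadata keys appear literally in the script text; `check_runbook` and `REQUIRED_KEYS` are hypothetical names, and the exit-code convention is not covered because it can only be verified by actually running the script.

```python
import os
import stat
import sys

# Metadata keys every runbook is expected to declare (from the conventions list).
REQUIRED_KEYS = ("@trigger", "@severity")


def check_runbook(path: str) -> list:
    """Return a list of problems found in one runbook file (empty if none)."""
    problems = []
    # Convention 1: the owner-executable bit must be set.
    if not os.stat(path).st_mode & stat.S_IXUSR:
        problems.append("not executable (run chmod +x)")
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    # Convention 2: required metadata comments must be present.
    for key in REQUIRED_KEYS:
        if key not in text:
            problems.append(f"missing {key} metadata")
    return problems


if __name__ == "__main__":
    failed = False
    for runbook in sys.argv[1:]:
        for problem in check_runbook(runbook):
            print(f"{runbook}: {problem}")
            failed = True
    sys.exit(1 if failed else 0)
```

Run against every `.sh` file under `runbooks/`, a non-zero exit would fail the pipeline before a broken runbook reaches the on-call rotation.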

ChatOps Integration with Slack

ChatOps = Operations via chat. When an alert fires, a bot posts to Slack with suggested runbooks. An engineer clicks a button, and the bot runs the script and posts the output back to the channel.

```python
# bot.py - Slack bot built with the Bolt framework
import os
import re
import subprocess

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

# Accept names like "api/high-memory"; reject "..", absolute paths, and odd characters
RUNBOOK_NAME = re.compile(r"^[a-z0-9-]+(/[a-z0-9-]+)*$")


@app.command("/runbook")
def handle_runbook(ack, command, client):
    ack()
    runbook_name = command["text"].strip()
    runbook_path = f"runbooks/{runbook_name}.sh"

    if not RUNBOOK_NAME.match(runbook_name) or not os.path.exists(runbook_path):
        client.chat_postMessage(
            channel=command["channel_id"],
            text=f"❌ Runbook '{runbook_name}' not found",
        )
        return

    # Post confirmation with Run / Cancel buttons
    client.chat_postMessage(
        channel=command["channel_id"],
        text=f"🤖 Ready to run runbook: `{runbook_name}`",
        blocks=[
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"Run `{runbook_name}`?"},
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "✅ Run"},
                        "style": "primary",
                        "action_id": f"run_runbook:{runbook_name}",
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "❌ Cancel"},
                        "action_id": "cancel",
                    },
                ],
            },
        ],
    )


# Bolt matches action_id by exact string or compiled regex, not by glob pattern
@app.action(re.compile(r"^run_runbook:.+"))
def handle_run_runbook(ack, action, client, body):
    ack()
    runbook_name = action["action_id"].split(":", 1)[1]
    channel = body["channel"]["id"]
    user_id = body["user"]["id"]

    client.chat_postMessage(
        channel=channel,
        text=f"🏃 Running `{runbook_name}` (started by <@{user_id}>)...",
    )

    try:
        result = subprocess.run(
            [f"runbooks/{runbook_name}.sh"],
            capture_output=True,
            text=True,
            timeout=300,
        )
        # subprocess.run does not raise on a non-zero exit, so check it explicitly
        if result.returncode == 0:
            status = "✅ Runbook completed"
        else:
            status = f"❌ Runbook exited with code {result.returncode}"
        output = result.stdout + result.stderr
        client.chat_postMessage(channel=channel, text=f"{status}\n```\n{output}\n```")
    except Exception as e:
        client.chat_postMessage(channel=channel, text=f"❌ Runbook failed: {e}")


@app.action("cancel")
def handle_cancel(ack):
    # Acknowledge the click so Slack does not show an error to the user
    ack()


if __name__ == "__main__":
    handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    handler.start()
```
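Because the runbook name arrives as free text from chat, it is worth resolving it against the runbook directory before anything is executed. The sketch below shows one way to do that with `pathlib`; `resolve_runbook` is a hypothetical helper, and the default `runbooks` root simply mirrors the directory layout used in this guide.

```python
from pathlib import Path
from typing import Optional


def resolve_runbook(name: str, root: str = "runbooks") -> Optional[Path]:
    """Return the script path for `name`, or None if it is missing or escapes `root`."""
    root_dir = Path(root).resolve()
    candidate = (root_dir / f"{name}.sh").resolve()
    # Reject names such as "../../etc/passwd" that resolve outside the runbook tree.
    if root_dir not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    for requested in ("api/high-memory", "../secrets"):
        print(requested, "->", resolve_runbook(requested))
```

A handler could call this first and pass the returned path to `subprocess.run`, replying "not found" whenever the result is `None`.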

PagerDuty Integration: Auto-Suggest Runbooks

When PagerDuty fires an alert, automatically post to Slack with a relevant runbook, chosen by matching the incident's alert key against a lookup table.

```python
# pagerduty-webhook-handler.py
import os

import requests
from flask import Flask, request

app = Flask(__name__)

# Alert-key substring -> runbook name (relative to runbooks/, without .sh)
RUNBOOK_MAP = {
    "high_memory": "api/high-memory",
    "high_cpu": "database/high-cpu",
    "disk_full": "database/disk-space",
    "ssl_expiring": "network/ssl-certificate",
}


@app.route("/webhook", methods=["POST"])
def handle_pagerduty_webhook():
    # NOTE: the field names below follow the payload shape this example assumes;
    # check them against the PagerDuty webhook version you subscribe to.
    event = request.get_json(silent=True) or {}

    if event.get("event") == "incident.triggered":
        incident = event["incident"]
        alert_key = incident.get("alert_key", "")

        # Pick the first runbook whose key appears in the alert key
        runbook = next(
            (rb for key, rb in RUNBOOK_MAP.items() if key in alert_key), None
        )

        message = (
            f"🚨 *Incident: {incident['title']}*\n"
            f"Severity: {incident['urgency']}\n"
            f"Service: {incident['service']['name']}\n\n"
            "📊 Context:\n"
            f"• Incident URL: {incident['html_url']}\n"
            f"• Triggered: {incident['created_at']}\n"
        )

        if runbook:
            message += (
                f"\n🤖 *Suggested Runbook:* `{runbook}`\n"
                f"Run with: `/runbook {runbook}`\n"
            )

        # Post to Slack
        requests.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
            json={"channel": "#incidents", "text": message},
            timeout=10,
        )

    return {"status": "ok"}


if __name__ == "__main__":
    app.run(port=5000)
```
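A webhook endpoint that can trigger runbook suggestions should only accept requests that really come from PagerDuty. The sketch below assumes the signing scheme used by PagerDuty's V3 webhooks, an HMAC-SHA256 of the raw request body sent in an `X-PagerDuty-Signature` header as comma-separated `v1=<hex digest>` entries; treat that header format as an assumption to confirm against your subscription, and `is_valid_signature` as a hypothetical helper name.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check a raw request body against a header such as 'v1=<hex>,v1=<hex>'."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    for item in signature_header.split(","):
        version, _, digest = item.strip().partition("=")
        # compare_digest runs in constant time, so it leaks nothing about the match
        if version == "v1" and hmac.compare_digest(digest.encode(), expected.encode()):
            return True
    return False


if __name__ == "__main__":
    body = b'{"example": true}'
    good = hmac.new(b"shared-secret", body, hashlib.sha256).hexdigest()
    print(is_valid_signature("shared-secret", body, f"v1={good}"))
    print(is_valid_signature("shared-secret", body, "v1=not-the-digest"))
```

In a Flask handler, the raw body would come from `request.get_data()` and the header from `request.headers.get("X-PagerDuty-Signature", "")`, with a 401 returned when the check fails.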

Auto-Remediation: Self-Healing Infrastructure

For common, low-risk issues, don't wait for human intervention. Run runbooks automatically.

```yaml
# alertmanager.yml - Alertmanager config with auto-remediation
route:
  receiver: 'slack'
  routes:
    - match:
        severity: P3
        auto_remediate: "true"
      receiver: 'auto-remediate'
      continue: true    # keep evaluating the sibling routes below
    - receiver: 'slack' # catch-all sibling, so remediated alerts also reach Slack

receivers:
  - name: 'auto-remediate'
    webhook_configs:
      - url: 'http://runbook-executor:8080/auto-remediate'
        send_resolved: false
  - name: 'slack'
    slack_configs:
      - api_url: '<webhook_url>'
        channel: '#alerts'
```

```python
# auto-remediate.py - runbook-executor service
import logging
import os
import subprocess

import requests
from flask import Flask, request

app = Flask(__name__)

AUTO_REMEDIATE_MAP = {
    "api_high_memory": "api/high-memory.sh",
    "disk_space_warning": "database/disk-space.sh",
    "pod_crashloop": "common/restart-pod.sh",
}


def post_to_slack(text):
    """Post a status message to the alerts channel."""
    requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
        json={"channel": "#alerts", "text": text},
        timeout=10,
    )


@app.route("/auto-remediate", methods=["POST"])
def auto_remediate():
    # Only the first alert is handled; a grouped notification may carry more
    alert = request.json["alerts"][0]
    alert_name = alert["labels"]["alertname"]

    if alert_name in AUTO_REMEDIATE_MAP:
        runbook = AUTO_REMEDIATE_MAP[alert_name]
        logging.info(f"Auto-remediating {alert_name} with {runbook}")

        try:
            result = subprocess.run(
                [f"runbooks/{runbook}"],
                capture_output=True,
                text=True,
                timeout=300,
            )

            # Post result to Slack
            post_to_slack(
                f"🤖 Auto-remediation for {alert_name}\n"
                f"Status: {'✅ Success' if result.returncode == 0 else '❌ Failed'}\n"
                f"```{result.stdout}```"
            )
            return {"status": "executed", "exit_code": result.returncode}
        except Exception as e:
            logging.error(f"Auto-remediation failed: {e}")
            post_to_slack(f"❌ Auto-remediation failed: {e}")
            return {"status": "error", "message": str(e)}, 500

    return {"status": "no_runbook_found"}, 404


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```
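Automatic execution needs a brake: if the same alert keeps firing, restarting the same pod every few minutes hides a real problem instead of fixing it. One possible guardrail, sketched here under the assumption that a per-alert run limit inside a sliding time window is acceptable, is to refuse further runs once the limit is hit and hand the incident to a human. `RemediationGuard` and its defaults are hypothetical.

```python
import time
from collections import defaultdict, deque


class RemediationGuard:
    """Allow at most `max_runs` executions per alert name within `window_seconds`."""

    def __init__(self, max_runs=3, window_seconds=3600, clock=time.monotonic):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the guard can be tested without sleeping
        self._runs = defaultdict(deque)  # alert name -> timestamps of recent runs

    def allow(self, alert_name: str) -> bool:
        now = self.clock()
        runs = self._runs[alert_name]
        # Drop runs that have aged out of the window.
        while runs and now - runs[0] >= self.window_seconds:
            runs.popleft()
        if len(runs) >= self.max_runs:
            return False  # limit reached: the caller should escalate instead
        runs.append(now)
        return True


if __name__ == "__main__":
    guard = RemediationGuard(max_runs=2, window_seconds=600)
    for attempt in range(3):
        print(attempt, guard.allow("api_high_memory"))
```

The executor would call `guard.allow(alert_name)` before `subprocess.run` and post an escalation message to Slack when it returns `False`.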

Context Gathering: Auto-Fetch Diagnostics

When an incident fires, automatically gather context: recent logs, metrics, traces, config changes.

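```bash
#!/bin/bash
# runbooks/common/gather-context.sh
# Usage: gather-context.sh <service> <namespace>

SERVICE=${1:?usage: gather-context.sh <service> <namespace>}
NAMESPACE=${2:?usage: gather-context.sh <service> <namespace>}
TIME_WINDOW="15m"

echo "📊 Gathering context for $SERVICE in $NAMESPACE..."

# Recent logs (errors only). printf is used because bash's echo does not expand \n
printf '\n🔍 Recent ERROR logs:\n'
kubectl logs -n "$NAMESPACE" -l app="$SERVICE" --tail=50 --since="$TIME_WINDOW" \
  | grep ERROR || echo "(no ERROR lines found)"

# Pod status
printf '\n📦 Pod status:\n'
kubectl get pods -n "$NAMESPACE" -l app="$SERVICE"

# Resource usage
printf '\n💻 Resource usage:\n'
kubectl top pods -n "$NAMESPACE" -l app="$SERVICE"

# Recent deployments
printf '\n🚀 Recent deployments:\n'
kubectl rollout history "deployment/$SERVICE" -n "$NAMESPACE" | tail -5

# Prometheus query: 5xx rate over the time window.
# -G with --data-urlencode keeps curl from misreading the braces and brackets.
printf '\n📈 Error rate (last %s):\n' "$TIME_WINDOW"
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{service='$SERVICE',status=~'5..'}[$TIME_WINDOW]))" \
  | jq -r '.data.result[0].value[1]'

# Recent slow trace (if Tempo is available); Tempo's search API filters by tags
printf '\n🔎 Recent slow trace:\n'
curl -s -G "http://tempo:3100/api/search" \
  --data-urlencode "tags=service.name=$SERVICE" \
  --data-urlencode "minDuration=1s" \
  --data-urlencode "limit=1" \
  | jq -r '.traces[0].traceID'

printf '\n✅ Context gathering complete\n'
```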
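The Prometheus step is the easiest one to get subtly wrong, because a PromQL selector contains braces, brackets, and quotes that must be URL-encoded before they go into a query string. The sketch below builds the instant-query URL in Python with the standard library; the metric name and the `http://prometheus:9090` address are taken from the script, while `error_rate_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def error_rate_query_url(service: str, window: str = "15m",
                         base_url: str = "http://prometheus:9090") -> str:
    """Build an instant-query URL for the summed 5xx request rate of one service."""
    promql = (
        f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))'
    )
    # urlencode percent-encodes the braces, brackets, and quotes in the expression.
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


if __name__ == "__main__":
    print(error_rate_query_url("api"))
```

The resulting URL can be fetched with any HTTP client; the value of interest sits at `data.result[0].value[1]` in the JSON response, as in the shell version.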

Postmortem Automation

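Generate a postmortem draft automatically from the incident channel's Slack history: the timeline and the actions taken are extracted from the messages, and the remaining sections are left for the incident commander to fill in.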

```python
# generate-postmortem.py
import os
from datetime import datetime

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def generate_postmortem(incident_channel, start_time, resolution_time):
    """Draft a postmortem; start_time and resolution_time are Unix timestamps in seconds."""
    # Fetch messages from the incident channel (first page only; paginate for long incidents)
    response = client.conversations_history(
        channel=incident_channel,
        oldest=str(start_time),
        latest=str(resolution_time),
    )

    timeline = []
    actions_taken = []

    # The API returns newest messages first; reverse for a chronological timeline
    for msg in reversed(response["messages"]):
        timestamp = datetime.fromtimestamp(float(msg["ts"]))
        text = msg.get("text", "")

        # Extract timeline events
        if "alert" in text.lower() or "incident" in text.lower():
            timeline.append(f"- **{timestamp.strftime('%H:%M:%S')}**: {text}")

        # Extract actions
        if "/runbook" in text or "ran" in text.lower():
            actions_taken.append(f"- {text}")

    # One list item per line
    timeline_md = "\n".join(timeline)
    actions_md = "\n".join(actions_taken)

    # Generate markdown postmortem
    postmortem = f"""# Postmortem: {incident_channel}

**Date**: {datetime.now().strftime('%Y-%m-%d')}
**Duration**: {(resolution_time - start_time) / 60:.0f} minutes
**Severity**: P1
**Status**: Resolved

## Summary
[Brief description of what happened]

## Timeline
{timeline_md}

## Root Cause
[To be filled in by incident commander]

## Actions Taken
{actions_md}

## What Went Well
- ChatOps commands executed successfully
- Auto-remediation contained the issue
- Context gathering provided immediate diagnostics

## What Went Wrong
[To be filled in]

## Action Items
- [ ] [Action item 1]
- [ ] [Action item 2]

## Lessons Learned
[To be filled in]
"""

    # Save locally; commit the file to Git afterwards
    os.makedirs("postmortems", exist_ok=True)
    with open(f"postmortems/{incident_channel}.md", "w") as f:
        f.write(postmortem)

    return postmortem
```
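The timeline extraction is easier to test when it is separated from the Slack client. The sketch below is one way to do that: a pure function over message dictionaries shaped like the items in a `conversations.history` response (a `ts` string and a `text` field). `build_timeline` is a hypothetical name, and rendering the times in UTC is an assumption made so the output does not depend on the machine's time zone.

```python
from datetime import datetime, timezone


def build_timeline(messages: list) -> str:
    """Render alert and incident messages as a chronological markdown list."""
    entries = []
    # Slack returns history newest first, so sort by timestamp before rendering.
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        text = msg.get("text", "")
        if "alert" in text.lower() or "incident" in text.lower():
            when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
            entries.append(f"- **{when:%H:%M:%S} UTC**: {text}")
    return "\n".join(entries)


if __name__ == "__main__":
    sample = [
        {"ts": "1700000060.000200", "text": "Incident resolved"},
        {"ts": "1700000030.000000", "text": "looking into it"},
        {"ts": "1700000000.000100", "text": "Alert: api_high_memory"},
    ]
    print(build_timeline(sample))
```

Keeping the function pure also makes it straightforward to feed it messages collected across several paginated API calls.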

Measuring Success: MTTR Reduction

┌─────────────────────────┬─────────────┬──────────────────┬────────────────┐
│ Phase                   │ Before      │ After            │ Improvement    │
├─────────────────────────┼─────────────┼──────────────────┼────────────────┤
│ Alert → On-call wakes   │ 2 min       │ 2 min            │ -              │
│ Find runbook            │ 10 min      │ 0 min (auto-post)│ -10 min        │
│ Understand issue        │ 15 min      │ 2 min (auto-ctx) │ -13 min        │
│ Execute fix             │ 15 min      │ 3 min (ChatOps)  │ -12 min        │
│ Verify resolution       │ 5 min       │ 2 min            │ -3 min         │
├─────────────────────────┼─────────────┼──────────────────┼────────────────┤
│ **Total MTTR**          │ **47 min**  │ **9 min**        │ **-81%**       │
└─────────────────────────┴─────────────┴──────────────────┴────────────────┘

With auto-remediation (P3 issues):
- Alert → Auto-fix → Resolution: 3 minutes
- No human intervention required
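The totals in the table can be re-derived from the per-phase numbers, which is also a convenient template for plugging in your own measurements. This is a small arithmetic sketch; the phase labels are shortened and `mttr_summary` is a hypothetical helper.

```python
# Minutes per phase, copied from the table.
BEFORE = {"alert": 2, "find runbook": 10, "understand": 15, "execute": 15, "verify": 5}
AFTER = {"alert": 2, "find runbook": 0, "understand": 2, "execute": 3, "verify": 2}


def mttr_summary(before: dict, after: dict) -> tuple:
    """Return (total before, total after, fractional reduction)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    reduction = (total_before - total_after) / total_before
    return total_before, total_after, reduction


if __name__ == "__main__":
    total_before, total_after, reduction = mttr_summary(BEFORE, AFTER)
    # Prints: 47 min -> 9 min (81% reduction)
    print(f"{total_before} min -> {total_after} min ({reduction:.0%} reduction)")
```

The 81% figure is (47 - 9) / 47, rounded to the nearest percent.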