Incident Response Automation: Runbooks as Code
Complete guide to automating incident response with executable runbooks, ChatOps workflows, PagerDuty/Slack integration, and automated remediation. Reduce MTTR by 70%.
Published: January 2, 2025
The Incident Response Problem
It's 3 AM. PagerDuty fires. Your on-call engineer wakes up, reads the alert, and remembers there's a runbook somewhere in Confluence. They search for 10 minutes, find an outdated doc from 2022, try the steps, and realize the commands don't work. They escalate to a senior engineer, who fixes it manually.
Mean Time To Resolution (MTTR): about 45 minutes. Most of that went to searching for runbooks and figuring out what to do.
With automated runbooks: Alert fires → Slack bot suggests remediation → Engineer clicks "Run" → Issue fixed in 3 minutes. MTTR reduced by 70%.
What You'll Implement
- Runbooks as Code: Executable scripts in Git, not wiki pages
- ChatOps Integration: Slack/MS Teams bot for incident commands
- Automated Remediation: Self-healing for common issues (disk full, pod restart)
- Context Gathering: Auto-fetch logs, metrics, traces on alert
- Postmortem Automation: Generate timeline from chat logs
Runbooks as Code: The Foundation
Traditional runbooks are wiki pages in Confluence, and they go stale quickly. Runbooks as code are executable scripts that live in Git, get tested in CI, and can run automatically.
    #!/bin/bash
    # runbooks/restart-api-pod.sh
    set -euo pipefail

    # Metadata for automation
    # @trigger alert:api_high_memory
    # @severity P2
    # @approval_required false

    echo "Checking API pod health..."

    # Get the name of the first running API pod
    POD=$(kubectl get pods -n production -l app=api \
      --field-selector=status.phase=Running \
      -o jsonpath='{.items[0].metadata.name}')

    if [ -z "$POD" ]; then
      echo "No running API pod found" >&2
      exit 1
    fi

    echo "Current memory usage:"
    kubectl top pod "$POD" -n production

    # Deleting the pod only acts as a restart if a Deployment or ReplicaSet recreates it
    echo "Restarting pod $POD..."
    kubectl delete pod "$POD" -n production

    echo "Waiting for new pod to be ready..."
    kubectl wait --for=condition=ready pod -l app=api -n production --timeout=60s

    echo "Pod restarted successfully"
    kubectl get pods -n production -l app=api
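The `@trigger`, `@severity`, and `@approval_required` comments in the script above only help if something reads them. Here is a minimal sketch of such a reader, assuming the metadata always appears as `# @key value` comment lines. The function name `parse_runbook_metadata` is hypothetical and not part of any existing tool.

```python
import re

# Matches header lines such as "# @severity P2" (format taken from the script above).
META_LINE = re.compile(r"^#\s*@(\w+)\s+(.+?)\s*$")


def parse_runbook_metadata(script_text: str) -> dict:
    """Collect @key value pairs from comment lines in a runbook script."""
    metadata = {}
    for line in script_text.splitlines():
        match = META_LINE.match(line.strip())
        if match:
            key, value = match.groups()
            metadata[key] = value
    return metadata


EXAMPLE = """#!/bin/bash
# @trigger alert:api_high_memory
# @severity P2
# @approval_required false
echo "restarting"
"""

if __name__ == "__main__":
    print(parse_runbook_metadata(EXAMPLE))
```

Values come back as plain strings, so a caller that needs a boolean for `approval_required` has to convert it explicitly.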
Runbook Repository Structure
    runbooks/
    ├── README.md                    # Index of all runbooks
    ├── common/
    │   ├── gather-context.sh        # Fetch logs, metrics, traces
    │   └── notify-slack.sh          # Post to incident channel
    ├── api/
    │   ├── high-memory.sh           # Restart pod when OOM
    │   ├── rate-limited.sh          # Scale up replicas
    │   └── database-connection.sh   # Reset connection pool
    ├── database/
    │   ├── high-cpu.sh              # Analyze slow queries
    │   ├── replication-lag.sh       # Force sync replica
    │   └── disk-space.sh            # Archive old logs
    └── network/
        ├── dns-resolution.sh        # Flush DNS cache
        └── ssl-certificate.sh       # Renew expiring cert

    # Each runbook:
    # 1. Is executable (chmod +x)
    # 2. Has metadata comments (@trigger, @severity)
    # 3. Outputs human-readable status
    # 4. Returns exit code 0 on success
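The first two conventions in the list above (executable bit, metadata comments) are easy to enforce mechanically. The sketch below is one way a CI job might check them, and it is an illustration rather than an existing tool: `check_runbook` and `check_tree` are hypothetical names, and it assumes every `*.sh` file under the tree counts as a runbook, helper scripts in `common/` included.

```python
import stat
import tempfile
from pathlib import Path

REQUIRED_TAGS = ("@trigger", "@severity")


def check_runbook(path: Path) -> list:
    """Return a list of convention violations for one runbook script."""
    problems = []
    if not path.stat().st_mode & stat.S_IXUSR:
        problems.append("not executable (chmod +x)")
    text = path.read_text(encoding="utf-8", errors="replace")
    for tag in REQUIRED_TAGS:
        if tag not in text:
            problems.append(f"missing {tag} metadata")
    return problems


def check_tree(root) -> dict:
    """Map each offending script (path relative to root) to its problems."""
    root = Path(root)
    failures = {}
    for path in sorted(root.rglob("*.sh")):
        problems = check_runbook(path)
        if problems:
            failures[str(path.relative_to(root))] = problems
    return failures


if __name__ == "__main__":
    # Demo against a throwaway tree; in CI you would pass the real runbooks/ path.
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp, "api", "high-memory.sh")
        good.parent.mkdir()
        good.write_text("#!/bin/bash\n# @trigger alert:api_high_memory\n# @severity P2\n")
        good.chmod(0o755)
        bad = Path(tmp, "api", "rate-limited.sh")
        bad.write_text("#!/bin/bash\necho scaling\n")
        bad.chmod(0o644)
        print(check_tree(tmp))
```

A CI step could call `check_tree("runbooks")` and fail the build when the returned dictionary is non-empty.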
ChatOps Integration with Slack
ChatOps means running operations through chat. When an alert fires, a bot posts suggested runbooks to Slack. The engineer clicks a button, and the bot runs the script and posts the output.
    # Deploy Slack bot (using Bolt framework)
    # bot.py
    import os
    import re
    import subprocess

    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=os.environ["SLACK_BOT_TOKEN"])

    # Accept names like "api/high-memory"; reject "..", absolute paths, and shell metacharacters
    RUNBOOK_NAME = re.compile(r"[a-z0-9-]+(/[a-z0-9-]+)*")


    @app.command("/runbook")
    def handle_runbook(ack, command, client):
        ack()
        runbook_name = command["text"].strip()
        runbook_path = f"runbooks/{runbook_name}.sh"

        if not RUNBOOK_NAME.fullmatch(runbook_name) or not os.path.isfile(runbook_path):
            client.chat_postMessage(
                channel=command["channel_id"],
                text=f"Runbook '{runbook_name}' not found",
            )
            return

        # Post confirmation with buttons
        client.chat_postMessage(
            channel=command["channel_id"],
            text=f"Ready to run runbook: `{runbook_name}`",
            blocks=[
                {
                    "type": "section",
                    "text": {"type": "mrkdwn", "text": f"Run `{runbook_name}`?"},
                },
                {
                    "type": "actions",
                    "elements": [
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Run"},
                            "style": "primary",
                            "action_id": f"run_runbook:{runbook_name}",
                        },
                        {
                            "type": "button",
                            "text": {"type": "plain_text", "text": "Cancel"},
                            "action_id": "cancel",
                        },
                    ],
                },
            ],
        )


    # Bolt matches action_id by exact string or compiled regex, not by glob
    @app.action(re.compile(r"^run_runbook:.+"))
    def handle_run_runbook(ack, action, client, body):
        ack()
        runbook_name = action["action_id"].split(":", 1)[1]
        channel = body["channel"]["id"]
        user_id = body["user"]["id"]

        # Validate again: the action_id comes back from the client
        if not RUNBOOK_NAME.fullmatch(runbook_name):
            client.chat_postMessage(channel=channel, text="Invalid runbook name")
            return

        client.chat_postMessage(
            channel=channel,
            text=f"Running `{runbook_name}` (started by <@{user_id}>)...",
        )

        try:
            result = subprocess.run(
                [f"runbooks/{runbook_name}.sh"],
                capture_output=True,
                text=True,
                timeout=300,
            )
        except (OSError, subprocess.TimeoutExpired) as e:
            client.chat_postMessage(channel=channel, text=f"Runbook failed to run: {e}")
            return

        # subprocess.run does not raise on a non-zero exit code, so check it explicitly
        if result.returncode == 0:
            text = f"Runbook completed\n```\n{result.stdout}\n```"
        else:
            text = (
                f"Runbook failed (exit code {result.returncode})\n"
                f"```\n{result.stdout}{result.stderr}\n```"
            )
        client.chat_postMessage(channel=channel, text=text)


    @app.action("cancel")
    def handle_cancel(ack):
        ack()  # Acknowledge so Slack does not show an error on the button


    if __name__ == "__main__":
        handler = SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
        handler.start()
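Runbook output can be long, and a wall of text in the incident channel is hard to read. One possible approach is to keep only the tail of the output before posting it. The sketch below assumes a budget of 3,000 characters, which is a number to tune and not an official Slack limit; `format_runbook_result` is a hypothetical helper, not part of Bolt.

```python
FENCE = "`" * 3  # Slack code-block delimiter
MAX_OUTPUT_CHARS = 3000  # assumed budget, not an official Slack limit


def format_runbook_result(name, returncode, stdout, stderr="", max_chars=MAX_OUTPUT_CHARS):
    """Build the Slack message text for a finished runbook."""
    status = "completed" if returncode == 0 else f"failed (exit code {returncode})"
    # On failure, stderr usually explains what went wrong, so include it.
    output = stdout if returncode == 0 else stdout + stderr
    # Neutralize code-block delimiters in the output so they cannot end the block early.
    output = output.strip().replace(FENCE, "'''") or "(no output)"
    if len(output) > max_chars:
        # Keep the tail: the last lines usually say how the run ended.
        output = "[... truncated ...]\n" + output[-max_chars:]
    return f"Runbook `{name}` {status}\n{FENCE}\n{output}\n{FENCE}"


if __name__ == "__main__":
    print(format_runbook_result("api/high-memory", 0, "Pod restarted successfully\n"))
```

The bot could pass the returned string as the `text` argument of `chat_postMessage` instead of building the message inline.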
PagerDuty Integration: Auto-Suggest Runbooks
When PagerDuty fires an alert, automatically post to Slack with relevant runbooks based on alert labels.
    # pagerduty-webhook-handler.py
    import os

    import requests
    from flask import Flask, request

    app = Flask(__name__)

    RUNBOOK_MAP = {
        "high_memory": "api/high-memory",
        "high_cpu": "database/high-cpu",
        "disk_full": "database/disk-space",
        "ssl_expiring": "network/ssl-certificate",
    }


    @app.route("/webhook", methods=["POST"])
    def handle_pagerduty_webhook():
        # Assumes the PagerDuty V3 webhook payload shape:
        # {"event": {"event_type": ..., "data": {...}}}.
        # Check the field names against the webhook version you have configured.
        event = (request.get_json(silent=True) or {}).get("event", {})

        if event.get("event_type") == "incident.triggered":
            incident = event.get("data", {})
            incident_key = incident.get("incident_key") or ""

            # Find matching runbook
            runbook = None
            for key, rb in RUNBOOK_MAP.items():
                if key in incident_key:
                    runbook = rb
                    break

            lines = [
                f"*Incident: {incident.get('title', 'unknown')}*",
                f"Severity: {incident.get('urgency', 'unknown')}",
                f"Service: {incident.get('service', {}).get('summary', 'unknown')}",
                "",
                "Context:",
                f"• Incident URL: {incident.get('html_url', 'n/a')}",
                f"• Triggered: {incident.get('created_at', 'n/a')}",
            ]
            if runbook:
                lines += [
                    "",
                    f"*Suggested Runbook:* `{runbook}`",
                    f"Run with: `/runbook {runbook}`",
                ]

            # Post to Slack
            response = requests.post(
                "https://slack.com/api/chat.postMessage",
                headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
                json={"channel": "#incidents", "text": "\n".join(lines)},
                timeout=10,
            )
            # Slack answers HTTP 200 even for API errors; the "ok" field carries the result
            if not response.json().get("ok"):
                app.logger.error("Slack post failed: %s", response.text)

        return {"status": "ok"}


    if __name__ == "__main__":
        app.run(port=5000)
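A webhook endpoint that suggests remediation should only trust requests that really come from PagerDuty. The sketch below shows one way to check that, assuming PagerDuty's V3 webhook signing scheme as I understand it: an `X-PagerDuty-Signature` header carrying one or more comma-separated `v1=<hex>` values, each an HMAC-SHA256 of the raw request body. Verify the header name and scheme against the current PagerDuty documentation before relying on it. `is_valid_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Return True if any v1= signature in the header matches the body."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = f"v1={digest}".encode("ascii")
    candidates = [part.strip().encode("utf-8") for part in signature_header.split(",")]
    # compare_digest avoids leaking timing information about the expected value.
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


if __name__ == "__main__":
    secret = "example-secret"  # hypothetical; use the secret from your webhook subscription
    body = b'{"event": {"event_type": "incident.triggered"}}'
    good = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    print(is_valid_signature(secret, body, good))         # True
    print(is_valid_signature(secret, body + b" ", good))  # False
```

In a Flask handler this would typically be called with `request.get_data()` and `request.headers.get("X-PagerDuty-Signature", "")` before the JSON is parsed, returning a 401 when the check fails.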
Auto-Remediation: Self-Healing Infrastructure
For common, low-risk issues, don't wait for human intervention. Run runbooks automatically.
    # Alertmanager config with auto-remediation
    # alertmanager.yml
    route:
      receiver: 'slack'
      routes:
        - match:
            severity: P3
            auto_remediate: 'true'
          receiver: 'auto-remediate'
          continue: true  # Keep evaluating the sibling routes below
        - receiver: 'slack'  # Catch-all, so remediated alerts also reach Slack

    receivers:
      - name: 'auto-remediate'
        webhook_configs:
          - url: 'http://runbook-executor:8080/auto-remediate'
            send_resolved: false
      - name: 'slack'
        slack_configs:
          - api_url: '<webhook_url>'
            channel: '#alerts'

    # runbook-executor service
    # auto-remediate.py
    import logging
    import os
    import subprocess

    import requests
    from flask import Flask, request

    app = Flask(__name__)
    logging.basicConfig(level=logging.INFO)

    AUTO_REMEDIATE_MAP = {
        "api_high_memory": "api/high-memory.sh",
        "disk_space_warning": "database/disk-space.sh",
        "pod_crashloop": "common/restart-pod.sh",
    }


    def post_to_slack(text):
        # Assumes a Slack incoming webhook URL in SLACK_WEBHOOK_URL (an illustrative name)
        requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)


    @app.route('/auto-remediate', methods=['POST'])
    def auto_remediate():
        # Alertmanager may batch several alerts per request; this handles only the first
        alert = request.json['alerts'][0]
        alert_name = alert['labels']['alertname']

        if alert_name in AUTO_REMEDIATE_MAP:
            runbook = AUTO_REMEDIATE_MAP[alert_name]
            logging.info(f"Auto-remediating {alert_name} with {runbook}")

            try:
                result = subprocess.run(
                    [f"runbooks/{runbook}"],
                    capture_output=True,
                    text=True,
                    timeout=300
                )

                # Post result to Slack
                post_to_slack(
                    f"Auto-remediation for {alert_name}\n"
                    f"Status: {'Success' if result.returncode == 0 else 'Failed'}\n"
                    f"```{result.stdout}```"
                )

                return {'status': 'executed', 'exit_code': result.returncode}
            except Exception as e:
                logging.error(f"Auto-remediation failed: {e}")
                post_to_slack(f"Auto-remediation failed: {str(e)}")
                return {'status': 'error', 'message': str(e)}, 500

        return {'status': 'no_runbook_found'}, 404


    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)  # Port matches the webhook URL in alertmanager.yml
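Alertmanager groups alerts, so a single webhook call can carry several of them, some already resolved. The sketch below shows how the runbook lookup might be applied to every firing alert in a payload. It is a dry run that only plans and executes nothing. `plan_remediation` is a hypothetical name, and the sample payload is trimmed to the fields used here.

```python
AUTO_REMEDIATE_MAP = {
    "api_high_memory": "api/high-memory.sh",
    "disk_space_warning": "database/disk-space.sh",
    "pod_crashloop": "common/restart-pod.sh",
}


def plan_remediation(payload: dict, runbook_map=AUTO_REMEDIATE_MAP) -> list:
    """Return (alertname, runbook) pairs for firing alerts that have a mapped runbook."""
    plan = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # Skip alerts that are already resolved
        name = alert.get("labels", {}).get("alertname", "")
        if name in runbook_map:
            plan.append((name, runbook_map[name]))
    return plan


# Trimmed sample in the shape of an Alertmanager webhook payload.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "api_high_memory", "severity": "P3"}},
        {"status": "firing", "labels": {"alertname": "unknown_alert", "severity": "P3"}},
        {"status": "resolved", "labels": {"alertname": "disk_space_warning", "severity": "P3"}},
    ],
}

if __name__ == "__main__":
    for alert_name, runbook in plan_remediation(SAMPLE_PAYLOAD):
        print(f"{alert_name} -> runbooks/{runbook}")
```

Separating the planning step from execution also makes the mapping easy to unit test without touching a cluster.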
Context Gathering: Auto-Fetch Diagnostics
When an incident fires, automatically gather context: recent logs, metrics, traces, config changes.
    #!/bin/bash
    # runbooks/common/gather-context.sh
    # Usage: gather-context.sh <service> <namespace>

    SERVICE=${1:?usage: gather-context.sh <service> <namespace>}
    NAMESPACE=${2:?usage: gather-context.sh <service> <namespace>}
    TIME_WINDOW="15m"

    echo "Gathering context for $SERVICE in $NAMESPACE..."

    # Recent logs (errors only)
    # printf is used for headings because plain echo does not expand \n in bash
    printf '\nRecent ERROR logs:\n'
    kubectl logs -n "$NAMESPACE" -l "app=$SERVICE" --tail=50 --since="$TIME_WINDOW" \
      | grep ERROR || echo "(no ERROR lines found)"

    # Pod status
    printf '\nPod status:\n'
    kubectl get pods -n "$NAMESPACE" -l "app=$SERVICE"

    # Resource usage
    printf '\nResource usage:\n'
    kubectl top pods -n "$NAMESPACE" -l "app=$SERVICE"

    # Recent deployments
    printf '\nRecent deployments:\n'
    kubectl rollout history "deployment/$SERVICE" -n "$NAMESPACE" | tail -5

    # Prometheus query: error rate over the time window.
    # -G with --data-urlencode keeps curl from treating {} and [] as URL globs.
    printf '\nError rate (last %s):\n' "$TIME_WINDOW"
    curl -sG "http://prometheus:9090/api/v1/query" \
      --data-urlencode "query=rate(http_requests_total{service='$SERVICE',status=~'5..'}[$TIME_WINDOW])" \
      | jq -r '.data.result[0].value[1] // "no data"'

    # Recent trace (if Tempo available)
    printf '\nRecent slow trace:\n'
    curl -s "http://tempo:3100/api/search?service=$SERVICE&minDuration=1s&limit=1" \
      | jq -r '.traces[0].traceID // "no trace found"'

    printf '\nContext gathering complete\n'
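The Prometheus step is the easiest one to get subtly wrong, because PromQL is full of characters that need URL encoding. As an illustration, the same query could be built and parsed in Python like this. The sketch assumes the standard instant-query response shape (`data.result[].value`); the host name comes from the script above, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, service: str, window: str = "15m") -> str:
    """Build an instant-query URL with the PromQL expression safely encoded."""
    promql = f"rate(http_requests_total{{service='{service}',status=~'5..'}}[{window}])"
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def first_sample_value(response_body: str):
    """Return the first series value as a float, or None if there is no data."""
    payload = json.loads(response_body)
    results = payload.get("data", {}).get("result", [])
    if payload.get("status") != "success" or not results:
        return None
    # Each instant-vector sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


# Hand-written sample response, not real measurements.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"service": "api"}, "value": [1735786800.0, "0.25"]}],
    },
})

if __name__ == "__main__":
    print(build_query_url("http://prometheus:9090", "api"))
    print(first_sample_value(SAMPLE_RESPONSE))
```

Returning `None` for an empty result keeps "no errors recorded" distinguishable from "error rate is zero".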
Postmortem Automation
Generate postmortem templates automatically from Slack chat logs and timeline.
    # generate-postmortem.py
    import os
    from datetime import datetime

    from slack_sdk import WebClient

    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


    def generate_postmortem(incident_channel, start_time, resolution_time):
        """Draft a postmortem; start_time and resolution_time are Unix timestamps in seconds."""
        # Fetch messages from the incident channel (one page only; paginate for long incidents)
        response = client.conversations_history(
            channel=incident_channel,
            oldest=str(start_time),
            latest=str(resolution_time),
            limit=200,
        )

        timeline = []
        actions_taken = []

        # Slack returns newest messages first; reverse for a chronological timeline
        for msg in reversed(response["messages"]):
            timestamp = datetime.fromtimestamp(float(msg["ts"]))
            text = msg.get("text", "")

            # Extract timeline events
            if "alert" in text.lower() or "incident" in text.lower():
                timeline.append(f"- **{timestamp.strftime('%H:%M:%S')}**: {text}")

            # Extract actions
            if "/runbook" in text or "ran" in text.lower():
                actions_taken.append(f"- {text}")

        # Generate markdown postmortem (one list entry per line)
        postmortem = "\n".join([
            f"# Postmortem: {incident_channel}",
            "",
            f"**Date**: {datetime.now().strftime('%Y-%m-%d')}",
            f"**Duration**: {(resolution_time - start_time) / 60:.0f} minutes",
            "**Severity**: P1",
            "**Status**: Resolved",
            "",
            "## Summary",
            "[Brief description of what happened]",
            "",
            "## Timeline",
            "\n".join(timeline),
            "",
            "## Root Cause",
            "[To be filled in by incident commander]",
            "",
            "## Actions Taken",
            "\n".join(actions_taken),
            "",
            "## What Went Well",
            "- ChatOps commands executed successfully",
            "- Auto-remediation contained the issue",
            "- Context gathering provided immediate diagnostics",
            "",
            "## What Went Wrong",
            "[To be filled in]",
            "",
            "## Action Items",
            "- [ ] [Action item 1]",
            "- [ ] [Action item 2]",
            "",
            "## Lessons Learned",
            "[To be filled in]",
            "",
        ])

        # Save locally; committing the file to Git is a separate step
        os.makedirs("postmortems", exist_ok=True)
        with open(f"postmortems/{incident_channel}.md", "w") as f:
            f.write(postmortem)

        return postmortem
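The timeline is the part of a postmortem that benefits most from automation, and it can be tested without calling Slack at all. Below is a small sketch of that step in isolation: it takes messages in the shape Slack returns them (a `ts` string and a `text` field), orders them oldest first, and renders them in UTC. `build_timeline` and the sample messages are made up for illustration.

```python
from datetime import datetime, timezone


def build_timeline(messages) -> str:
    """Render Slack-style messages as a chronological markdown list (UTC)."""
    ordered = sorted(messages, key=lambda m: float(m["ts"]))
    lines = []
    for msg in ordered:
        when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        lines.append(f"- **{when:%H:%M:%S} UTC**: {msg.get('text', '')}")
    return "\n".join(lines)


# Sample input, newest first, as the Slack API would return it.
SAMPLE_MESSAGES = [
    {"ts": "1735786980.000200", "text": "Ran /runbook api/high-memory"},
    {"ts": "1735786800.000100", "text": "Alert fired: api_high_memory"},
]

if __name__ == "__main__":
    print(build_timeline(SAMPLE_MESSAGES))
```

Using an explicit time zone avoids timelines that shift depending on where the script happens to run.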
Measuring Success: MTTR Reduction
    +--------------------------+----------+-------------------+-------------+
    | Phase                    | Before   | After             | Improvement |
    +--------------------------+----------+-------------------+-------------+
    | Alert → On-call wakes    | 2 min    | 2 min             | -           |
    | Find runbook             | 10 min   | 0 min (auto-post) | -10 min     |
    | Understand issue         | 15 min   | 2 min (auto-ctx)  | -13 min     |
    | Execute fix              | 15 min   | 3 min (ChatOps)   | -12 min     |
    | Verify resolution        | 5 min    | 2 min             | -3 min      |
    +--------------------------+----------+-------------------+-------------+
    | Total MTTR               | 47 min   | 9 min             | -81%        |
    +--------------------------+----------+-------------------+-------------+

    With auto-remediation (P3 issues):
    - Alert → Auto-fix → Resolution: 3 minutes
    - No human intervention required
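The totals in the table follow directly from the per-phase numbers. The short calculation below reproduces them using the figures from the table; the same arithmetic works for your own before and after measurements.

```python
# (phase, minutes before, minutes after), copied from the table above.
PHASES = [
    ("Alert to on-call awake", 2, 2),
    ("Find runbook", 10, 0),
    ("Understand issue", 15, 2),
    ("Execute fix", 15, 3),
    ("Verify resolution", 5, 2),
]

before_total = sum(before for _, before, _ in PHASES)
after_total = sum(after for _, _, after in PHASES)
reduction_pct = (before_total - after_total) / before_total * 100

if __name__ == "__main__":
    print(f"Before: {before_total} min, after: {after_total} min")
    print(f"MTTR reduction: {reduction_pct:.0f}%")
```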
Best Practices
- Test runbooks in CI: Run in staging on every PR
- Version control: All runbooks in Git, require PR reviews
- Approval gates: P1 incidents require human approval before auto-remediation
- Audit logs: Track who ran what, when (compliance requirement)
- Graceful degradation: If automation fails, fall back to manual steps
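Two of the practices above, approval gates and audit logs, come down to very little code. The sketch below is one possible shape for both and assumes severities are written as "P1" through "P4", as in the runbook metadata. `needs_approval` and `audit_record` are hypothetical names, and where the audit line is stored (file, log pipeline, database) is left open.

```python
import json
from datetime import datetime, timezone

SEVERITIES_REQUIRING_APPROVAL = {"P1"}


def needs_approval(severity: str, approval_required: bool = False) -> bool:
    """A runbook needs a human if its metadata says so or the incident is P1."""
    return approval_required or severity.upper() in SEVERITIES_REQUIRING_APPROVAL


def audit_record(user: str, runbook: str, exit_code: int, now=None) -> str:
    """Return one JSON line recording who ran what, when, and how it ended."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "user": user,
            "runbook": runbook,
            "exit_code": exit_code,
        },
        sort_keys=True,
    )


if __name__ == "__main__":
    print(needs_approval("P1"))
    print(needs_approval("P3"))
    print(audit_record("alice", "api/high-memory", 0))
```

Writing one JSON object per line keeps the audit trail easy to append to and easy to search later.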
Start this week: Convert your top 3 runbooks to executable scripts. Set up a Slack bot. Auto-remediate one P3 alert. Watch your MTTR drop.
HostingX Solutions
Expert DevOps and automation services accelerating B2B delivery and operations.